Keywords: Matplotlib | Stacked Bar Plot | Bottom Parameter Calculation | NumPy Arrays | Data Visualization
Abstract: This article analyzes a common error in computing the bottom parameter when creating stacked bar plots with Matplotlib. Through a concrete case study, it shows the display anomalies that occur when bottom values are not accumulated across series. It explains why the obvious correction fails with Python lists, which concatenate under the + operator instead of adding element-wise as NumPy arrays do, and presents three solutions: NumPy array conversion, list comprehension summation, and a custom plotting function. It also shows the simpler implementation available through the Pandas library, as a reference for projects that already use it.
Problem Background and Phenomenon Description
When creating stacked bar plots with Matplotlib, a common yet easily overlooked issue is the correct calculation of the bottom parameter. A stacked bar plot places multiple data series on top of one another, so the total height of each bar is the sum of all series for that category. If a bottom value is computed incorrectly, bars overlap or sit at the wrong height. The error only becomes visible once a plot has three or more series, because the bottom of the second series is simply the first series itself.
Case Study Analysis
Consider a typical scenario where a user needs to plot a stacked bar chart with four data series, expecting each vertical stack to sum to 100. The original code uses Python lists to store data and plots through layer-by-layer stacking:
p1 = plt.bar(ind, dataset[1], width, color='r')
p2 = plt.bar(ind, dataset[2], width, bottom=dataset[1], color='b')
p3 = plt.bar(ind, dataset[3], width, bottom=dataset[2], color='g')
p4 = plt.bar(ind, dataset[4], width, bottom=dataset[3], color='c')
This implementation has a fundamental issue: starting from the third series, the bottom parameter is set only to the value of the previous series, not the cumulative sum of all preceding series. For example, for series 3, the bottom should be the sum of series 1 and 2, not just series 2. This error causes abnormal plot displays, particularly at certain tick positions (such as X-axis ticks 65, 70, 75, 80) where completely unreasonable stacking results appear.
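To make the error concrete, the following minimal sketch compares the bottoms the original code uses with the bottoms that stacking requires, for a single category. The four heights are taken from the 65.0 category of the sample data used later in this article; the variable names are illustrative only.

```python
# Heights of the four series at one category (the 65.0 column of the sample data)
heights = [25.0, 50.0, 12.5, 12.5]

# What the original code does: each bottom is just the previous series' height
wrong_bottoms = [0.0] + heights[:-1]

# What stacking requires: each bottom is the running total of all earlier heights
correct_bottoms = []
running_total = 0.0
for h in heights:
    correct_bottoms.append(running_total)
    running_total += h

# Top edge of each bar = bottom + height
wrong_tops = [b + h for b, h in zip(wrong_bottoms, heights)]
correct_tops = [b + h for b, h in zip(correct_bottoms, heights)]

print(wrong_tops)    # [25.0, 75.0, 62.5, 25.0]
print(correct_tops)  # [25.0, 75.0, 87.5, 100.0]
```

With the wrong bottoms, the third and fourth bars are drawn inside the earlier bars and the stack never reaches 100. With cumulative bottoms, each bar starts where the previous one ends.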
Root Cause Analysis
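The first defect is the one described above: each bottom must be a cumulative sum. The natural correction is to write bottom=dataset[1] + dataset[2], and this is where the deeper cause appears: Python lists and NumPy arrays behave differently under the + operator. Adding two Python lists concatenates them rather than adding them element by element. For example: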
list1 = [1, 2, 3]
list2 = [4, 5, 6]
result = list1 + list2 # Results in [1, 2, 3, 4, 5, 6], not [5, 7, 9]
To achieve element-wise addition, lists must be converted to NumPy arrays or other appropriate accumulation methods must be used.
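The contrast can be checked directly. This short sketch reuses list1 and list2 from the example above; the names concatenated and elementwise are introduced here for illustration.

```python
import numpy as np

list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Plain lists: + joins the two lists end to end
concatenated = list1 + list2

# NumPy arrays: + adds matching positions
elementwise = np.array(list1) + np.array(list2)

print(concatenated)          # [1, 2, 3, 4, 5, 6]
print(elementwise.tolist())  # [5, 7, 9]
```

The concatenated result has twice as many entries as there are bars, so it cannot serve as a bottom value for a series of the original length.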
Solution 1: Using NumPy Array Conversion
The most direct solution is to convert data to NumPy arrays, leveraging their element-wise addition capabilities:
import numpy as np
# Convert each series to a NumPy array so that + adds element-wise
dataset1 = np.array(dataset[1])
dataset2 = np.array(dataset[2])
dataset3 = np.array(dataset[3])
dataset4 = np.array(dataset[4])
p1 = plt.bar(ind, dataset1, width, color='r')
p2 = plt.bar(ind, dataset2, width, bottom=dataset1, color='b')
p3 = plt.bar(ind, dataset3, width, bottom=dataset1+dataset2, color='g')
p4 = plt.bar(ind, dataset4, width, bottom=dataset1+dataset2+dataset3, color='c')
This method ensures that the bottom of each series is the cumulative sum of all preceding series, resulting in correct stacking effects.
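Writing out dataset1+dataset2+dataset3 by hand does not scale to many series. One possible generalization, sketched here under the assumption that all series have the same length, uses np.cumsum to compute every bottom at once. The helper name stacked_bottoms is hypothetical and not part of Matplotlib or NumPy.

```python
import numpy as np

def stacked_bottoms(series):
    """Return the bottom of every series: the sum of all series before it."""
    data = np.asarray(series, dtype=float)
    tops = np.cumsum(data, axis=0)         # running total down the series axis
    first = np.zeros((1, data.shape[1]))   # the first series starts at zero
    return np.vstack([first, tops[:-1]])   # shift the running totals down one row

data = [
    [0.0, 25.0, 48.94, 83.02, 66.67],
    [0.0, 50.0, 36.17, 11.32, 26.67],
    [0.0, 12.5, 10.64, 3.77, 4.45],
    [100.0, 12.5, 4.26, 1.89, 2.22],
]
bottoms = stacked_bottoms(data)
print(bottoms[:, 1])  # bottoms for the 65.0 category: 0, 25, 75, 87.5
```

Row i of bottoms can then be passed as the bottom argument of the plt.bar call that draws series i.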
Solution 2: List Comprehension Summation
If you prefer to keep the data as plain Python lists, the built-in zip and sum functions inside a list comprehension can perform the element-wise summation:
p1 = plt.bar(ind, dataset[1], width, color='r')
p2 = plt.bar(ind, dataset[2], width, bottom=dataset[1], color='b')
p3 = plt.bar(ind, dataset[3], width, bottom=[sum(x) for x in zip(dataset[1], dataset[2])], color='g')
p4 = plt.bar(ind, dataset[4], width, bottom=[sum(x) for x in zip(dataset[1], dataset[2], dataset[3])], color='c')
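Although slightly more verbose, this method uses only Python built-ins and works directly on the lists without any conversion. Note that Matplotlib itself depends on NumPy, so the choice is a matter of style rather than a way to avoid installing a dependency.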
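The same idea can be written as a loop so that the zip expression does not grow with every added series. This is a sketch using only built-ins; the function name running_bottoms is hypothetical, and it assumes every series has the same length.

```python
def running_bottoms(series):
    """Return one list of bottoms per series, using only Python built-ins."""
    bottoms = []
    current = [0.0] * len(series[0])   # the first series starts at zero
    for row in series:
        bottoms.append(current)
        # Build a new list so earlier entries in `bottoms` are not modified
        current = [b + v for b, v in zip(current, row)]
    return bottoms

series = [
    [0.0, 25.0],
    [0.0, 50.0],
    [0.0, 12.5],
    [100.0, 12.5],
]
for row in running_bottoms(series):
    print(row)
# [0.0, 0.0]
# [0.0, 25.0]
# [0.0, 75.0]
# [0.0, 87.5]
```

Each returned list lines up with one plt.bar call, in the same order as the input series.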
Solution 3: Custom Plotting Function
For scenarios requiring frequent stacked bar plot creation, a general-purpose plotting function can be encapsulated. Below is a fully functional implementation example:
import numpy as np
import matplotlib.pyplot as plt

def plot_stacked_bar(data, series_labels, category_labels=None, show_values=False,
                     value_format="{}", y_label=None, colors=None, grid=True, reverse=False):
    """General function for plotting stacked bar charts

    Parameters:
    data -- 2D data array, each row represents a data series
    series_labels -- List of series labels for legend display
    category_labels -- List of category labels for X-axis ticks
    show_values -- Whether to display value labels on bars
    value_format -- Format string for value labels ({}-style needs Matplotlib 3.7+)
    y_label -- Y-axis label
    colors -- List of colors
    grid -- Whether to display grid
    reverse -- Whether to reverse the left-to-right order of the categories
    """
    data = np.asarray(data, dtype=float)
    ny = data.shape[1]
    ind = list(range(ny))
    cum_size = np.zeros(ny)  # running bottom for the next series

    if reverse:
        # Flip the category axis and keep the tick labels in step with it
        data = np.flip(data, axis=1)
        if category_labels is not None:
            category_labels = list(reversed(category_labels))

    for i, row_data in enumerate(data):
        color = colors[i] if colors is not None else None
        p = plt.bar(ind, row_data, bottom=cum_size, label=series_labels[i], color=color)
        cum_size += row_data  # the next series starts where this one ends
        if show_values:
            plt.bar_label(p, label_type='center', fmt=value_format)

    if category_labels is not None:
        plt.xticks(ind, category_labels)
    if y_label:
        plt.ylabel(y_label)
    plt.legend()
    if grid:
        plt.grid()
Usage example:
plt.figure(figsize=(10, 6))
series_labels = ['a', 'b', 'c', 'd']
category_labels = ['60.0', '65.0', '70.0', '75.0', '80.0']
data = [
    [0.0, 25.0, 48.94, 83.02, 66.67],
    [0.0, 50.0, 36.17, 11.32, 26.67],
    [0.0, 12.5, 10.64, 3.77, 4.45],
    [100.0, 12.5, 4.26, 1.89, 2.22]
]
plot_stacked_bar(
    data,
    series_labels,
    category_labels=category_labels,
    show_values=True,
    value_format="{:.1f}",
    colors=['red', 'blue', 'green', 'cyan'],
    y_label="Percentage (%)"
)
plt.tight_layout()
plt.show()
Pandas Simplification Approach
For scenarios already using Pandas for data processing, its built-in stacked bar plot functionality can be leveraged:
import pandas as pd
import matplotlib.pyplot as plt
# Create DataFrame
data = {
    'a': [0.0, 25.0, 48.94, 83.02, 66.67],
    'b': [0.0, 50.0, 36.17, 11.32, 26.67],
    'c': [0.0, 12.5, 10.64, 3.77, 4.45],
    'd': [100.0, 12.5, 4.26, 1.89, 2.22]
}
index = ['60.0', '65.0', '70.0', '75.0', '80.0']
df = pd.DataFrame(data, index=index)
# Plot stacked bar chart
ax = df.plot(kind='bar', stacked=True, figsize=(10, 6))
ax.set_ylabel('Percentage (%)')
ax.legend(title='Series', bbox_to_anchor=(1.0, 1), loc='upper left')
plt.tight_layout()
plt.show()
Pandas' plot method automatically handles bottom parameter calculations, significantly simplifying code implementation.
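To see which values such a stacked plot implies for non-negative data like this, the cumulative sums can be computed on the DataFrame itself. The sketch below only inspects numbers and draws nothing; the names tops and bottoms are illustrative and are not part of the Pandas API.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'a': [0.0, 25.0, 48.94, 83.02, 66.67],
        'b': [0.0, 50.0, 36.17, 11.32, 26.67],
        'c': [0.0, 12.5, 10.64, 3.77, 4.45],
        'd': [100.0, 12.5, 4.26, 1.89, 2.22],
    },
    index=['60.0', '65.0', '70.0', '75.0', '80.0'],
)

tops = df.cumsum(axis=1)  # top edge of each series within every category
bottoms = tops - df       # bottom edge: the top minus the series' own height

print(bottoms.loc['65.0'].tolist())  # [0.0, 25.0, 75.0, 87.5]
```

These are the same cumulative bottoms that the manual Matplotlib solutions compute series by series.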
Best Practice Recommendations
1. Data Preprocessing: Convert data to NumPy arrays before plotting to ensure correct element-wise operations.
2. Bottom Parameter Validation: In complex scenarios, calculate and verify that cumulative sums for each category meet expectations.
3. Code Maintainability: Encapsulate frequently used stacked bar plots as functions or classes to improve code reusability.
4. Performance Considerations: For large datasets, NumPy array operations are generally more efficient than pure Python list operations.
5. Visualization Optimization: Appropriately set colors, labels, and layouts to ensure chart readability and aesthetics.
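The validation step in recommendation 2 can be sketched as a small check run before plotting. The function name check_stack_totals and the tolerance of 0.05 are assumptions chosen for this example; the tolerance allows for the rounding in the sample data, where some columns sum to 100.01.

```python
import numpy as np

def check_stack_totals(data, expected=100.0, tolerance=0.05):
    """Return the indices of categories whose series do not sum to `expected`."""
    totals = np.asarray(data, dtype=float).sum(axis=0)  # one total per category
    off = np.abs(totals - expected) > tolerance
    return np.flatnonzero(off).tolist()

data = [
    [0.0, 25.0, 48.94, 83.02, 66.67],
    [0.0, 50.0, 36.17, 11.32, 26.67],
    [0.0, 12.5, 10.64, 3.77, 4.45],
    [100.0, 12.5, 4.26, 1.89, 2.22],
]
bad_categories = check_stack_totals(data)
print(bad_categories)  # [] when every category sums to about 100
```

An empty list means the data is consistent, so any remaining display problem lies in how the bottoms are computed rather than in the data.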
Conclusion
The key to correctly plotting stacked bar charts with Matplotlib lies in accurate bottom parameter calculations. This paper analyzes common error causes through specific cases and provides multiple solutions. Whether using NumPy array conversion, list comprehension summation, or custom function encapsulation, the core principle is ensuring each series' bottom is the cumulative sum of all preceding series. For Pandas users, leveraging its built-in functionality further simplifies implementation. Understanding these technical details helps create accurate and aesthetically pleasing stacked bar chart visualizations.