A Comprehensive Guide to Creating Stacked Bar Charts with Pandas and Matplotlib

Keywords: Python | Pandas | Matplotlib | Stacked Bar Chart | Data Visualization

Abstract: This article provides a detailed tutorial on creating stacked bar charts using Python's Pandas and Matplotlib libraries. Through a practical case study, it demonstrates the complete workflow from raw data preprocessing to final visualization, including data reshaping with groupby and unstack methods. The article delves into key technical aspects such as data grouping, pivoting, and missing value handling, offering complete code examples and best practice recommendations to help readers master this essential data visualization technique.

Introduction and Problem Context

Stacked bar charts are a common way to show how a total breaks down across categories. This article works through an actual technical Q&A case to explain how to create stacked bar charts using Python's Pandas and Matplotlib libraries. The original problem was to generate a stacked bar chart from CSV data containing site names and fault types (ABUSE/NFF).

Data Preparation and Preprocessing

First, we need to understand the structure of the original data. The sample data contains two columns: Site Name and Abuse/NFF (fault type). The data may include missing values (represented by "-"), which is common in real-world datasets. Proper data preprocessing is the first step toward creating effective visualizations.

Let's create a simulated DataFrame to demonstrate the processing workflow:

import pandas as pd
import matplotlib.pyplot as plt

# Create sample data
data = {
    'Site Name': ['NORTH ACTON', 'WASHINGTON', 'WASHINGTON', 'BELFAST', 'CROYDON'],
    'Abuse/NFF': ['ABUSE', '-', 'NFF', '-', '-']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
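
Since the original problem started from CSV data, the "-" placeholder can also be converted to a real missing value while loading. The following is a minimal sketch, assuming CSV content that mirrors the sample DataFrame; the csv_text string and the df_csv name are made up for illustration, and a real script would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the sample DataFrame;
# in practice this would be a file path passed to pd.read_csv.
csv_text = """Site Name,Abuse/NFF
NORTH ACTON,ABUSE
WASHINGTON,-
WASHINGTON,NFF
BELFAST,-
CROYDON,-
"""

# na_values="-" tells read_csv to parse the "-" placeholder as NaN
df_csv = pd.read_csv(io.StringIO(csv_text), na_values="-")

print(df_csv)
print(df_csv["Abuse/NFF"].isna().sum())  # number of rows without a fault type
```

Loading the placeholder as NaN up front means later steps do not have to special-case the "-" string.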

Data Reshaping and Grouped Statistics

The key step in creating stacked bar charts is reshaping the data into a format suitable for plotting. We need to count the number of different fault types for each site. Pandas' groupby method combined with unstack can efficiently accomplish this task:

# Method 1: Data reshaping using groupby and unstack
df_grouped = df.groupby(['Site Name', 'Abuse/NFF'])['Site Name'].count().unstack('Abuse/NFF').fillna(0)
print("\nReshaped data:")
print(df_grouped)

This code performs the following key operations:

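groupby(['Site Name', 'Abuse/NFF']): Groups rows by site name and fault type
['Site Name'].count(): Counts the records in each group
.unstack('Abuse/NFF'): Pivots the fault types from the row index into columns
.fillna(0): Replaces the NaN that unstack leaves for site and fault-type combinations with no records by 0 (the counts become floats as a result)

Note that "-" is an ordinary string to Pandas, not a missing value, so at this stage it gets its own column in the reshaped table and its own segment in the chart.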
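
As a cross-check on the reshaping step, the same site-by-fault-type table can be built with pd.crosstab. This is an alternative sketch, not the method used in the original answer; the df_cross name is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Site Name': ['NORTH ACTON', 'WASHINGTON', 'WASHINGTON', 'BELFAST', 'CROYDON'],
    'Abuse/NFF': ['ABUSE', '-', 'NFF', '-', '-'],
})

# groupby + unstack, as in Method 1
df_grouped = (
    df.groupby(['Site Name', 'Abuse/NFF'])['Site Name']
      .count()
      .unstack('Abuse/NFF')
      .fillna(0)
)

# crosstab builds the same count table in a single call
df_cross = pd.crosstab(df['Site Name'], df['Abuse/NFF'])

print(df_cross)
# The counts agree; crosstab keeps integers, while fillna(0) leaves floats
print((df_grouped.values == df_cross.values).all())
```

Either table can be passed to plot(kind='bar', stacked=True); crosstab is mainly a shorter spelling when only counts are needed.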

Visualization Implementation

Once the data is reshaped, Pandas' built-in plot method, which draws with Matplotlib, makes creating the stacked bar chart straightforward:

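# Create stacked bar chart; figsize is passed to plot(), which creates its own
# figure (a separate plt.figure() call would only leave an empty extra figure)
df_grouped.plot(kind='bar', stacked=True, colormap='Set2', figsize=(10, 6))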

# Add chart decorations
plt.title('Site Fault Type Distribution', fontsize=14)
plt.xlabel('Site Name', fontsize=12)
plt.ylabel('Fault Count', fontsize=12)
plt.legend(title='Fault Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
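
To see what stacking amounts to, the same kind of chart can be drawn with plain Matplotlib by giving each series a bottom offset equal to the running total of the series drawn before it. This is a rough sketch of the idea, not a description of Pandas internals; the counts table is typed in by hand for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative counts per site and fault type
counts = pd.DataFrame(
    {'ABUSE': [0, 1, 0], 'NFF': [0, 0, 1]},
    index=['BELFAST', 'NORTH ACTON', 'WASHINGTON'],
)

fig, ax = plt.subplots(figsize=(10, 6))
bottom = np.zeros(len(counts))  # running height of each stacked bar

for fault_type in counts.columns:
    values = counts[fault_type].to_numpy()
    # Each series starts where the previously drawn series ended
    ax.bar(counts.index, values, bottom=bottom, label=fault_type)
    bottom += values

ax.set_xlabel('Site Name')
ax.set_ylabel('Fault Count')
ax.legend(title='Fault Type')
# Call plt.show() or fig.savefig(...) to view the result
```

After the loop, bottom holds the full height of every bar, which is the per-site total.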

In-depth Technical Analysis

The method from the second answer to the original question, df.groupby(['NFF', 'ABUSE']).size().unstack(), while concise, has potential issues. It treats NFF and ABUSE as column names, whereas in the actual data they are values in the Abuse/NFF column, so the call raises a KeyError on this dataset. Additionally, size() counts every row in a group, including rows with NaN in other columns, while count() counts only the non-NaN values of the selected column, so the two can differ when data is missing. Rows whose grouping key is itself NaN are excluded from both by default (dropna=True).
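
The difference between size() and count() is easiest to see on a tiny frame. The demo DataFrame and its site and fault columns below are invented for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value and one missing grouping key
demo = pd.DataFrame({
    'site': ['A', 'A', 'B', np.nan],
    'fault': ['NFF', np.nan, 'ABUSE', 'NFF'],
})

by_site = demo.groupby('site')['fault']
print(by_site.size())   # rows per group, NaN faults included
print(by_site.count())  # non-NaN faults per group

# Rows whose key is NaN only form a group when dropna=False
print(demo.groupby('site', dropna=False).size())
```

Group A has two rows but only one non-missing fault, so size() and count() disagree there; the row with a missing site appears only in the last result.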

A more robust implementation should consider the following factors:

# Treat the "-" placeholder as a missing value
df_clean = df.copy()
df_clean['Abuse/NFF'] = df_clean['Abuse/NFF'].replace('-', pd.NA)

# Count rows per site and fault type; groupby skips rows with a missing key,
# so sites that have only "-" entries do not appear in df_final
df_final = df_clean.groupby(['Site Name', 'Abuse/NFF']).size().unstack(fill_value=0)

# Ensure consistent column order
if 'ABUSE' in df_final.columns and 'NFF' in df_final.columns:
    df_final = df_final[['ABUSE', 'NFF']]
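
One consequence of converting "-" to a missing value is that sites with no recorded fault type drop out of the grouped table. If those sites should still appear on the chart as empty bars, one possible approach is to reindex against the full site list. This is a sketch under that assumption; the all_sites and df_all names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Site Name': ['NORTH ACTON', 'WASHINGTON', 'WASHINGTON', 'BELFAST', 'CROYDON'],
    'Abuse/NFF': ['ABUSE', '-', 'NFF', '-', '-'],
})

df_clean = df.copy()
df_clean['Abuse/NFF'] = df_clean['Abuse/NFF'].replace('-', pd.NA)

# groupby skips NA keys, so BELFAST and CROYDON drop out here
df_final = df_clean.groupby(['Site Name', 'Abuse/NFF']).size().unstack(fill_value=0)

# Reindex against every site and both expected fault types, filling gaps with 0
all_sites = sorted(df['Site Name'].unique())
df_all = df_final.reindex(index=all_sites, columns=['ABUSE', 'NFF'], fill_value=0)
print(df_all)
```

Passing columns=['ABUSE', 'NFF'] to reindex also fixes the column order and creates a zero column if one fault type never occurs in the data.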

Best Practices and Extended Applications

In practical applications, we can further optimize the visualization:

# Add data labels and percentage display
ax = df_grouped.plot(kind='bar', stacked=True, figsize=(12, 7))

# Total height of each stacked bar (one total per site)
totals = df_grouped.sum(axis=1).to_numpy()

# Each container holds the segments of one fault type, one segment per site,
# so every segment is paired with the total of the site it belongs to
for container in ax.containers:
    for rect, total in zip(container, totals):
        height = rect.get_height()
        if height > 0:
            percentage = f'{height / total * 100:.1f}%'
            ax.text(rect.get_x() + rect.get_width() / 2,
                    rect.get_y() + height / 2,
                    f'{int(height)}\n({percentage})',
                    ha='center', va='center',
                    fontsize=9, color='white')

plt.show()
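
Matplotlib 3.4 and later also provide Axes.bar_label, which can place this kind of label without computing text positions by hand. The sketch below assumes such a version is installed and uses made-up counts, not the article's data, to make the percentages more visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative counts, not taken from the original data
counts = pd.DataFrame(
    {'ABUSE': [3, 1], 'NFF': [1, 2]},
    index=['NORTH ACTON', 'WASHINGTON'],
)

ax = counts.plot(kind='bar', stacked=True, figsize=(12, 7))
totals = counts.sum(axis=1).to_numpy()  # height of each full stacked bar

for container in ax.containers:
    # container.datavalues holds one segment height per site
    labels = [
        f'{int(v)}\n({v / t * 100:.1f}%)' if v > 0 else ''
        for v, t in zip(container.datavalues, totals)
    ]
    # label_type='center' puts each label in the middle of its segment
    ax.bar_label(container, labels=labels, label_type='center',
                 fontsize=9, color='white')
# Call plt.show() or ax.figure.savefig(...) to view the result
```

The percentages are relative to each site's total, so the labels within one stacked bar add up to 100%.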

Conclusion

Through this detailed tutorial, we have demonstrated the complete workflow for creating stacked bar charts using Pandas and Matplotlib. Key steps include data preprocessing, data reshaping using groupby and unstack, handling missing values, and final visualization implementation. This approach is not only applicable to the fault data in the example but can also be widely used in various scenarios requiring display of categorical data composition structures. After mastering these techniques, readers can flexibly adapt the code to suit different data characteristics and visualization requirements.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.