Keywords: Matplotlib | Scatter Plot | Categorical Variables | Data Visualization | Python Plotting | Color Mapping
Abstract: This article is a guide to creating scatter plots that use a different color for each level of a categorical variable with Matplotlib in Python. Using the diamonds dataset, it demonstrates three approaches: calling Matplotlib's scatter function directly with a color mapping, delegating the work to the Seaborn library, and plotting group by group with pandas' groupby method. The article covers the implementation principles, code details, and applicable scenarios for each method and compares their advantages and limitations. It also offers practical techniques for custom color schemes, legend creation, and visualization optimization, helping readers master categorical coloring in pure Matplotlib environments.
Introduction
In data visualization, scatter plots are fundamental tools for displaying relationships between two continuous variables. When data contains categorical variables, using different colors for distinct categories significantly enhances chart readability and information content. This article explores in detail how to implement this functionality in Python using Matplotlib, based on the diamonds dataset.
Data Preparation and Import
First, import necessary libraries and load the data:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.lines import Line2D
# Load diamonds dataset
df = sns.load_dataset('diamonds')
# Examine data structure
print(df.head())
print(f"Dataset shape: {df.shape}")
print(f"Color categories: {df['color'].unique()}")
The diamonds dataset contains various attributes of diamonds, where the color column includes letter grades from D to J, representing diamond color grades. Our objective is to distinguish these grades with different colors in the scatter plot.
Basic Matplotlib Implementation
Matplotlib's scatter function provides a c parameter to directly specify colors for each data point:
fig, ax = plt.subplots(figsize=(10, 8))
# Define color mapping dictionary
colors = {
    'D': 'tab:blue', 'E': 'tab:orange', 'F': 'tab:green',
    'G': 'tab:red', 'H': 'tab:purple', 'I': 'tab:brown', 'J': 'tab:pink'
}
# Create scatter plot
scatter = ax.scatter(df['carat'], df['price'],
                     c=df['color'].map(colors),
                     alpha=0.6, s=20)
# Set axis labels
ax.set_xlabel('Carat')
ax.set_ylabel('Price (USD)')
ax.set_title('Diamond Price vs Carat by Color Grade')
# Add legend
handles = [Line2D([0], [0], marker='o', color='w',
                  markerfacecolor=v, label=k, markersize=10)
           for k, v in colors.items()]
ax.legend(title='Color Grade', handles=handles,
          bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
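The core of this approach lies in df['color'].map(colors), which replaces each category label with its color value, producing one color per data point. The tab: names such as tab:blue refer to the Tableau palette that Matplotlib also uses for its default color cycle, so the seven grades remain easy to tell apart. Any category missing from the dictionary is mapped to NaN, so the dictionary should cover every value that occurs in the column.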
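Since the mapping step drives the whole plot, it can help to inspect what it produces and to guard against labels the dictionary does not cover. The following is a minimal sketch on a hypothetical four-row stand-in for the diamonds data; the grade 'Z' and the 'lightgray' fallback color are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
sample = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70, 1.01],
    "price": [326, 335, 2757, 4500],
    "color": ["E", "J", "D", "Z"],  # "Z" is deliberately absent from the mapping
})
colors = {"D": "tab:blue", "E": "tab:orange", "J": "tab:pink"}

# Map each label to its color; labels without an entry become NaN
point_colors = sample["color"].map(colors)

# Collect the labels that were not covered by the dictionary
unmapped = sorted(sample.loc[point_colors.isna(), "color"].unique())

# One possible fallback: draw uncovered labels in a neutral color
point_colors_filled = point_colors.fillna("lightgray")

print("Unmapped labels:", unmapped)
print(point_colors_filled.tolist())
```

Checking for unmapped labels before plotting tends to give a clearer error than the one raised later when the color values are parsed.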
Simplified Implementation with Seaborn
While users may prefer pure Matplotlib solutions, understanding Seaborn alternatives remains valuable:
# Method 1: Using scatterplot function
plt.figure(figsize=(10, 8))
sns.scatterplot(data=df, x='carat', y='price', hue='color')
plt.title('Diamond Price vs Carat (Seaborn)')
plt.show()
# Method 2: Using lmplot function
sns.lmplot(data=df, x='carat', y='price', hue='color',
           fit_reg=False, height=8, aspect=1.2)
plt.title('Diamond Price vs Carat (Seaborn lmplot)')
plt.show()
Seaborn automatically handles color mapping and legend generation, significantly simplifying the code. Its hue parameter is specifically designed for color differentiation of categorical variables.
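If the Seaborn figure should use the same colors as a hand-built Matplotlib figure, the hue parameter can be combined with an explicit palette and hue_order. The sketch below assumes a small hypothetical DataFrame in place of the diamonds data and uses a non-interactive backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the diamonds data
demo = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9, 1.1, 1.3],
    "price": [400, 1200, 2500, 3900, 5600, 7800],
    "color": ["D", "E", "J", "D", "E", "J"],
})
colors = {"D": "tab:blue", "E": "tab:orange", "J": "tab:pink"}

fig, ax = plt.subplots(figsize=(6, 4))
# palette fixes the color per level, hue_order fixes the legend order
sns.scatterplot(data=demo, x="carat", y="price", hue="color",
                hue_order=list(colors), palette=colors, ax=ax)
ax.get_legend().set_title("Color Grade")

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
plt.close(fig)
```

Passing an existing Axes through ax keeps the rest of the figure setup in plain Matplotlib.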
Pandas GroupBy Approach
Another pure Matplotlib method utilizes pandas' groupby functionality:
fig, ax = plt.subplots(figsize=(10, 8))
# Use the same color mapping
# observed=True skips categories that have no rows in the data
grouped = df.groupby('color', observed=True)
for color_grade, group in grouped:
    group.plot(ax=ax, kind='scatter', x='carat', y='price',
               label=color_grade, color=colors[color_grade],
               alpha=0.6, s=20)
ax.set_xlabel('Carat')
ax.set_ylabel('Price (USD)')
ax.set_title('Diamond Price vs Carat (GroupBy Method)')
ax.legend(title='Color Grade', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Although this method involves slightly more code, it provides finer control over each group's plotting parameters, suitable for scenarios requiring customizations per category.
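As an illustration of that per-category control, the loop can look up a whole style dictionary for each group instead of a single color. This is a sketch under assumptions: the data are synthetic, and the styles dictionary with its marker and size choices is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the diamonds data: 20 rows per grade
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, 60),
    "color": np.repeat(["D", "G", "J"], 20),
})
demo["price"] = 4000 * demo["carat"] + rng.normal(0, 300, 60)

# Hypothetical per-category styles: color, marker shape, and marker size
styles = {
    "D": {"color": "tab:blue", "marker": "o", "s": 30},
    "G": {"color": "tab:red", "marker": "s", "s": 30},
    "J": {"color": "tab:pink", "marker": "^", "s": 50},
}

fig, ax = plt.subplots(figsize=(6, 4))
for grade, group in demo.groupby("color"):
    # Each group becomes its own artist, so it gets its own legend entry
    ax.scatter(group["carat"], group["price"], label=grade,
               alpha=0.7, **styles[grade])
ax.legend(title="Color Grade")

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
plt.close(fig)
```

Because every group is a separate artist, the legend needs no manually constructed handles.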
Advanced Techniques and Optimization
Dynamic Color Generation
When the number of categories is uncertain, colors can be generated dynamically:
# Get the categories in a stable, sorted order
unique_colors = sorted(df['color'].unique())
# Generate colors from a qualitative colormap; Set3 holds 12 discrete
# colors, so index it by integer and wrap around if there are more categories
cmap = plt.cm.Set3
color_map = {color: cmap(i % cmap.N)
             for i, color in enumerate(unique_colors)}
fig, ax = plt.subplots(figsize=(10, 8))
# Pass a plain list of RGBA tuples, one per row
ax.scatter(df['carat'], df['price'],
           c=[color_map[grade] for grade in df['color']],
           alpha=0.6, s=20)
# Dynamically generate legend
handles = [Line2D([0], [0], marker='o', color='w',
                  markerfacecolor=color_map[color], label=color, markersize=10)
           for color in unique_colors]
ax.legend(handles=handles, title='Color Grade',
          bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
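The dynamic mapping can also be wrapped in a small reusable function. The sketch below is one possible shape for it: build_color_map is a hypothetical helper name, and the tab10 default is an assumption chosen because its entries match the tab: color names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def build_color_map(categories, cmap_name="tab10"):
    """Hypothetical helper: assign one discrete colormap entry per category."""
    cmap = plt.get_cmap(cmap_name)
    ordered = sorted(set(categories))
    # Integer indexing picks discrete entries; the modulo wraps around
    # when there are more categories than colors in the colormap
    return {cat: cmap(i % cmap.N) for i, cat in enumerate(ordered)}


grades = ["G", "D", "J", "E", "D", "F", "H", "I", "G"]
color_map = build_color_map(grades)

for grade, rgba in color_map.items():
    print(grade, rgba)
```

Sorting the categories first keeps the assignment stable across runs, regardless of the order in which labels appear in the data.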
Handling Numerous Categories
When dealing with many categories, visual distinguishability must be considered:
# Use cyclic colors (the default cycle has 10 entries, so colors repeat
# once there are more categories than that)
from itertools import cycle
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
color_map = {color: next(color_cycle) for color in df['color'].unique()}
# Or use specialized palettes (seaborn was imported above as sns)
palette = sns.color_palette("husl", len(df['color'].unique()))
color_map = dict(zip(df['color'].unique(), palette))
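When there are more categories than clearly distinguishable colors, one option is to vary the marker shape as well, so that each category gets a unique color and marker pair. The following sketch assumes 14 hypothetical category labels and a hypothetical set of four markers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

base_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
markers = ["o", "s", "^", "D"]                     # hypothetical marker set
categories = [f"cat_{i:02d}" for i in range(14)]   # hypothetical labels

style_map = {}
for i, cat in enumerate(categories):
    # Walk through the colors first, then move on to the next marker
    color = base_colors[i % len(base_colors)]
    marker = markers[(i // len(base_colors)) % len(markers)]
    style_map[cat] = (color, marker)

n_colors = len({color for color, _ in style_map.values()})
n_styles = len(set(style_map.values()))
print(f"{len(categories)} categories, {n_colors} colors, {n_styles} distinct styles")
```

Since scatter accepts only one marker per call, such a style map pairs naturally with the group-by-group plotting loop.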
Performance Comparison and Selection Guidelines
The three main methods each have distinct advantages:
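- Direct Matplotlib: One scatter call draws every point, and colors, sizes, and legend are fully under your control, although the legend must be built by hand
- Seaborn: Concise code, aesthetically pleasing defaults, ideal for rapid exploration
- pandas GroupBy: Clear grouping logic with one artist per category, so each group can be styled separately and legend entries come from the label argument
For large datasets, a single call to Matplotlib's scatter function typically has the least overhead, because it creates one collection for all points instead of one per category. With only a handful of categories the difference is often small, so it is worth measuring on your own data.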
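A rough way to check this on a given machine is to time both variants, including the rendering step. The sketch below uses synthetic data of an assumed size (20,000 rows) and hypothetical function names; the numbers it prints depend on hardware and library versions and are not a benchmark result.

```python
import time

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000  # assumed size; adjust to match your data
palette = {"D": "tab:blue", "E": "tab:orange", "F": "tab:green",
           "G": "tab:red", "H": "tab:purple", "I": "tab:brown", "J": "tab:pink"}
demo = pd.DataFrame({
    "carat": rng.uniform(0.2, 3.0, n),
    "price": rng.uniform(300, 18000, n),
    "color": rng.choice(list(palette), n),
})


def time_single_call(df):
    """Time one scatter call with a per-point color list, plus rendering."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    ax.scatter(df["carat"], df["price"],
               c=df["color"].map(palette).tolist(), s=5)
    fig.canvas.draw()  # force rendering so drawing cost is included
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed


def time_grouped(df):
    """Time one scatter call per category, plus rendering."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    for grade, group in df.groupby("color"):
        ax.scatter(group["carat"], group["price"],
                   color=palette[grade], s=5, label=grade)
    fig.canvas.draw()
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed


single_s = time_single_call(demo)
grouped_s = time_grouped(demo)
print(f"single call: {single_s:.3f} s, grouped loop: {grouped_s:.3f} s")
```

Repeating each measurement several times and taking the median gives a steadier picture than a single run.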
Conclusion
This article covered several methods for creating scatter plots that color points by a categorical variable using Matplotlib in Python. The key idea is the mapping from category to color, together with an understanding of how Matplotlib draws each artist. While higher-level libraries like Seaborn provide convenient wrappers, knowing the pure Matplotlib implementation matters for custom requirements and production deployments. With a well-chosen color scheme, a clear legend, and some attention to performance, the resulting visualizations can be both attractive and practically useful.