Keywords: Matplotlib | scatter_plot | color_mapping | class_labels | data_visualization
Abstract: This paper comprehensively explores techniques for assigning distinct colors to data points in scatter plots based on class labels using Python's Matplotlib library. Beginning with fundamental principles of simple color mapping using ListedColormap, the article delves into advanced methodologies employing BoundaryNorm and custom colormaps for handling multi-class discrete data. Through comparative analysis of different implementation approaches, complete code examples and best practice recommendations are provided, enabling readers to master effective categorical information encoding in data visualization.
Introduction and Problem Context
In the field of data visualization, scatter plots serve as fundamental tools for displaying two-dimensional data distributions. When data points possess class labels, encoding these categories through color significantly enhances chart readability and information expressiveness. This paper systematically explores color mapping techniques based on class labels in Matplotlib, building upon technical discussions from Stack Overflow.
Basic Color Mapping Methods
For scenarios with limited category counts, the most straightforward approach utilizes ListedColormap. The core concept involves predefining a color list for each category, then passing class labels through the c parameter to the scatter function.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Example data
x = [4, 8, 12, 16, 1, 4, 9, 16]
y = [1, 4, 9, 16, 4, 8, 12, 3]
label = [0, 1, 2, 3, 0, 1, 2, 3]
colors = ['red', 'green', 'blue', 'purple']
# Create figure
fig = plt.figure(figsize=(8, 8))
# Key step: Color mapping using ListedColormap
plt.scatter(x, y, c=label, cmap=matplotlib.colors.ListedColormap(colors))
# Add colorbar and configure ticks
cb = plt.colorbar()
loc = np.arange(0, max(label), max(label) / float(len(colors)))
cb.set_ticks(loc)
cb.set_ticklabels(colors)
This method's advantage lies in its simplicity and clarity of color-category correspondence. However, when dealing with numerous categories, manually defining color lists becomes impractical, and it is difficult to guarantee that the colors remain visually distinguishable.
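One detail worth checking in the simple approach: when c holds numbers, scatter scales them by the minimum and maximum of the data it receives, so a class that happens to be absent shifts the colors of the others. The colorbar ticks computed above also sit on the lower edge of each color band rather than at its center. The following is a minimal sketch of one way to pin both down, assuming labels run from 0 to k - 1; the names norm, point_rgba, and tick_centers are illustrative and do not come from the original answer.

```python
import numpy as np
from matplotlib.colors import ListedColormap, Normalize, to_rgba

colors = ['red', 'green', 'blue', 'purple']
cmap = ListedColormap(colors)
k = len(colors)

# Fix the color limits to the full label range [0, k - 1] instead of
# letting them follow whatever labels happen to be present in the data.
norm = Normalize(vmin=0, vmax=k - 1)

# Class 3 is absent here; labels 0, 1, 2 still map to red, green, blue.
label = np.array([0, 1, 2, 0, 1, 2])
point_rgba = cmap(norm(label))

# The colorbar spans [0, k - 1] and is split into k equal bands,
# so the center of band i lies at (i + 0.5) * (k - 1) / k.
tick_centers = (np.arange(k) + 0.5) * (k - 1) / k
```

To use this in the plot, pass norm=norm to plt.scatter together with the ListedColormap, then call cb.set_ticks(tick_centers) before cb.set_ticklabels(colors).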
Advanced Discrete Color Mapping Techniques
For datasets containing a large number of categories, more generalized solutions are required. The approach proposed in Answer 1 achieves scalable discrete color encoding by combining BoundaryNorm with custom colormaps.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Define number of categories
N = 23
# Generate example data
np.random.seed(42)
x = np.random.rand(1000)
y = np.random.rand(1000)
tag = np.random.randint(0, N, 1000)
# Create figure and axes
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
# Extract all colors from jet colormap
cmap = plt.cm.jet
cmaplist = [cmap(i) for i in range(cmap.N)]
# Create custom colormap (from_list is a LinearSegmentedColormap class method)
cmap = mpl.colors.LinearSegmentedColormap.from_list('Custom cmap', cmaplist, cmap.N)
# Define boundaries and create BoundaryNorm object
bounds = np.linspace(0, N, N + 1)
norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
# Generate scatter plot (s needs one marker size per point, so it must match x and y in length)
scat = ax.scatter(x, y, c=tag, s=np.random.randint(100, 500, x.size),
                  cmap=cmap, norm=norm)
# Add colorbar
cb = plt.colorbar(scat, spacing='proportional', ticks=bounds)
cb.set_label('Custom cbar')
ax.set_title('Discrete color mappings')
plt.show()
Technical highlights of this approach include:
- Colormap Extraction: Extracting all color values from a continuous colormap (e.g., jet) in preparation for discretization.
- Boundary Normalization: Using BoundaryNorm to partition the continuous color space into discrete intervals, each corresponding to a category.
- Proportional Spacing: Passing spacing='proportional' makes each colorbar segment proportional to the width of its interval; because the boundaries here are evenly spaced, every category receives an equal share of the colorbar.
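To make the boundary normalization step concrete, the sketch below calls a BoundaryNorm directly on every possible tag value and inspects the colormap index it assigns. It assumes a 256-entry colormap (the usual size of jet's lookup table); the names tags and color_index are illustrative only.

```python
import numpy as np
from matplotlib.colors import BoundaryNorm

N = 23                                    # number of categories, as in the example above
bounds = np.linspace(0, N, N + 1)         # edges 0, 1, ..., N: one unit-wide bin per tag
norm = BoundaryNorm(bounds, ncolors=256)  # 256 is an assumed colormap size

tags = np.arange(N)                       # every possible tag value 0 .. N-1
color_index = norm(tags)                  # colormap index selected for each tag

# Each tag falls into its own bin, and the bins are spread across the
# colormap, so every tag receives a different index.
n_distinct = len(set(color_index.tolist()))
```

Because the bins are stretched over the whole colormap, the first tag takes the first color and later tags step through the remaining colors in increasing order.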
Technical Comparison and Best Practices
Answer 2 proposes a simplified method using modulo operations to generate color indices:
import numpy
import pylab
xy = numpy.zeros((2, 1000))
xy[0] = range(1000)
xy[1] = range(1000)
colors = [int(i % 23) for i in xy[0]]
pylab.scatter(xy[0], xy[1], c=colors, cmap=pylab.cm.cool)
pylab.show()
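While concise, this method offers no explicit color-category correspondence: each index is simply placed along a continuous colormap (here cool), so neighboring categories receive similar shades, and colors repeat once the category count exceeds the number of colors in the colormap.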
Answer 3 demonstrates conditional list comprehension application:
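import matplotlib.pyplot as plt
arr1 = [1, 2, 3, 4, 5]
arr2 = [2, 3, 3, 4, 4]
labl = [0, 1, 1, 0, 0]
# One explicit color per point: red for class 0, green otherwise
color = ['red' if l == 0 else 'green' for l in labl]
plt.scatter(arr1, arr2, color=color)
plt.show()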
This approach suits scenarios with minimal categories but suffers from poor scalability, requiring conditional statements for each category.
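One way to avoid a growing chain of conditionals is a lookup table from label to color. The following is a minimal sketch and not part of the original answers; the palette dictionary and its color choices are hypothetical.

```python
# Hypothetical label-to-color table; adding a class means adding one entry.
palette = {0: 'red', 1: 'green', 2: 'blue'}

labl = [0, 1, 1, 0, 2]

# Look up one color name per point; an unknown label raises KeyError,
# which surfaces encoding mistakes early.
point_colors = [palette[l] for l in labl]
```

The resulting list can be passed to plt.scatter through the color argument exactly as in the snippet above.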
Implementation Details and Considerations
Several critical details require attention in practical applications:
- Label Encoding: Category labels should be integers numbered consecutively from 0 to ensure proper correspondence with colormap entries.
- Color Selection: For categorical data, qualitative colormaps such as Set3, Set2, or tab20c are recommended; continuous colormaps such as jet are not perceptually uniform and give neighboring categories similar, hard-to-distinguish colors.
- Colorbar Customization: Precise control over colorbar display content is achievable through set_ticks() and set_ticklabels().
- Performance Considerations: For extremely large datasets (>10^6 points), small markers such as marker='.' may reduce rendering cost, and a lower alpha value helps keep dense regions readable.
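As an illustration of the color selection advice, the sketch below builds a discrete colormap from the first k entries of the qualitative tab10 palette that ships with Matplotlib. The value k = 4 and the names class_cmap and point_rgba are assumptions made for the example.

```python
import numpy as np
from matplotlib import cm
from matplotlib.colors import ListedColormap

k = 4                       # hypothetical class count (tab10 offers at most 10 colors)
base = cm.tab10             # qualitative palette stored as a ListedColormap

# Keep only the first k colors so that label i corresponds to color i.
class_cmap = ListedColormap(base.colors[:k], name='tab10_first_k')

labels = np.array([0, 1, 2, 3, 1, 0])

# Integer input indexes the color list directly (no normalization step).
point_rgba = class_cmap(labels)
```

For more than ten classes, a larger qualitative palette such as tab20 could be sliced in the same way, at the cost of colors that are harder to tell apart.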
Application Scenarios and Extensions
Color mapping by class labels finds extensive applications across multiple domains:
- Machine Learning Visualization: Displaying decision boundaries and sample distributions of classification algorithms.
- Bioinformatics: Labeling different cell types or gene expression patterns.
- Geographic Information Systems: Distinguishing various land use types or administrative divisions.
For more complex applications, consider these extensions:
# Example handling non-consecutive category labels
unique_labels = np.unique(tag)
norm_labels = np.searchsorted(unique_labels, tag)
# Use normalized labels for color mapping
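A small worked example may help show what the normalization above produces. The label values below are invented for illustration; the second half assumes string labels and uses np.unique with return_inverse=True, which yields the same kind of 0-based codes in a single call.

```python
import numpy as np

# Hypothetical non-consecutive integer labels
tag = np.array([10, 42, 42, 7, 10, 99])
unique_labels = np.unique(tag)                     # sorted distinct labels: [7, 10, 42, 99]
norm_labels = np.searchsorted(unique_labels, tag)  # -> [1 2 2 0 1 3]

# Hypothetical string labels: np.unique returns the codes directly
species = np.array(['setosa', 'virginica', 'setosa', 'versicolor'])
names, codes = np.unique(species, return_inverse=True)
# names -> ['setosa' 'versicolor' 'virginica'], codes -> [0 2 0 1]
```

The codes can be passed to scatter through c, while the sorted unique labels serve as colorbar tick labels.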
Conclusion
This paper systematically introduces color mapping techniques based on class labels in Matplotlib. From simple ListedColormap approaches to sophisticated BoundaryNorm solutions, different methods suit varying application scenarios. For most applications, the generalized method from Answer 1 is recommended, balancing flexibility, scalability, and usability. In practice, developers should select appropriate technical solutions based on specific requirements, while paying attention to practical details such as color perception and performance optimization.
By properly applying these techniques, data scientists and developers can create information-rich, visually appealing scatter plots that effectively communicate categorical information within data, supporting deeper data analysis and insight discovery.