Creating Category-Based Scatter Plots: Integrated Application of Pandas and Matplotlib

Keywords: Scatter Plot | Data Grouping | Matplotlib | Pandas | Data Visualization

Abstract: This article explores how to create category-based scatter plots with Pandas and Matplotlib. After examining the limitations of a naive first approach, it introduces a strategy that segments the data with groupby() and plots each group in turn, and explains color configuration, legend generation, and style optimization. The article also compares Seaborn as an alternative solution.

Problem Background and Initial Approach Analysis

In data visualization practice, scatter plots often need to distinguish points by a categorical variable. A natural first attempt is to call ax.scatter() and pass the category column as the color argument, c=df['key1']:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end)
df = pd.DataFrame(np.random.normal(10, 1, 30).reshape(10, 3),
                  index=pd.date_range('2010-01-01', freq='M', periods=10),
                  columns=('one', 'two', 'three'))
df['key1'] = (4,4,4,6,6,6,8,8,8,8)

fig1 = plt.figure(1)
ax1 = fig1.add_subplot(111)
ax1.scatter(df['one'], df['two'], marker='o', c=df['key1'], alpha=0.8)
plt.show()

This does vary the marker color with key1, but it has two limitations. First, Matplotlib treats the numeric values as a continuous quantity and maps them through a colormap, so the colors suggest an ordering and a distance between categories rather than distinct classes. Second, the whole plot is a single artist with no per-category labels, so ax.legend() has nothing from which to build a categorical legend, which reduces readability.
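Both limitations can be checked directly on the object that scatter() returns. The following is a minimal sketch, assuming made-up random data in place of the DataFrame above (the variable names rng, key, and positions are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: the same category values as key1, random coordinates
rng = np.random.default_rng(0)
key = np.array([4, 4, 4, 6, 6, 6, 8, 8, 8, 8])
x, y = rng.normal(10, 1, size=(2, key.size))

fig, ax = plt.subplots()
sc = ax.scatter(x, y, marker='o', c=key, alpha=0.8)

# Limitation 1: the values are normalized linearly onto a continuous colormap.
vmin, vmax = sc.get_clim()        # color limits come from min(key) and max(key)
positions = sc.norm([4, 6, 8])    # position of each category on the 0..1 color scale

# Limitation 2: the single collection carries no label, so no legend entries exist.
handles, legend_labels = ax.get_legend_handles_labels()
```

Here the three categories land at 0, 0.5, and 1 on the color scale, and legend_labels is empty, which is why calling ax.legend() on this plot produces no useful legend.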

Efficient Plotting Strategy Based on Grouping

For discrete categorical variables, combining groupby() with iterative plotting is a better fit. The idea is to group the data by the categorical column and then plot each group as its own series on the same axes:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set random seed for reproducible results
np.random.seed(1974)

# Generate sample data
num = 20
x, y = np.random.random((2, num))
labels = np.random.choice(['a', 'b', 'c'], num)
df = pd.DataFrame(dict(x=x, y=y, label=labels))

# Group by label
groups = df.groupby('label')

# Create figure and axes
fig, ax = plt.subplots()
ax.margins(0.05)  # Add 5% padding

# Iteratively plot each group
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=name)

# Add legend
ax.legend()
plt.show()

This method has three advantages. Each group is plotted independently, so marker style, size, and color can be customized per group. The label argument gives every group its own legend entry, so ax.legend() builds the legend automatically. And plot() is used instead of scatter() because all points in a group share one color and size, so the per-point styling that scatter() offers is not needed.
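As an illustration of per-group customization, the loop can look up a different marker for each category. This is a minimal sketch under the same sample data as above; the markers dictionary is a hypothetical choice, not part of the original example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1974)
num = 20
x, y = np.random.random((2, num))
labels = np.random.choice(['a', 'b', 'c'], num)
df = pd.DataFrame(dict(x=x, y=y, label=labels))

# Hypothetical mapping from category to marker shape
markers = {'a': 'o', 'b': 's', 'c': '^'}

fig, ax = plt.subplots()
for name, group in df.groupby('label'):
    # Each group becomes its own Line2D with its own marker and legend label
    ax.plot(group['x'], group['y'], marker=markers[name],
            linestyle='', ms=10, label=name)
ax.legend(title='label')

handles, legend_labels = ax.get_legend_handles_labels()
```

Because groupby() yields the groups in sorted key order, the legend entries appear in the same order on every run.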

Style Optimization and Professional Configuration

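To achieve a look similar to the plotting style that older Pandas releases shipped, apply a Matplotlib style sheet and set the color cycle explicitly. The pd.tools.plotting.mpl_stylesheet dictionary and the private _get_standard_colors() helper from older Pandas releases, as well as the Axes.set_color_cycle() method from older Matplotlib releases, are not available in current versions. The example below therefore substitutes the built-in 'ggplot' style, the tab10 colormap, and set_prop_cycle() as stand-ins:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1974)

# Data generation (same as above)
num = 20
x, y = np.random.random((2, num))
labels = np.random.choice(['a', 'b', 'c'], num)
df = pd.DataFrame(dict(x=x, y=y, label=labels))

groups = df.groupby('label')

# Apply a style sheet (stand-in for the removed pd.tools.plotting.mpl_stylesheet)
plt.style.use('ggplot')
# One color per group from the qualitative tab10 colormap
# (stand-in for the private _get_standard_colors() helper)
colors = [plt.cm.tab10(i) for i in range(groups.ngroups)]

fig, ax = plt.subplots()
ax.set_prop_cycle(color=colors)  # replaces the removed set_color_cycle()
ax.margins(0.05)

for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=name)

# Optimize legend display
ax.legend(numpoints=1, loc='upper left')
plt.show()

Key technical points: plt.style.use() loads a style sheet; the tab10 colormap supplies a sequence of visually distinct colors; set_prop_cycle(color=...) makes successive plot() calls draw from that sequence in order; and numpoints=1 shows a single marker per legend entry, which is already the default since Matplotlib 2.0.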
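A color cycle assigns colors by plotting order, so a category's color can change when one group is missing from a particular dataset. One way to keep colors stable across figures is to build the category-to-color mapping once and pass color explicitly. The following is a minimal sketch, assuming the tab10 colormap; category_colors is a hypothetical helper, not a Pandas or Matplotlib function:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import pandas as pd


def category_colors(categories, cmap_name="tab10"):
    """Hypothetical helper: map each category to a fixed hex color."""
    cmap = plt.get_cmap(cmap_name)
    ordered = sorted(set(categories))
    # Wrap around if there are more categories than colors in the colormap
    return {cat: to_hex(cmap(i % cmap.N)) for i, cat in enumerate(ordered)}


# Build the mapping once from the full set of expected categories
palette = category_colors(['a', 'b', 'c'])

# A dataset in which category 'b' happens to be absent
df = pd.DataFrame({'x': [0.1, 0.4, 0.7, 0.9],
                   'y': [0.3, 0.8, 0.2, 0.6],
                   'label': ['a', 'c', 'c', 'a']})

fig, ax = plt.subplots()
for name, group in df.groupby('label'):
    ax.plot(group['x'], group['y'], marker='o', linestyle='', ms=12,
            color=palette[name], label=name)
ax.legend(loc='upper left')

# Colors actually used, keyed by legend label
plotted = {line.get_label(): line.get_color() for line in ax.get_lines()}
```

With this mapping, 'c' keeps the third tab10 color even though 'b' is not plotted, whereas a plain color cycle would have given it the second.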

Alternative Solutions and Technical Comparison

Beyond native Matplotlib-based methods, the Seaborn library offers a more concise API:

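import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(1974)

# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end)
df = pd.DataFrame(
    np.random.normal(10, 1, 30).reshape(10, 3),
    index=pd.date_range('2010-01-01', freq='M', periods=10),
    columns=('one', 'two', 'three'))
df['key1'] = (4, 4, 4, 6, 6, 6, 8, 8, 8, 8)

# hue colors the points by key1 and adds a legend
sns.scatterplot(x="one", y="two", data=df, hue="key1")
plt.show()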

Seaborn's scatterplot function automatically handles categorical variables through the hue parameter, with built-in color mapping and legend generation. For multivariate data analysis, sns.pairplot(vars=["one","two","three"], data=df, hue="key1") can generate a scatter plot matrix.
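One caveat: key1 holds numbers, and Seaborn may treat a numeric hue column as a continuous quantity and choose a sequential palette for it. A minimal sketch of one way to force categorical treatment is to convert the column to strings first; the key1_label column name is an assumption introduced for this example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(1974)
df = pd.DataFrame(np.random.normal(10, 1, 30).reshape(10, 3),
                  columns=('one', 'two', 'three'))
df['key1'] = (4, 4, 4, 6, 6, 6, 8, 8, 8, 8)

# String values are always interpreted as categories, not as a numeric scale
df['key1_label'] = df['key1'].astype(str)

ax = sns.scatterplot(x="one", y="two", data=df, hue="key1_label")

# Text of the legend entries that Seaborn generated
legend_texts = [t.get_text() for t in ax.get_legend().get_texts()]
```

The legend then lists one entry per category value, drawn from a qualitative palette.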

Technical Summary

Scatter plots of discrete categories depend on how the data is grouped and how each group is styled. Looping over groupby() and plotting each group offers the most control over markers, colors, and legend entries, while Seaborn's hue parameter covers the common case in a single call. The choice between them depends on data scale, customization requirements, performance, and consistency with the team's existing technology stack.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
