Plotting Multiple Distributions with Seaborn: A Practical Guide Using the Iris Dataset

Nov 28, 2025 · Programming

Keywords: Seaborn | Distribution Visualization | Kernel Density Estimation | Multiple Distribution Comparison | Python Data Visualization

Abstract: This article provides a comprehensive guide to visualizing multiple distributions using Seaborn in Python. Using the classic Iris dataset as an example, it demonstrates three implementation approaches: separate plotting via data filtering, automated handling for unknown category counts, and advanced techniques using data reshaping and FacetGrid. The article delves into the advantages and limitations of each method, supplemented with core concepts from Seaborn documentation, including histogram vs. KDE selection, bandwidth parameter tuning, and conditional distribution comparison.

Introduction

Data distribution visualization is a fundamental step in exploratory data analysis, quickly revealing key characteristics such as variable range, central tendency, skewness, and multimodality. When comparing distribution differences across groups, plotting multiple distributions on the same graph is particularly effective. This article uses the classic Iris dataset to systematically explain various implementation schemes for plotting multiple distribution density plots with Seaborn.

Data Preparation and Basic Distribution Plotting

First, load the necessary libraries and dataset:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt

# Load and transform the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                      columns=iris['feature_names'] + ['target'])

The dataset contains 150 samples, each with 4 features (sepal length, sepal width, petal length, petal width) and 1 target variable (3 iris species). The target variable takes values 0, 1, and 2, corresponding to Setosa, Versicolor, and Virginica iris species respectively.
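Note that np.c_ casts everything to float, so the target column holds 0.0, 1.0, and 2.0. A small follow-up step (a sketch using the target_names array that load_iris also returns) maps those codes to species names so plot legends read naturally:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Map the numeric codes 0/1/2 to species names for readable legends;
# dict lookup works even though the column is float (0.0 == 0)
iris_df['species'] = iris_df['target'].map(dict(enumerate(iris['target_names'])))

print(iris_df['species'].unique())
```

The new species column can then be passed wherever a label or hue variable is needed.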

Method 1: Distribution Plotting Based on Data Filtering

The most straightforward approach is to filter data for different categories and plot them sequentially:

# Filter data subsets by target value
target_0 = iris_df.loc[iris_df['target'] == 0]
target_1 = iris_df.loc[iris_df['target'] == 1]
target_2 = iris_df.loc[iris_df['target'] == 2]

# Plot density plots for three distributions separately
sns.distplot(target_0['sepal length (cm)'], hist=False, rug=True, label='Setosa')
sns.distplot(target_1['sepal length (cm)'], hist=False, rug=True, label='Versicolor')
sns.distplot(target_2['sepal length (cm)'], hist=False, rug=True, label='Virginica')

plt.legend()
plt.show()

This method is simple and intuitive, but becomes inefficient and error-prone when the number of categories is unknown or large.

Method 2: Automated Multiple Distribution Plotting

By identifying unique target values and plotting in a loop, we can handle any number of categories:

# Get unique target values
unique_targets = iris_df['target'].unique()

# Create list of data subsets for each target value
target_subsets = [iris_df.loc[iris_df['target'] == val] for val in unique_targets]

# Automatically loop through and plot all distributions
for i, subset in enumerate(target_subsets):
    sns.distplot(subset['sepal length (cm)'], hist=False, rug=True,
                 label=f'Class {unique_targets[i]}')

plt.legend()
plt.show()

This approach offers better scalability and is suitable for scenarios where the number of categories changes dynamically.

In-Depth Analysis of Distribution Visualization Techniques

Seaborn provides multiple distribution visualization tools. Understanding their core concepts is crucial for proper selection and usage.

Histogram vs. Kernel Density Estimation Comparison

Histograms approximate probability density functions through discrete binning and counting, while Kernel Density Estimation (KDE) uses Gaussian kernel smoothing to generate continuous density curves. KDE is generally easier to interpret when comparing multiple distributions but may obscure the true discrete nature of data.
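The difference can be made concrete outside of plotting. The sketch below (using NumPy and SciPy directly rather than Seaborn's internals, on synthetic stand-in data) contrasts discrete binning with a smooth Gaussian KDE that can be evaluated at any point:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=5.8, scale=0.8, size=150)  # stand-in for sepal lengths

# Histogram: discrete bins; density=True scales counts so total area is 1
counts, edges = np.histogram(data, bins=10, density=True)

# KDE: a continuous density curve evaluable at arbitrary points
kde = gaussian_kde(data)
grid = np.linspace(data.min(), data.max(), 200)
density = kde(grid)

hist_area = np.sum(counts * np.diff(edges))
print(f"histogram area: {hist_area:.3f}")
```

Both objects approximate the same probability density; the histogram is piecewise constant over its bins, while the KDE smooths each observation with a Gaussian kernel.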

Importance of Bandwidth Parameter

The smoothness of KDE is controlled by the bandwidth parameter:

# Smaller bandwidth highlights details but may overfit
sns.displot(iris_df, x='sepal length (cm)', hue='target', kind='kde', bw_adjust=0.25)

# Larger bandwidth produces smoother curves but may lose features
sns.displot(iris_df, x='sepal length (cm)', hue='target', kind='kde', bw_adjust=2)

Choosing the appropriate bandwidth requires balancing detail preservation and smoothness.

Strategies for Conditional Distribution Comparison

When using the hue parameter to compare conditional distributions, Seaborn offers multiple display options:

# Layered display (default)
sns.displot(iris_df, x='sepal length (cm)', hue='target', kind='kde')

# Fill curve areas for enhanced readability
sns.displot(iris_df, x='sepal length (cm)', hue='target', kind='kde', fill=True)

# Step-outline histogram (element= applies to the default kind='hist', not to KDE)
sns.displot(iris_df, x='sepal length (cm)', hue='target', element='step')

Modern Seaborn Practice: The displot Function

With the release of Seaborn 0.11.0, distplot was deprecated (and removed in later releases) in favor of the more unified displot function:

# Modern syntax using displot
sns.displot(data=iris_df, x='sepal length (cm)', hue='target', 
           kind='kde', fill=True, palette='Set1', height=5, aspect=1.5)

As a figure-level function, displot automatically creates and manages figure layout, providing a more consistent API and richer customization options.

Advanced Techniques: Data Reshaping and Facet Plotting

For more complex data comparison scenarios, combine data reshaping with facet plotting:

# Transform data to long format
long_df = iris_df.melt(id_vars=['target'], var_name='feature', value_name='measurement')

# Create facet grid to compare all features
g = sns.FacetGrid(long_df, col='feature', hue='target', palette='Set1',
                  col_wrap=2, sharex=False, sharey=False)
g.map(sns.kdeplot, 'measurement', fill=True)
g.add_legend()

This approach is particularly suitable for simultaneously comparing distribution patterns of multiple variables across different categories.

Statistical Considerations in Distribution Visualization

When comparing distributions with different sample sizes, statistical normalization should be considered:

# Independent density normalization: each KDE curve integrates to 1 on its own
sns.displot(iris_df, x='sepal length (cm)', hue='target', kind='kde', common_norm=False)

# Probability normalization: histogram bar heights sum to 1
sns.displot(iris_df, x='sepal length (cm)', hue='target', stat='probability')

common_norm=False ensures each distribution is normalized independently, facilitating comparison across groups with different sample sizes.
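The effect of common_norm is easiest to see with deliberately unequal groups (in Iris itself all three classes have 50 samples, so this sketch uses synthetic data): under independent normalization each curve integrates to 1, while shared normalization scales each curve by its group's share of the data, making the smaller group look flatter:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
big = rng.normal(0.0, 1.0, size=500)    # large group
small = rng.normal(3.0, 1.0, size=50)   # small group

grid = np.linspace(-4, 7, 400)

# Independent normalization (common_norm=False): each curve integrates to ~1
d_big = gaussian_kde(big)(grid)
d_small = gaussian_kde(small)(grid)

# Shared normalization (common_norm=True): curves scaled by group share,
# so together they integrate to ~1 and the small group is compressed
n = len(big) + len(small)
s_big = d_big * len(big) / n
s_small = d_small * len(small) / n

dx = grid[1] - grid[0]
print("independent area:", np.sum(d_small) * dx)
print("shared-norm area:", np.sum(s_small) * dx)
```

With shared normalization the small group's curve carries only its fraction of the total mass, so its shape is hard to read next to the large group; independent normalization puts both on equal footing.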

Practical Application Recommendations

When selecting a distribution visualization method, consider the following factors:

- Number of categories: manual filtering is fine for a few known classes; loop- or hue-based approaches scale to many or unknown classes.
- Detail vs. smoothness: tune the KDE bandwidth (bw_adjust) to balance overfitting against oversmoothing.
- Sample sizes: with unequal group sizes, use common_norm=False so each distribution is normalized independently.
- Number of variables: to compare several features at once, reshape to long format and use facet plotting.

Conclusion

Multiple distribution visualization is a powerful tool for exploratory data analysis. Through the various methods provided by Seaborn, researchers can select the most appropriate implementation scheme based on specific needs. From simple manual filtering to automated loop processing, to advanced data reshaping techniques, each method has its applicable scenarios. Understanding the core concepts and technical details of distribution visualization helps data scientists more effectively communicate patterns and insights within data.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.