Comparative Analysis of Three Methods for Plotting Percentage Histograms with Matplotlib

Keywords: Matplotlib | Histogram | Percentage Visualization | Data Distribution | Python Plotting

Abstract: This paper explores three methods for creating percentage histograms in Matplotlib: custom formatting functions using FuncFormatter, normalization via the density parameter, and the concise approach combining the weights parameter with PercentFormatter. It analyzes the implementation principles, advantages, disadvantages, and applicable scenarios of each method, and examines in detail the recommended solution using weights=np.ones(len(data))/len(data) with PercentFormatter(1). Code examples demonstrate how to avoid global variables and correctly handle data proportion conversion. The paper also contrasts how the alternative methods differ in data normalization and label formatting.

Introduction and Problem Context

In the field of data visualization, histograms serve as essential tools for displaying data distribution characteristics. Standard histograms typically use frequency (the count of data points within each bin) as the vertical axis, but many practical applications require relative frequency or percentage representation. For instance, in statistical analysis, quality control, and market research, percentage displays provide more intuitive understanding of component proportions within the whole.

Basic Histogram Plotting and Vertical Axis Issues

When using Matplotlib's hist() function to create basic histograms, the default vertical axis represents frequency. The following code demonstrates this fundamental approach:

import matplotlib.pyplot as plt

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
plt.hist(data, bins=len(set(data)))
plt.show()

While this representation is straightforward, raw counts make it hard to compare datasets of different sizes: the same bar height represents a different share of the data when the totals differ.

Method 1: Custom Formatting with FuncFormatter

FuncFormatter allows users to customize axis label formatting. The basic approach involves creating a function that converts frequency values to percentage strings. The key challenge lies in passing the total data count to the formatting function.

The original problem presented a solution using global variables:

from matplotlib.ticker import FuncFormatter
import matplotlib

def to_percent(y, position):
    global n
    s = str(round(100 * y / n, 3))
    if matplotlib.rcParams['text.usetex'] is True:
        return s + r'$\%$'
    else:
        return s + '%'

n = len(data)
formatter = FuncFormatter(to_percent)
plt.gca().yaxis.set_major_formatter(formatter)

Although functional, this method exhibits significant design flaws: the use of global variables breaks function encapsulation and reusability, increases code coupling, and hinders maintenance and testing.

A more elegant solution utilizes closures or lambda functions to pass additional parameters:

def create_percent_formatter(total):
    def to_percent(y, position):
        return f'{100 * y / total:.1f}%'
    return to_percent

formatter = FuncFormatter(create_percent_formatter(len(data)))

This approach avoids global variables and enhances code modularity.
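Besides a closure, the total can also be bound with functools.partial from the standard library. This is an alternative to the closure shown in the text, not something the text itself uses. The sketch below is a minimal illustration under assumptions: the three-argument to_percent signature is hypothetical, and the Agg backend is selected only so that the script runs without a display.

```python
from functools import partial

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window required
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def to_percent(y, position, total):
    """Convert a raw count y into a percentage string relative to total."""
    return f"{100 * y / total:.1f}%"


data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

# partial fixes `total`, leaving the (y, position) signature FuncFormatter expects
formatter = FuncFormatter(partial(to_percent, total=len(data)))

fig, ax = plt.subplots()
ax.hist(data, bins=len(set(data)))
ax.yaxis.set_major_formatter(formatter)

# A bar height of 4 (four of the seven points) is labelled as a percentage
print(formatter(4, 0))
```

Because to_percent receives the total as an argument, it can be unit-tested on its own and reused for several datasets without touching module-level state.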

Method 2: Normalization Using the Density Parameter

Matplotlib's hist() function provides a density parameter that, when set to True, normalizes the histogram so the total area of all bars equals 1. This essentially converts frequencies to probability densities.

plt.hist(data, bins=len(set(data)), density=True)

However, this method produces probability density rather than percentages. A bin's share of the data equals its density multiplied by its width, so the vertical axis values depend on the bin width: for the sample data, with bins 2500 units wide, the tallest bar is only about 0.00023, while very narrow bins can push densities above 1. Further conversion is therefore required for percentage display.

To obtain percentage representation, combination with FuncFormatter is necessary:

# Plot first so the actual bin edges can be read back from hist()
counts, edges, patches = plt.hist(data, bins=len(set(data)), density=True)
bin_width = edges[1] - edges[0]  # valid only because all bins have equal width

def density_to_percent(y, position):
    # density * bin width = fraction of the data in a bin
    return f'{100 * y * bin_width:.1f}%'

plt.gca().yaxis.set_major_formatter(FuncFormatter(density_to_percent))

This approach requires an extra conversion step and only works for equal-width bins: when bin widths differ, no single factor converts the density axis into percentages.
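The relationship between density and bin share can be checked numerically with np.histogram, which hist() uses for its binning. The following sketch is an illustration under assumptions: the uneven_edges array is a hypothetical choice, made only to show that each bar needs its own width once the bins are no longer equal.

```python
import numpy as np

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

# Equal-width case: same binning as hist(data, bins=6, density=True)
density, edges = np.histogram(data, bins=len(set(data)), density=True)
widths = np.diff(edges)          # all bins share one width here
share = density * widths         # fraction of the data in each bin
print(np.round(100 * share, 2))

# Unequal-width case (hypothetical edges): a single global factor no longer
# works, so each density value is multiplied by its own bin width
uneven_edges = np.array([1000, 2500, 6000, 16000])
uneven_density, _ = np.histogram(data, bins=uneven_edges, density=True)
uneven_share = uneven_density * np.diff(uneven_edges)
print(np.round(100 * uneven_share, 2))
```

In both cases the shares sum to 1, but only in the equal-width case could a fixed scale on the vertical axis express them.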

Method 3: Combining Weights Parameter with PercentFormatter (Best Practice)

The most concise and effective solution combines the weights parameter with PercentFormatter. This method's core concept involves normalizing each data point's contribution through weighting.

Complete implementation code:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
n = len(data)

# Create weights array with each data point weighted as 1/n
weights = np.ones(n) / n

# Plot histogram with weights
plt.hist(data, weights=weights, bins=len(set(data)), rwidth=0.8)

# Format vertical axis as percentage using PercentFormatter
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))

plt.xlabel('Data Value')
plt.ylabel('Percentage')
plt.title('Percentage Histogram')
plt.grid(True, alpha=0.3)
plt.show()

Technical principle analysis:

Weighting Mechanism: weights=np.ones(n)/n creates an array of length n with each element valued 1/n. When the hist() function calculates frequencies, instead of simple counting, it adds each data point's weight value to its corresponding bin. Since total weights sum to 1, each bin's bar height ultimately represents that bin's proportion of total data points.
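PercentFormatter: The argument 1 in PercentFormatter(1) is xmax, the data value that corresponds to 100%. The formatter divides each vertical axis value by xmax, multiplies by 100, and appends a percent sign. For example, a vertical axis value of 0.43 displays as 43% (the number of decimal places is chosen automatically unless the decimals argument is set).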
Mathematical Verification: For the sample data, bins=len(set(data)) produces 6 equal-width bins of width 2500 between 1000 and 16000. The first bin, [1000, 3500), contains 1000 (appearing twice), 2000, and 3000, totaling 4 data points. The weight sum is 4×(1/7)=4/7≈0.5714, which corresponds to 57.14% on the percentage axis, matching the actual proportion.
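The bin heights can also be checked numerically. The sketch below reproduces the binning with np.histogram and formats each height with PercentFormatter.format_pct. The choice of decimals=2 is an assumption made only so that the labels are easy to compare; the plotting code in the text lets Matplotlib pick the precision.

```python
import numpy as np
from matplotlib.ticker import PercentFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
n = len(data)
weights = np.ones(n) / n  # each point contributes 1/n to its bin

# Same binning that hist() performs internally
heights, edges = np.histogram(data, bins=len(set(data)), weights=weights)

# xmax=1 means a height of 1.0 corresponds to 100%
formatter = PercentFormatter(xmax=1, decimals=2)
labels = [formatter.format_pct(h, 1) for h in heights]

for left, right, label in zip(edges[:-1], edges[1:], labels):
    print(f"{left:.0f} to {right:.0f}: {label}")
```

The heights sum to 1, so the formatted labels add up to 100% apart from rounding.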

Method Comparison and Selection Guidelines

<table border="1">
<tr><th>Method</th><th>Advantages</th><th>Disadvantages</th><th>Applicable Scenarios</th></tr>
<tr><td>Custom FuncFormatter</td><td>High flexibility, complete format control</td><td>Requires additional parameter passing, complex code</td><td>Special formatting or complex conversion needs</td></tr>
<tr><td>Density parameter normalization</td><td>Built-in functionality, high standardization</td><td>Displays probability density rather than percentage</td><td>Probability density estimation, statistical modeling</td></tr>
<tr><td>Weights + PercentFormatter</td><td>Concise code, intuitive results</td><td>Requires understanding of weighting mechanism</td><td>Most percentage histogram requirements</td></tr>
</table>

Advanced Applications and Considerations

1. Unequal-width Bin Handling: The weights method still yields each bin's share of the data with unequal-width bins, since each data point's weight is independent of bin width. With density=True, bar height is the share divided by the bin width, so heights cannot be read as percentages, and two bins holding the same share of the data appear at different heights.

2. Multiple Dataset Comparison: Percentage histograms are particularly suitable for comparing datasets of different scales. For example, when comparing customer age distributions across two time periods, percentage representation provides meaningful comparison even with different sample sizes.

3. Cumulative Percentage Histograms: By setting the cumulative=True parameter, cumulative percentage histograms can be created:

plt.hist(data, weights=weights, bins=len(set(data)), 
         cumulative=True, histtype='step')
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))

4. Performance Considerations: For large datasets, the weights method allocates one extra array of n floating-point values (8 bytes each with NumPy's default float64). This overhead is negligible in most practical applications.
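The multiple-dataset comparison described in item 2 can be sketched as follows. All data here is hypothetical: the two age samples are randomly generated, the variable names are illustrative, and the Agg backend is chosen only so that the script runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window required
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(0)
ages_period_a = rng.integers(18, 70, size=200)    # hypothetical sample of 200
ages_period_b = rng.integers(25, 80, size=1500)   # hypothetical sample of 1500

# Shared bin edges so the bars of both samples line up
shared_edges = np.arange(10, 91, 10)

fig, ax = plt.subplots()
heights = {}
for name, sample in [("Period A", ages_period_a), ("Period B", ages_period_b)]:
    # Each sample is normalised by its own size
    sample_weights = np.ones(len(sample)) / len(sample)
    heights[name], _, _ = ax.hist(sample, bins=shared_edges,
                                  weights=sample_weights, alpha=0.5, label=name)

ax.yaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Age")
ax.set_ylabel("Percentage")
ax.legend()
```

Since each sample is weighted by its own length, both sets of bars sum to 100% and can be compared directly even though one sample is more than seven times larger than the other.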

Conclusion

This paper systematically analyzes three primary methods for implementing percentage histograms in Matplotlib. Considering code conciseness, result accuracy, and maintainability, the combination of weights parameter with PercentFormatter is recommended. This approach not only avoids poor programming practices like global variables but also provides clear, intuitive percentage representation through Matplotlib's built-in functionality. For scenarios requiring more complex format control, FuncFormatter's flexibility can be incorporated. Understanding these methods' principles and differences enables selection of the most appropriate implementation based on specific requirements, enhancing data visualization effectiveness and efficiency.

In practical applications, clear axis labeling should always be maintained to ensure accurate interpretation of visualization results. Percentage histograms, as fundamental yet powerful analytical tools, hold significant value in data exploration and result presentation.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.