Keywords: Matplotlib | Histogram Normalization | Python Data Visualization
Abstract: This paper thoroughly examines the core concepts of histogram normalization in Matplotlib, explaining the principles behind area normalization implemented by the normed/density parameters, and demonstrates through concrete code examples how to convert histograms to height normalization. The article details the impact of bin width on normalization, compares different normalization methods, and provides complete implementation solutions.
Fundamental Concepts of Histogram Normalization
In data visualization, histogram normalization is a common requirement, but Matplotlib's implementation often confuses beginners. Calling plt.hist(k, density=True), or plt.hist(k, normed=1) in older versions, performs area normalization rather than height normalization: the total area of the bars equals 1, while the bar heights themselves generally do not sum to 1.
Mathematical Principles of Area Normalization
Consider the example of array k=(3,3,3,3). When density=True is set, Matplotlib calculates the height of each bin such that:
∑(bin height × bin width) = 1
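In practice, bin width is determined by the data range and the number of bins. For k=(3,3,3,3) every value is identical, so the data range has zero width; NumPy, which Matplotlib uses for binning, widens it to [2.5, 3.5]. With the default 10 bins, each bin is therefore 0.1 wide. All four values fall into a single bin and the other nine bins are empty, so the one non-zero height must satisfy:
0.1 × height = 1
This gives height = 10, which explains why the maximum y-value reaches 10: because the bin is narrow, a large height is needed to make the total area equal to 1.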
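The arithmetic above can be checked numerically. The following is a minimal sketch that uses np.histogram, on the assumption that it yields the same bin edges and heights as plt.hist with the same arguments; the variable names are chosen only for illustration.

```python
import numpy as np

k = (3, 3, 3, 3)

# Same binning that plt.hist(k, density=True) relies on (default of 10 bins).
heights, edges = np.histogram(k, bins=10, density=True)
widths = np.diff(edges)

print(edges[0], edges[-1])        # outer edges of the automatically chosen range
print(widths[0])                  # width of a single bin
print(heights.max())              # height of the only non-empty bin
print(np.sum(heights * widths))   # total area under the histogram
```

The last value is the quantity that density=True normalizes; the sum of the heights alone is not constrained.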
Converting from Area Normalization to Height Normalization
To achieve height normalization (where the sum of all bar heights equals 1), post-processing of the returned histogram object is required. Matplotlib's plt.hist() function returns three values:
x, bins, patches = plt.hist(k, density=True)
Here x holds the height of each bin, bins holds the bin edges, and patches contains the individual bar objects. Height normalization can be implemented with:
total_height = sum(x)
# Rescale every bar so that the heights sum to 1
for patch in patches:
    patch.set_height(patch.get_height() / total_height)
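To see what this rescaling does to the numbers, the sketch below applies the same division to the heights directly, without drawing anything. It assumes np.histogram returns the same heights as plt.hist for identical arguments, and the sample data is illustrative.

```python
import numpy as np

k = (1, 4, 3, 1)

# Density heights, as plt.hist(k, density=True) would return them in x.
density_heights, edges = np.histogram(k, bins=10, density=True)

# Dividing by the sum is the same rescaling applied to each patch above.
height_normalized = density_heights / density_heights.sum()

print(density_heights.sum())     # generally not 1
print(height_normalized.sum())   # 1 up to floating-point rounding
print(height_normalized.max())   # share of samples in the fullest bin
```

Because two of the four samples fall into the first bin, the tallest rescaled bar represents half of the data.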
Alternative Approach Using Weights Parameter
Another method for height normalization uses the weights parameter:
import numpy as np
weights = np.ones_like(k) / len(k)
plt.hist(k, weights=weights)
This approach assigns each data point a weight of 1/N, where N is the number of samples, so each bar's height is the fraction of samples in its bin and the heights sum to 1. It differs from the density parameter in that no bin-width (area) calculation is involved: the weights are simply summed per bin.
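As a rough check, the weighted heights can be compared with rescaled density heights. The sketch below uses np.histogram as a stand-in for plt.hist and assumes equal-width bins, which is the case where the two approaches are expected to agree.

```python
import numpy as np

k = np.array([1, 4, 3, 1])

# Each sample contributes 1/N, so a bar's height is the fraction of samples in its bin.
weights = np.ones(len(k)) / len(k)
weighted_heights, edges = np.histogram(k, bins=10, weights=weights)

# Density heights rescaled to sum to 1, for comparison (equal-width bins assumed).
density_heights, _ = np.histogram(k, bins=10, density=True)
rescaled = density_heights / density_heights.sum()

print(weighted_heights.sum())                   # sum of bar heights
print(np.allclose(weighted_heights, rescaled))  # whether both methods agree
```

With bins of unequal width the two results would no longer coincide, since density heights are divided by each bin's own width.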
Practical Considerations
1. Version Compatibility: The normed parameter has been deprecated since Matplotlib 2.1 in favor of density and was removed in later releases (3.1 onward), so new code should use density=True.
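2. Impact of Bin Width: With density=True, each bar's height is count / (N × bin width), so heights are not bounded by 1 and grow as bins become narrower around concentrated data. Height-normalized bars are the fractions count / N, which always lie between 0 and 1 and sum to 1, although the individual values still depend on how the data is binned.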
3. Visualization Purpose: Area normalization is more suitable for probability density estimation, while height normalization is better for comparing frequency distributions across different datasets.
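The effect of bin width can be illustrated with the earlier k=(3,3,3,3) data. This is a minimal sketch: peak_heights is a hypothetical helper, the range (2.5, 3.5) is fixed by hand so that only the bin count changes, and np.histogram stands in for plt.hist.

```python
import numpy as np

k = (3, 3, 3, 3)

def peak_heights(data, n_bins, value_range=(2.5, 3.5)):
    """Return (peak density height, peak fraction height) for a given bin count."""
    density, _ = np.histogram(data, bins=n_bins, range=value_range, density=True)
    counts, _ = np.histogram(data, bins=n_bins, range=value_range)
    fractions = counts / counts.sum()
    return density.max(), fractions.max()

# Narrower bins raise the density peak, while the fraction peak stays put.
for n_bins in (10, 100):
    peak_density, peak_fraction = peak_heights(k, n_bins)
    print(n_bins, peak_density, peak_fraction)
```

For data concentrated on a single value, the density peak is the reciprocal of the bin width, whereas the height-normalized peak is the fraction of samples in that bin.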
Complete Implementation Example
The following code demonstrates complete implementations of both normalization methods:
import matplotlib.pyplot as plt
import numpy as np
# Sample data
k = (1, 4, 3, 1)
# Method 1: Area normalization (probability density)
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
x1, bins1, patches1 = plt.hist(k, density=True, alpha=0.7, color='blue')
plt.title("Area Normalization (density=True)")
plt.xlabel("Value")
plt.ylabel("Probability Density")
# Method 2: Height normalization
plt.subplot(1, 2, 2)
x2, bins2, patches2 = plt.hist(k, density=True, alpha=0.7, color='green')
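total_height = sum(x2)
# Rescale every bar so that the heights sum to 1
for patch in patches2:
    patch.set_height(patch.get_height() / total_height)
# The y-limits were chosen for the density heights, so fit them to the rescaled bars
plt.ylim(0, 1.05 * max(x2) / total_height)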
plt.title("Height Normalization (Post-processing)")
plt.xlabel("Value")
plt.ylabel("Normalized Frequency")
plt.tight_layout()
plt.show()
Conclusion
Matplotlib's histogram normalization functionality offers flexible options but requires clear understanding of the mathematical principles behind different parameters. Area normalization (density=True) is suitable for probability density estimation, while height normalization requires post-processing or the weights parameter. In practical applications, appropriate normalization methods should be selected based on specific requirements, with attention to version differences affecting code compatibility.