Calculating Cumulative Distribution Function for Discrete Data in Python

Keywords: Python | Cumulative Distribution Function | Discrete Data | NumPy | Matplotlib

Abstract: This article details how to compute the Cumulative Distribution Function (CDF) for discrete data in Python using NumPy and Matplotlib. It covers methods such as sorting data and using np.arange to calculate cumulative probabilities, with code examples and step-by-step explanations to aid in understanding CDF estimation and visualization.

The Cumulative Distribution Function (CDF) is a fundamental concept in statistics that describes the probability of a random variable being less than or equal to a specific value. For discrete data, the CDF can be estimated empirically: the empirical CDF at a value x is the fraction of observations that are less than or equal to x. This estimate is widely used in data analysis and machine learning, for example to compare distributions or to read off quantiles. This article explores how to implement this calculation in Python, primarily using the NumPy and Matplotlib libraries.
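As a minimal sketch of that definition, the empirical CDF at a single point can be computed as the fraction of observations that do not exceed it. The helper name ecdf_at and the small sample below are illustrative choices, not part of any library.

```python
import numpy as np

def ecdf_at(sample, x):
    """Return the fraction of observations in `sample` that are <= x."""
    sample = np.asarray(sample)
    # The boolean mask is True where an observation is <= x;
    # its mean is the proportion of True entries.
    return np.mean(sample <= x)

sample = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
p_at_4 = ecdf_at(sample, 4)  # 6 of the 10 values are <= 4, so this is 0.6
print(p_at_4)
```

This direct form costs one pass over the data per query point, which is why the methods below sort the data once and reuse the result.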

Method 1: Sorting and Using np.arange

A straightforward and efficient way to compute the empirical CDF is by sorting the data and then using NumPy's np.arange function to generate cumulative probabilities. This method is suitable for most discrete datasets and is easy to implement. Here is a detailed step-by-step explanation with code:

import numpy as np
import matplotlib.pyplot as plt

# Generate sample data from a standard normal distribution
data = np.random.randn(1000)
# Sort the data in ascending order for cumulative probability calculation
sorted_data = np.sort(data)
# Calculate cumulative probabilities: use np.arange to create indices and normalize
cumulative_prob = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
# Plot the CDF
plt.plot(sorted_data, cumulative_prob)
plt.xlabel('Data Values')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Cumulative Distribution Function')
plt.grid(True)
plt.show()

In this code, we first generate 1000 random data points from a standard normal distribution using np.random.randn. The data is then sorted in ascending order with np.sort, which is essential for CDF computation. Next, np.arange(1, len(sorted_data) + 1) / len(sorted_data) creates an array holding the cumulative probability for each sorted data point, ranging from 1/n to 1. Finally, Matplotlib is used to plot the sorted data against the cumulative probability, visually presenting the CDF curve. This method gives every observation a probability mass of 1/n and makes no assumption about the spacing of the values. When the data contain repeated values, the same x value appears several times with increasing probabilities, so the line plot shows a vertical segment at that value. The curve is still correct, but Method 2 gives a cleaner step-function representation in that case.
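The sorted array can also be used to evaluate the empirical CDF at arbitrary query points, including data with ties. The sketch below uses np.searchsorted for this. The function name ecdf_values and the sample values are assumptions made for illustration.

```python
import numpy as np

def ecdf_values(data, query):
    """Evaluate the empirical CDF of `data` at each point in `query`."""
    sorted_data = np.sort(np.asarray(data))
    # side='right' returns the number of observations <= each query point,
    # so repeated values equal to the query point are all counted.
    counts = np.searchsorted(sorted_data, query, side='right')
    return counts / sorted_data.size

data = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
probs = ecdf_values(data, [0, 2, 4.5, 7])
# 0, 3, 6 and 10 of the 10 observations are <= 0, 2, 4.5 and 7 respectively
print(probs)
```

Using side='right' matters for discrete data: with side='left' the count would exclude observations equal to the query point and return P(X < x) instead of P(X <= x).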

Method 2: Using Unique Values and Cumulative Sum

For discrete data with repeated values, another approach involves using unique values and their frequencies to compute the CDF. This method produces a step-function representation of the CDF, providing a more accurate reflection of the data distribution. Here is the implementation code:

import numpy as np
import matplotlib.pyplot as plt

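def ecdf(a):
    # Sorted unique values and the number of times each one occurs
    x, counts = np.unique(a, return_counts=True)
    # Running total of observations less than or equal to each unique value
    cusum = np.cumsum(counts)
    # Normalize by the total count so the final probability is 1
    return x, cusum / cusum[-1]

def plot_ecdf(a):
    x, y = ecdf(a)
    # Prepend the point (x[0], 0) so the curve starts at probability zero
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.0)
    # 'steps-post' holds each level constant until the next unique value
    plt.plot(x, y, drawstyle='steps-post')
    plt.xlabel('Data Values')
    plt.ylabel('Cumulative Probability')
    plt.title('Step Function Empirical CDF')
    plt.grid(True)
    plt.show()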

# Example usage
example_data = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
plot_ecdf(example_data)

In this method, np.unique returns the sorted unique values in the data together with their occurrence counts. The cumulative sum of the counts, divided by the total number of observations, gives the cumulative probability at each unique value. To draw the step function correctly, the point (x[0], 0) is prepended to the arrays so that the curve starts at zero, and drawstyle='steps-post' holds each probability level constant until the next unique value, where the curve jumps. This approach is particularly useful for integer-valued or ordinal data and for any dataset with many repeated values.
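To make the numbers concrete, the sketch below repeats the ecdf function without the plotting code and prints the cumulative probability at each unique value of the example data.

```python
import numpy as np

def ecdf(a):
    # Sorted unique values and their occurrence counts
    x, counts = np.unique(a, return_counts=True)
    # Cumulative counts, normalized so the last entry is 1
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

example_data = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
x, y = ecdf(example_data)

# The counts are 1, 2, 3, 1, 3 for the values 1, 2, 4, 5.5, 7
for value, prob in zip(x, y):
    print(f"P(X <= {value:g}) = {prob:.1f}")
# P(X <= 1) = 0.1
# P(X <= 2) = 0.3
# P(X <= 4) = 0.6
# P(X <= 5.5) = 0.7
# P(X <= 7) = 1.0
```

Each jump in the step plot equals the relative frequency of the corresponding value. For example, the jump of 0.3 at 4 reflects its three occurrences among ten observations.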

Overview of Other Methods

Beyond these methods, other Python libraries offer built-in functions for CDF calculation. For instance, the statsmodels library provides an ECDF class (in statsmodels.distributions.empirical_distribution) for quick empirical CDF generation, while the scipy.stats.cumfreq function returns a cumulative frequency histogram and gives control over binning. These options are convenient but add dependencies, and cumfreq requires choosing the number of bins and their range, so its result is a binned approximation of the empirical CDF. The choice of method should be based on data characteristics and application requirements.
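As a rough illustration of the binned approach, the sketch below calls scipy.stats.cumfreq on a small sample. It assumes SciPy is installed. The bin count of 4 and the range (0, 8) are arbitrary choices made for this example, not recommended defaults.

```python
import numpy as np
from scipy import stats

data = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])

# Four equal-width bins over [0, 8]: [0, 2), [2, 4), [4, 6), [6, 8]
res = stats.cumfreq(data, numbins=4, defaultreallimits=(0, 8))

# Upper edge of each bin, reconstructed from the returned lower limit and bin size
upper_edges = res.lowerlimit + res.binsize * np.arange(1, res.cumcount.size + 1)

# Convert cumulative counts to cumulative proportions
cum_prob = res.cumcount / data.size

for edge, prob in zip(upper_edges, cum_prob):
    print(f"up to bin ending at {edge:g}: {prob:.1f}")
```

Because the values are grouped into bins, the result depends on where the bin edges fall. The unique-value method in Method 2 avoids this by placing a step at every distinct value.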

In summary, by sorting data and calculating cumulative probabilities, we can effectively estimate the CDF for discrete data. These methods are practical for data exploration and model validation, and readers can select the most appropriate approach based on their specific context.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
