A Comprehensive Guide to Calculating Summary Statistics of DataFrame Columns Using Pandas

Keywords: Pandas | DataFrame | Summary Statistics

Abstract: This article delves into how to compute summary statistics for each column in a DataFrame using the Pandas library. It begins by explaining the basic usage of the DataFrame.describe() method, which automatically calculates common statistical metrics for numerical columns, including count, mean, standard deviation, minimum, quartiles, and maximum. The discussion then covers handling columns with mixed data types, such as boolean and string values, and how to adjust the output format via transposition to meet specific requirements. Additionally, the pandas_profiling package is briefly mentioned as a more comprehensive data exploration tool, but the focus remains on the core describe method. Through practical code examples and step-by-step explanations, this guide provides actionable insights for data scientists and analysts.

Introduction

In data analysis and preprocessing, computing summary statistics for columns in a dataset is a fundamental and critical task. Pandas, a widely-used data manipulation library in Python, offers powerful and flexible tools to accomplish this. This article uses a specific DataFrame example to detail how to calculate column summary statistics with Pandas and generate structured output.

DataFrame Example and Problem Statement

Consider a DataFrame containing shopper information with the following structure:

shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,0,MX,
2,FALSE,1,0,MX,
3,FALSE,0,0,MX,
4,FALSE,22,0,MX,
5,FALSE,0,0,MX,
6,FALSE,0,0,MX,
7,FALSE,5,0,MX,
8,FALSE,0,0,MX,
9,FALSE,4,0,MX,
10,FALSE,2,0,MX,
11,FALSE,0,0,MX,
12,FALSE,13,0,MX,
13,FALSE,0,0,CA,
14,FALSE,0,0,US,

This DataFrame includes six columns: shopper_num (shopper number, integer), is_martian (whether Martian, boolean), number_of_items (number of items, integer), count_pineapples (pineapple count, integer), birth_country (birth country, string), and tranpsortation_method (transportation method, string, with some missing values). The goal is to compute summary statistics for each column and output a new DataFrame indexed by column names with metrics such as maximum, minimum, and median. For non-numeric columns like booleans, special handling may be needed, e.g., the median might return FALSE.

Calculating Summary Statistics with the describe Method

Pandas' DataFrame object provides a built-in method, describe(), for quickly computing common statistics for numerical columns. By default, describe() returns count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. For non-numeric columns, Pandas adjusts the output based on data type; for instance, boolean columns might show only count and unique values, while string columns may be omitted as they logically cannot be summarized numerically.

Here is how to apply describe() to the example DataFrame:

import pandas as pd

# Assuming df is the loaded DataFrame
summary = df.describe()
print(summary)

After executing this code, the output will resemble:

       shopper_num is_martian  number_of_items  count_pineapples
count      14.0000         14        14.000000                14
mean        7.5000          0         3.357143                 0
std         4.1833          0         6.452276                 0
min         1.0000      False         0.000000                 0
25%         4.2500          0         0.000000                 0
50%         7.5000          0         0.000000                 0
75%        10.7500          0         3.500000                 0
max        14.0000      False        22.000000                 0

From the output, describe() automatically handles numerical columns (e.g., shopper_num, number_of_items, and count_pineapples), computing full statistical metrics. For the boolean column is_martian, it displays count and min/max values (here False), but does not compute mean or percentiles, as booleans are treated as categorical data in Pandas. String columns birth_country and tranpsortation_method are excluded from the output since describe() defaults to numerical columns only.

Transposing Output to Match Target Format

The original problem requires outputting a DataFrame indexed by column names, for example:

columnname, max, min, median,
is_martian, NA, NA, FALSE

While the output from describe() is close to this format, it organizes statistics as rows and columns as columns. To flip this structure, use the transpose() method (or property .T) to swap rows and columns:

transposed_summary = df.describe().transpose()
print(transposed_summary)

The transposed output is as follows:

                 count      mean       std    min   25%  50%    75%    max
shopper_num         14       7.5    4.1833      1  4.25  7.5  10.75     14
is_martian          14         0         0  False     0    0      0  False
number_of_items     14  3.357143  6.452276      0     0    0    3.5     22
count_pineapples    14         0         0      0     0    0      0      0

Now, each row corresponds to a column name, and columns contain various statistical metrics. For the boolean column is_martian, the median (50th percentile) shows as 0, which corresponds to False (in boolean context, False is mapped to 0). If the original data included True values, the median might be 0.5 or another value, depending on the distribution. Note that for a boolean column entirely composed of False, as in this case, the median is naturally False, aligning with the example in the problem statement.

Handling Mixed Data Types and Missing Values

In practice, DataFrames may contain more complex combinations of data types. For instance, if the birth_country column needs to be summarized, using describe(include='all') can include all columns, but this typically provides only basic information like count, unique values, and top frequency for categorical or object columns, not numerical statistics. For missing values (e.g., empty strings in the tranpsortation_method column), describe() handles them automatically, excluding them from the count, but other statistics are computed based on non-missing values.

For finer control over output, one can combine groupby with aggregation functions, though this is generally used for grouped statistics rather than overall column summaries. For example, the following code demonstrates custom statistical functions for specific columns:

# Example: Custom statistics for numerical columns
custom_stats = df.agg({
    'shopper_num': ['count', 'min', 'max', 'median'],
    'number_of_items': ['count', 'min', 'max', 'median']
})
print(custom_stats)

However, for simple column summaries, the describe() method is usually sufficient and more efficient.

Supplementary Tool: The pandas_profiling Package

Beyond the built-in describe() method, community-developed tools like the pandas_profiling package offer more comprehensive data exploration. This package generates detailed reports including missing value analysis, data distribution histograms, correlation matrices, and more. Usage is as follows:

import pandas_profiling
report = pandas_profiling.ProfileReport(df)
report.to_file("output.html")  # Save as an HTML report

While pandas_profiling is powerful, it is better suited for preliminary data exploration and generating visual reports rather than simple column statistical summaries. For most summary statistic needs, the describe() method is preferred due to its simplicity and speed.

Conclusion

This article has detailed how to calculate summary statistics for DataFrame columns using Pandas. The core approach leverages DataFrame.describe(), which automatically processes numerical columns and provides a rich set of statistical metrics. By transposing the output, one can easily adjust the format to match specific requirements, such as a table indexed by column names. For columns with mixed data types like booleans or strings, describe() handles them appropriately, though users should be aware of its limitations. Additionally, the pandas_profiling package serves as a supplementary tool for deeper data analysis. Mastering these techniques will empower data practitioners to efficiently perform data preprocessing and exploratory analysis.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.