Efficiently Combining Pandas DataFrames in Loops Using pd.concat

Nov 20, 2025 · Programming

Keywords: pandas | data_concatenation | Excel_processing | performance_optimization | Python_programming

Abstract: This article provides a comprehensive guide to handling multiple Excel files in Python using pandas. It analyzes common pitfalls and presents optimized solutions, focusing on the efficient approach of collecting DataFrames in a list followed by single concatenation. The content compares performance differences between methods and offers solutions for handling disparate column structures, supported by detailed code examples.

Problem Context and Common Mistakes

When processing batches of Excel files, developers often need to combine data from multiple files into a unified DataFrame. A frequent erroneous attempt involves using the DataFrame.append() method directly within a loop, as shown in the original question:

for infile in glob.glob("*.xlsx"):
    data = pandas.read_excel(infile)
    appended_data = pandas.DataFrame.append(data) # Incorrect usage

This approach fails because append() is an instance method: it must be called on an existing DataFrame with the new DataFrame as its argument, and its return value must be captured, since it does not modify the caller in place. More fundamentally, DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. Even where it is still available, each call creates a new DataFrame that copies all rows accumulated so far, so building a result this way wastes memory and degrades performance badly as the number of files grows.
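Even with the call corrected, combining inside the loop remains the underlying problem. A minimal sketch of that antipattern, using small in-memory DataFrames in place of Excel reads:

```python
import pandas as pd

# Stand-ins for the DataFrames read from each Excel file
frames = [pd.DataFrame({"x": [i, i + 1]}) for i in range(5)]

# Antipattern: every pd.concat call copies all rows accumulated so far,
# so total copying grows quadratically with the number of files
appended_data = frames[0]
for data in frames[1:]:
    appended_data = pd.concat([appended_data, data], ignore_index=True)

print(len(appended_data))  # 10
```

The result is correct, but each iteration recopies everything already combined, which is exactly what the list-based approach below avoids.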

Efficient Solution: List Collection and Single Concatenation

The optimal approach involves collecting all DataFrames in a list and then performing a single concatenation using pd.concat():

import pandas as pd
import glob

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    appended_data.append(data)

final_df = pd.concat(appended_data, ignore_index=True)
final_df.to_excel('appended.xlsx', index=False)  # index=False omits the row-index column

This method offers substantially better memory efficiency, as it only performs simple list appends during the loop, deferring the actual data combination to the final step.

Performance Analysis and Comparison

Compared with repeated concatenation inside the loop, the list-collection strategy has a clear performance advantage: each loop iteration is a cheap list append, and all rows are copied exactly once in the final pd.concat() call, for O(n) total copying. In-loop concatenation instead recopies every accumulated row on each iteration, for O(n²) total work. For large datasets these differences can reach orders of magnitude, and they become particularly pronounced when handling hundreds of files or files containing thousands of rows each.
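A rough benchmark sketch of the two strategies on synthetic DataFrames (file I/O omitted; absolute timings will vary by machine, but the gap is consistent):

```python
import time
import pandas as pd

# 200 synthetic "files" of 1,000 rows each
frames = [pd.DataFrame({"x": range(1000)}) for _ in range(200)]

# Strategy 1: repeated concatenation inside the loop (quadratic copying)
start = time.perf_counter()
slow = frames[0]
for f in frames[1:]:
    slow = pd.concat([slow, f], ignore_index=True)
t_loop = time.perf_counter() - start

# Strategy 2: collect in a list, concatenate once (linear copying)
start = time.perf_counter()
fast = pd.concat(frames, ignore_index=True)
t_once = time.perf_counter() - start

print(f"loop concat: {t_loop:.3f}s, single concat: {t_once:.3f}s")
```

Both strategies produce identical results; only the amount of copying differs.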

Handling Different Column Names

When Excel files have inconsistent column structures, an explicit normalization step gives you control over the final column set and order:

import pandas as pd

# Assume three DataFrames with different column structures
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'D': [11, 12]})

dfs = [df1, df2, df3]
df_list = []

# Get the union of all column names (sorted for a deterministic order,
# since iterating a plain set would give a different order on each run)
all_columns = sorted(set().union(*(df.columns for df in dfs)))

# Reindex each DataFrame to ensure column consistency
for df in dfs:
    df_reindexed = df.reindex(columns=all_columns)
    df_list.append(df_reindexed)

result = pd.concat(df_list, ignore_index=True)
print(result)

This approach ensures the final DataFrame contains all columns that appear in any source, with missing values filled as NaN.
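Note that pd.concat itself already aligns on the union of columns by default (join='outer'), so the explicit reindexing above is only needed when you want to control the column order or set. A minimal check of the default behavior:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

# With the default join='outer' and sort=False, concat keeps the union of
# columns in order of first appearance and fills missing entries with NaN
result = pd.concat([df1, df2], ignore_index=True)
print(list(result.columns))  # ['A', 'B', 'C']
```

Passing join='inner' instead would keep only the columns common to all inputs (here, just 'A').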

Practical Implementation Recommendations

In real-world projects, consider adding error handling for unreadable or corrupt files, a column recording each row's source file, and logging of any inputs that were skipped. By following these practices, you can build data processing pipelines that are both efficient and robust.
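These recommendations can be sketched as a small reusable function. The glob pattern, the source_file column name, and the broad exception handling are illustrative choices, not fixed conventions:

```python
import glob
import pandas as pd

def combine_excel_files(pattern="*.xlsx"):
    """Read all matching Excel files into one DataFrame, skipping unreadable ones."""
    frames = []
    for path in glob.glob(pattern):
        try:
            df = pd.read_excel(path)
        except Exception as exc:  # e.g. a corrupt or locked file
            print(f"Skipping {path}: {exc}")
            continue
        df["source_file"] = path  # record each row's provenance
        frames.append(df)
    if not frames:
        raise ValueError(f"No readable files matched {pattern!r}")
    return pd.concat(frames, ignore_index=True)
```

Raising on an empty result surfaces configuration mistakes (a wrong directory or pattern) immediately instead of silently producing an empty output file.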

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.