Keywords: pandas | data_concatenation | Excel_processing | performance_optimization | Python_programming
Abstract: This article provides a comprehensive guide to handling multiple Excel files in Python using pandas. It analyzes common pitfalls and presents optimized solutions, focusing on the efficient approach of collecting DataFrames in a list followed by single concatenation. The content compares performance differences between methods and offers solutions for handling disparate column structures, supported by detailed code examples.
Problem Context and Common Mistakes
When processing batches of Excel files, developers often need to combine data from multiple files into a unified DataFrame. A frequent erroneous attempt involves using the DataFrame.append() method directly within a loop, as shown in the original question:
for infile in glob.glob("*.xlsx"):
    data = pandas.read_excel(infile)
    appended_data = pandas.DataFrame.append(data)  # Incorrect usage
This approach fails because append() is an instance method: invoked through the class as pandas.DataFrame.append(data), the data argument is bound as self and no DataFrame is actually passed in to append, so the call raises a TypeError. The loop also overwrites appended_data on every iteration instead of accumulating results. More fundamentally, each append() call copies all existing rows into a brand-new DataFrame, causing significant performance degradation and memory waste as the data grows; for this reason the method was deprecated in pandas 1.4 and removed in pandas 2.0.
Efficient Solution: List Collection and Single Concatenation
The optimal approach involves collecting all DataFrames in a list and then performing a single concatenation using pd.concat():
import pandas as pd
import glob

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    appended_data.append(data)  # cheap: stores a reference, copies nothing

# Combine everything in a single pass, renumbering the rows
final_df = pd.concat(appended_data, ignore_index=True)
final_df.to_excel('appended.xlsx')
This method offers substantially better memory efficiency, as it only performs simple list appends during the loop, deferring the actual data combination to the final step.
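Since pd.concat() accepts any iterable of DataFrames, the same pattern can be written more compactly with a generator expression; this is a stylistic variant of the loop above, not a different algorithm:

import glob

import pandas as pd

# Compact equivalent: read each file and concatenate in one expression
final_df = pd.concat(
    (pd.read_excel(f) for f in sorted(glob.glob("*.xlsx"))),
    ignore_index=True,
)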
Performance Analysis and Comparison
Compared to methods that perform repeated concatenations within the loop, the list collection strategy demonstrates clear performance advantages:
- Memory Usage: Avoids creating new DataFrame copies in each iteration
- Execution Time: pd.concat() is optimized for batch operations
- Code Readability: Clear logic that is easy to understand and maintain
For large datasets, the gap can span orders of magnitude: concatenating inside the loop re-copies every previously accumulated row on each iteration (quadratic in the total row count), while a single concatenation copies each row only once. The benefit becomes particularly pronounced when handling hundreds of files or files containing thousands of rows each.
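The scale of the difference is easy to verify with a minimal benchmark sketch. It uses synthetic in-memory DataFrames rather than real Excel files, and the frame count and sizes are arbitrary assumptions, so absolute timings will vary by machine; the relative gap is what matters:

import time

import pandas as pd

# Synthetic data: 200 frames of 1,000 rows each
frames = [pd.DataFrame({'A': range(1000), 'B': range(1000)}) for _ in range(200)]

# Anti-pattern: concatenate inside the loop, re-copying all accumulated rows
start = time.perf_counter()
result = frames[0]
for df in frames[1:]:
    result = pd.concat([result, df], ignore_index=True)
print(f"concat in loop:   {time.perf_counter() - start:.3f}s")

# Recommended: collect first, concatenate once
start = time.perf_counter()
result = pd.concat(frames, ignore_index=True)
print(f"single pd.concat: {time.perf_counter() - start:.3f}s")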
Handling Different Column Names
When Excel files have inconsistent column structures, additional preprocessing steps are necessary:
import pandas as pd

# Assume three DataFrames with different column structures
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'D': [11, 12]})
dfs = [df1, df2, df3]

# Build the union of all column names; sort it so the final
# column order is deterministic (a plain set has no defined order)
all_columns = set()
for df in dfs:
    all_columns = all_columns.union(df.columns)

# Reindex each DataFrame to the full column set; columns a frame
# lacks are filled with NaN
df_list = []
for df in dfs:
    df_list.append(df.reindex(columns=sorted(all_columns)))

result = pd.concat(df_list, ignore_index=True)
print(result)
This approach ensures the final DataFrame contains every column that appears in any source, with missing values filled as NaN. Note that pd.concat() already performs this outer alignment on columns by default; the explicit reindex step mainly makes the column handling visible and gives you control over the final column order.
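A quick self-contained check confirms that default; only the column order may differ from the reindex-based version above:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'D': [11, 12]})

# concat performs an outer join on columns by default,
# filling missing entries with NaN
direct = pd.concat([df1, df2, df3], ignore_index=True)
print(direct)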
Practical Implementation Recommendations
In real-world projects, consider implementing these additional features:
- Error Handling: Wrap the per-file read in exception handling so that a problem in one file doesn't halt the entire process (a sketch combining this with progress logging follows this list)
- Progress Indication: Include progress bars or logging for large file sets
- Memory Monitoring: Track memory usage when processing extremely large datasets
- Data Validation: Check data quality and consistency before concatenation
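As a concrete starting point, here is a minimal sketch combining the first two recommendations: per-file error handling and simple progress logging. The file pattern, log format, and output name are illustrative assumptions, not fixed requirements:

import glob
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

frames = []
files = sorted(glob.glob("*.xlsx"))
for i, infile in enumerate(files, start=1):
    try:
        frames.append(pd.read_excel(infile))
        logging.info("Read %s (%d/%d)", infile, i, len(files))
    except Exception as exc:  # keep going past unreadable files
        logging.warning("Skipping %s: %s", infile, exc)

if frames:
    final_df = pd.concat(frames, ignore_index=True)
    final_df.to_excel("appended.xlsx", index=False)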
By following these best practices, you can build data processing pipelines that are both efficient and robust.