Keywords: Pandas | DataFrame | append | concat | performance_optimization
Abstract: This article analyzes a common pitfall when appending DataFrames in Pandas: the target DataFrame remains empty because append returns a new object instead of modifying the original in place. By comparing the original code with optimized alternatives, it presents an efficient pattern of collecting DataFrames in a list and performing a single concat operation, and reviews API changes across Pandas versions to help readers avoid common performance pitfalls.
Problem Analysis
In data processing workflows, it's common to merge multiple DataFrames into one. The user's code example demonstrates reading data from CSV files in chunks, filtering rows that match URL patterns, and appending results to a target DataFrame. However, after execution, df_res remains empty despite res showing data when printed within the loop.
Behavior Characteristics of Append Method
Pandas' DataFrame.append method is designed to return a new DataFrame object rather than modifying the original object in-place. This means each call to df_res.append(res) actually creates a new DataFrame containing both existing and new data, while the original df_res remains unchanged. This explains why df_res stays empty after the loop completes.
The correct usage should reassign the returned new object to the variable:
df_res = df_res.append(res)
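The pitfall is easy to reproduce on toy data. The sketch below uses pd.concat (which exists in every pandas version, including 2.0 where append was removed) to show the same return-a-new-object behavior: the first call's result is thrown away, so the variable only changes after reassignment. The frames here are illustrative stand-ins for df_res and res.

```python
import pandas as pd

df_res = pd.DataFrame(columns=['url'])
res = pd.DataFrame({'url': ['example.com/a', 'example.com/b']})

# Same mistake as df_res.append(res): the returned object is discarded,
# so df_res itself is untouched.
ignored = pd.concat([df_res, res])
print(len(df_res))  # 0

# Reassigning captures the new object that the call returns.
df_res = pd.concat([df_res, res])
print(len(df_res))  # 2
```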
Performance Optimization Solutions
While the above correction solves the problem, iterative DataFrame appending is inefficient from a performance perspective: each append operation copies the entire accumulated DataFrame, giving O(n²) total work. A better approach is to collect the DataFrames to be merged in a list, then perform a single merge using pd.concat:
all_res = []
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        all_res.append(res)
df_res = pd.concat(all_res)
This method avoids repeated data copying and significantly improves efficiency when processing large-scale data.
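A self-contained sketch of the same pattern, using a small in-memory list of frames in place of the chunked reader and a single illustrative substring (df_all and substr here are toy stand-ins):

```python
import pandas as pd

# Toy stand-ins for the chunked CSV reader and the URL substring list
df_all = [
    pd.DataFrame({'url': ['a.com/x', 'b.com/y']}),
    pd.DataFrame({'url': ['b.com/z', 'c.com/w']}),
]
substr = ['b.com']

all_res = []                      # list.append is cheap: O(1) per call
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        all_res.append(res)

df_res = pd.concat(all_res)       # single copy at the end
print(df_res['url'].tolist())     # ['b.com/y', 'b.com/z']
```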
API Evolution and Best Practices
It's worth noting that starting from Pandas 2.0, the DataFrame.append method has been removed, with pd.concat recommended as the replacement. Even in earlier versions, it's advised to avoid using append in loops and instead adopt the list collection + concat pattern.
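For code that must run on pandas 2.0 and later, the drop-in replacement for a single append call looks like this (the two one-row frames are illustrative):

```python
import pandas as pd

df_res = pd.DataFrame({'url': ['a.com']})
res = pd.DataFrame({'url': ['b.com']})

# pandas >= 2.0: DataFrame.append is gone; this replaces the old
# `df_res = df_res.append(res)` line.
df_res = pd.concat([df_res, res], ignore_index=True)
print(df_res['url'].tolist())  # ['a.com', 'b.com']
```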
For scenarios requiring row-by-row data addition, a more efficient approach is to collect data into a list first, then create the DataFrame in one operation:
data = []
for item in data_source:
    data.append([item['col1'], item['col2'], item['col3']])
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
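A runnable version of this pattern, with a hypothetical data_source (any iterable of records works the same way):

```python
import pandas as pd

# Hypothetical record source; in practice this might be a file or API
data_source = [
    {'col1': 1, 'col2': 'a', 'col3': 0.5},
    {'col1': 2, 'col2': 'b', 'col3': 1.5},
]

data = []
for item in data_source:
    data.append([item['col1'], item['col2'], item['col3']])

# One DataFrame construction instead of one append per row
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
print(df.shape)  # (2, 3)
```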
Practical Application Example
Applied to the user's specific scenario, the complete optimized code is as follows:
import pandas as pd

# Read data in chunks; on_bad_lines='skip' replaces the deprecated
# error_bad_lines=False, which was removed in pandas 2.0
df_all = pd.read_csv('data.csv', on_bad_lines='skip', chunksize=1000000)
urls = pd.read_excel('url_june.xlsx')
substr = urls.url.values.tolist()

# Use a list to collect matching results
all_res = []
for df_chunk in df_all:
    for pattern in substr:
        # Match the URL as a literal substring (regex=False), since URLs often
        # contain regex metacharacters; na=False treats missing values as non-matches
        matched_rows = df_chunk[df_chunk['url'].str.contains(pattern, regex=False, na=False)]
        if not matched_rows.empty:
            all_res.append(matched_rows)

# Merge all results in one operation
if all_res:
    df_res = pd.concat(all_res, ignore_index=True)
else:
    df_res = pd.DataFrame()  # Return empty DataFrame if no matches found
This implementation not only solves the empty DataFrame problem but also provides better performance and code maintainability.
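The regex=False argument matters more than it may look: str.contains interprets its pattern as a regular expression by default, and real URLs routinely contain metacharacters such as '?' and '.'. The toy Series and pattern below are illustrative, showing a literal URL fragment that the default regex interpretation silently fails to find:

```python
import pandas as pd

s = pd.Series(['shop.example.com/item?id=1', None])
pattern = 'item?id'  # contains the regex metacharacter '?'

# As a regex, 'm?' means "optional m", so the literal URL is NOT found:
print(s.str.contains(pattern, na=False).any())                 # False

# regex=False matches the substring literally; na=False handles missing URLs:
print(s.str.contains(pattern, regex=False, na=False).any())    # True
```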