Keywords: Pandas | DataFrame | Python Data Processing | from_records | List Conversion
Abstract: This article examines a common pitfall when creating DataFrames with pandas: the difference between the from_records method and the DataFrame constructor. Through concrete code examples, it explains why a list of strings can be parsed as multiple columns and shows the correct alternative. The article also compares the scenarios each creation method is suited to, helping developers avoid similar errors.
Problem Background and Phenomenon Analysis
In Python data analysis, the pandas DataFrame is one of the most commonly used data structures. Many developers habitually use the pd.DataFrame.from_records() method to create DataFrames from lists, but this method raises unexpected errors when the list contains plain strings of more than one character.
Consider the following two seemingly similar examples:
# Example 1: Working correctly
import pandas as pd
test_list = ['a','b','c','d']
df_test = pd.DataFrame.from_records(test_list, columns=['my_letters'])
print(df_test)
The above code executes correctly and outputs a DataFrame with a single column. However, the same method fails on a list of longer numeric strings:
# Example 2: Producing error
q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
The call fails with the message "1 columns passed, passed data had 9 columns": one column name was supplied, but pandas found 9 columns in the data. Older pandas versions raise this as an AssertionError; more recent versions raise a ValueError with the same message.
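To check which exception a given installation raises, the failing call can be wrapped in a try/except. This is a minimal sketch: catching both AssertionError and ValueError is an assumption made here to cover older and newer pandas releases, and the shortened q_list is illustrative.

```python
import pandas as pd

q_list = ['112354401', '116115526']

try:
    pd.DataFrame.from_records(q_list, columns=['q_data'])
    error = None
except (AssertionError, ValueError) as exc:
    # Assumption: the exception class differs between pandas versions.
    error = exc

# Show the exception type and message reported by this pandas version.
print(type(error).__name__, error)
```

The message text, not the exception class, is the stable part to look for when diagnosing this problem.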
Root Cause Analysis
The difference comes from how from_records interprets its input. The method expects each element of the list to be a record, that is, a sequence of field values. A string is itself a sequence of characters, so pandas iterates over each string and treats every character as a separate field rather than treating the string as a single value.
For a list of single-character strings like ['a','b','c','d'], each string has length 1 and is parsed as 1 column, which matches the specified columns=['my_letters'] parameter. Example 1 therefore works only because of the string length. For longer strings like '112354401' (length 9), the method splits each one into 9 characters, corresponding to 9 columns, which conflicts with the single column name supplied.
This behavior follows from the primary purpose of from_records: processing structured record data where each element is typically a tuple or dictionary that explicitly provides multiple columns of information. Passing a flat list of scalars to it works only in special cases and easily leads to confusion.
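The character splitting can be observed directly. The sketch below, assuming a reasonably recent pandas version, first mimics the splitting with Python's built-in tuple() and then calls from_records without a columns argument, so no column-count check is triggered. The names fields and df_split are illustrative.

```python
import pandas as pd

q_list = ['112354401', '116115526', '114909312']

# Iterating over a string yields its characters, so one string gives nine fields.
fields = tuple(q_list[0])
print(len(fields))

# Without columns=..., from_records creates one column per character.
df_split = pd.DataFrame.from_records(q_list)
print(df_split.shape)
```

With three strings of nine characters each, the resulting frame should have three rows and nine single-character columns, which is exactly the column count named in the error message.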
Correct Solution
For simple single-column DataFrame creation, it is recommended to use the DataFrame constructor directly:
# Correct method
q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
df_correct = pd.DataFrame(q_list, columns=['q_data'])
print(df_correct)
Output result:
      q_data
0  112354401
1  116115526
2  114909312
3  122425491
4  131957025
5  111373473
This method directly uses list elements as row data in the DataFrame, avoiding the issue of strings being incorrectly parsed into multiple columns.
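If from_records has to stay in place, for example because surrounding code already relies on it, two alternatives give the same single-column result. This is a sketch under that assumption, not a recommendation over the constructor; the names df_wrapped and df_series are illustrative.

```python
import pandas as pd

q_list = ['112354401', '116115526', '114909312']

# Option 1: wrap each string in a 1-tuple so it counts as a single field.
df_wrapped = pd.DataFrame.from_records([(q,) for q in q_list], columns=['q_data'])

# Option 2: build a named Series and promote it to a one-column DataFrame.
df_series = pd.Series(q_list, name='q_data').to_frame()

print(df_wrapped.shape, df_series.shape)
```

Both variants keep each string intact as one value, because the record boundary is now explicit.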
Method Comparison and Extended Applications
Applicable Scenarios for from_records: This method is most suitable for processing already structured record data, for example:
# Processing list of tuples
records = [('Alice', 25), ('Bob', 30), ('Charlie', 35)]
df_records = pd.DataFrame.from_records(records, columns=['Name', 'Age'])
Or processing a list of dictionaries:
# Processing list of dictionaries
dict_list = [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
df_dict = pd.DataFrame.from_records(dict_list)
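With dictionaries, the column names come from the keys, so no columns argument is needed. The sketch below also shows that passing columns can be used to choose the column order; df_age_first is an illustrative name, and the default column order may depend on the pandas and Python versions in use.

```python
import pandas as pd

dict_list = [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]

# Column names are taken from the dictionary keys.
df_dict = pd.DataFrame.from_records(dict_list)
print(list(df_dict.columns))

# Passing columns selects the keys to use and fixes their order.
df_age_first = pd.DataFrame.from_records(dict_list, columns=['age', 'name'])
print(list(df_age_first.columns))
```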
Advantages of DataFrame Constructor: For one-dimensional data, the constructor is more intuitive and reliable, because each list element becomes one row value. It can also handle nested (two-dimensional) lists, where each inner list becomes one row:
# Creating from multi-dimensional list
test_list = [['a','b','c'], ['AA','BB','CC']]
df_multi = pd.DataFrame(test_list, columns=['col_A', 'col_B', 'col_C'])
Creating from Multiple Lists: To create a multi-column DataFrame from several lists of equal length, the zip function can be used to pair their elements into rows:
# Combining multiple lists
lstA = [1, 2, 3]
lstB = ['A', 'B', 'C']
df_combined = pd.DataFrame(list(zip(lstA, lstB)), columns=['Number', 'Letter'])
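An alternative to zip is to pass a dictionary that maps column names to lists. The sketch below compares the two approaches on the same data; df_zip and df_from_dict are illustrative names.

```python
import pandas as pd

lstA = [1, 2, 3]
lstB = ['A', 'B', 'C']

# Row-wise: zip pairs the i-th elements of each list into one record.
df_zip = pd.DataFrame(list(zip(lstA, lstB)), columns=['Number', 'Letter'])

# Column-wise: each dictionary key names a column, each list supplies its values.
df_from_dict = pd.DataFrame({'Number': lstA, 'Letter': lstB})

print(df_zip.values.tolist() == df_from_dict.values.tolist())
```

One practical difference: zip silently stops at the shortest list, whereas the dictionary form raises a ValueError when the lists differ in length, which can make length mismatches easier to notice.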
Best Practice Recommendations
Based on the above analysis, we summarize the following best practices:
- Single Column Simple Data: Prefer the pd.DataFrame(data, columns=[...]) constructor
- Structured Records: Use from_records for data made of tuples or dictionaries
- Multiple Data Source Combination: Use zip with the constructor to create multi-column DataFrames
- Data Type Attention: Take particular care with string data, since from_records splits each string into characters and the resulting column count depends on string length
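These recommendations can be folded into a small guard function. The helper below is hypothetical (list_to_frame is not part of pandas) and only sketches one way to route flat lists of scalars through the constructor while rejecting record-like input.

```python
import pandas as pd

def list_to_frame(values, column):
    """Hypothetical helper: one-column DataFrame from a flat list of scalars."""
    values = list(values)
    # Records (tuples, lists, dicts) belong with from_records, not here.
    if any(isinstance(v, (tuple, list, dict)) for v in values):
        raise TypeError("expected scalars; use from_records for record data")
    return pd.DataFrame(values, columns=[column])

df_ids = list_to_frame(['112354401', '116115526'], 'q_data')
print(df_ids.shape)
```

Strings pass through as single values here, regardless of their length.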
By understanding the underlying mechanisms of different creation methods, developers can more flexibly and efficiently build pandas DataFrames, avoiding common pitfalls and errors.