Keywords: Pandas | text splitting | data processing
Abstract: This article provides an in-depth exploration of techniques for splitting text columns containing delimiters into multiple rows using Pandas. Addressing the needs of large CSV file processing, it demonstrates core algorithms through practical examples, utilizing functions like split(), apply(), and stack() for text segmentation and row expansion. The article also compares performance differences between methods and offers optimization recommendations, equipping readers with practical skills for efficiently handling structured text data.
When working with large datasets, it is common to encounter scenarios where text columns need to be split into multiple rows based on specific delimiters. For instance, in a CSV file, a column might contain composite information separated by spaces or colons, requiring decomposition into independent rows for further analysis. This article will use a concrete case study to explain in detail how to achieve this functionality using the Pandas library.
Problem Context and Data Example
Consider a CSV file containing customer order information, where the Seatblocks column stores one or more seat blocks separated by spaces, and each seat block is itself a colon-separated string. For example:
CustNum  CustomerName     ItemQty  Item  Seatblocks                 ItemExt
32363    McCartney, Paul  3        F04   2:218:10:4,6               60
31316    Lennon, John     25       F01   1:13:36:1,12 1:13:37:1,13  300
The goal is to split the Seatblocks column by spaces, making each seat block an independent row while preserving other column information. For the Lennon, John record, two rows should be generated, corresponding to the two seat blocks.
Core Solution
Pandas offers robust string processing capabilities, which, combined with data reshaping methods, can efficiently implement text splitting. Below is the core code implementation based on the best answer:
import pandas as pd
from pandas import Series
# Create example DataFrame
df = pd.DataFrame({
'CustNum': [32363, 31316],
'CustomerName': ['McCartney, Paul', 'Lennon, John'],
'ItemQty': [3, 25],
'Item': ['F04', 'F01'],
'Seatblocks': ['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
'ItemExt': [60, 300]
})
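# Step 1: Split the Seatblocks column into one value per (row, position)
# apply(Series) needs no extra positional argument; dropna() discards the NaN
# padding for shorter rows in case stack() keeps it
s = df['Seatblocks'].str.split(' ').apply(Series).stack().dropna()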
# Step 2: Adjust index to match the original DataFrame
s.index = s.index.droplevel(-1)
s.name = 'Seatblocks'
# Step 3: Remove the original Seatblocks column and join the result
del df['Seatblocks']
result = df.join(s)
print(result)
After executing the above code, the output is as follows:
   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
1    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13
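The method works in four steps. str.split() turns each string into a list of pieces. apply(Series) expands each list into a row of a temporary DataFrame, padding shorter lists with NaN. stack() pivots those columns into a single Series with a two-level index of (original row label, position). Finally, once the position level has been dropped, join() aligns that Series back onto the original DataFrame by row label, which repeats the other columns for every piece.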
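To make the intermediate shapes concrete, the following minimal sketch runs the same chain one step at a time on just the Seatblocks values. The variable names (lists, wide, stacked) are illustrative only, and the explicit dropna() is a precaution, since whether stack() discards NaN padding depends on the pandas version and options in use.

```python
import pandas as pd

# Same sample values as in the article, reduced to the column being split
seatblocks = pd.Series(['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
                       name='Seatblocks')

# Series of lists, one list per original row
lists = seatblocks.str.split(' ')

# One column per list position; the shorter first row is padded with NaN
wide = lists.apply(pd.Series)

# Long format with a (row label, position) MultiIndex, padding removed
stacked = wide.stack().dropna()

print(wide)
print(stacked)

# Dropping the position level leaves the original row labels, repeated per piece
print(list(stacked.index.droplevel(-1)))
```

The repeated row labels in the last line are what allow join() to duplicate the remaining columns for each seat block.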
Further Splitting and Performance Optimization
If further splitting of colon-separated strings into multiple columns is needed, the following method can be used:
# Split each seat block into multiple columns
expanded_result = df.join(s.apply(lambda x: Series(x.split(':'))))
print(expanded_result)
However, the lambda builds one Series object per row, which is slow on large datasets. Referring to other answers, more efficient implementations of the same colon split include:
# Method 1: Split once, then build the DataFrame from a plain list of lists
split_list = s.str.split(':').tolist()
expanded_df = pd.DataFrame(split_list, index=s.index)
# Method 2: Directly use the expand parameter of str.split()
expanded_df = s.str.split(':', expand=True)
Performance tests show that the tolist() method is approximately 40 times faster than the original method when processing 100,000 rows, while the expand=True parameter, though concise, is slightly less efficient.
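Timings like these depend heavily on the pandas version, the data, and the machine, so it is worth measuring locally. The sketch below is one way to do that with the standard timeit module. The synthetic data, the row count, and the function names via_apply, via_tolist, and via_expand are assumptions made for illustration, not part of the original benchmark.

```python
import timeit
import pandas as pd

# Synthetic input: one sample seat block repeated (row count chosen arbitrarily)
blocks = pd.Series(['1:13:36:1,12'] * 5_000)

def via_apply(col):
    # Builds one Series object per row
    return col.apply(lambda x: pd.Series(x.split(':')))

def via_tolist(col):
    # Splits once, then builds the frame from a plain list of lists
    return pd.DataFrame(col.str.split(':').tolist(), index=col.index)

def via_expand(col):
    # Lets str.split() build the frame directly
    return col.str.split(':', expand=True)

for fn in (via_apply, via_tolist, via_expand):
    seconds = timeit.timeit(lambda: fn(blocks), number=3)
    print(f'{fn.__name__}: {seconds:.3f}s for 3 runs')
```

All three functions produce the same four columns of string pieces, so the comparison isolates the cost of how the DataFrame is assembled.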
Technical Summary
1. String Splitting: Pandas' str.split() method splits text by a specified delimiter, returning a Series of lists (or a DataFrame when expand=True is passed).
2. Data Reshaping: apply(Series) combined with stack() can transform list sequences into multiple rows, achieving row expansion.
3. Index Alignment: Adjusting multi-level indices via droplevel() ensures that split data can be correctly joined with the original DataFrame.
4. Performance Optimization: Avoiding the creation of numerous Series objects within apply() and using tolist() or vectorized operations can significantly improve processing speed.
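The four points above can be folded into a small reusable helper. This is a sketch under stated assumptions: split_to_rows is a hypothetical name, it uses str.split(expand=True) in place of apply(Series), and it calls dropna() explicitly so that the padding is removed regardless of how stack() treats missing values.

```python
import pandas as pd

def split_to_rows(df, column, sep=' '):
    """Return a new frame with `column` split on `sep`, one piece per row."""
    # Vectorized split into a padded frame, then pivot to long format
    pieces = df[column].str.split(sep, expand=True).stack().dropna()
    # Drop the position level so the index matches the original row labels
    pieces.index = pieces.index.droplevel(-1)
    pieces.name = column
    # drop() returns a copy, so the caller's frame is left untouched
    return df.drop(columns=[column]).join(pieces)

orders = pd.DataFrame({
    'CustNum': [32363, 31316],
    'Seatblocks': ['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
})
out = split_to_rows(orders, 'Seatblocks')
print(out)
```

As with the original approach, the split column ends up last and the row labels repeat for rows that produced several pieces.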
Application Scenarios and Extensions
This technique is not only applicable to seat block splitting but also widely useful in fields such as log analysis, text mining, and data cleaning. For example, when handling user tags, product attributes, or time-series data, it is often necessary to split composite fields into structured formats.
For more complex delimiter patterns, regular expressions can be combined:
# Split on any run of whitespace or colons
# (assumes the original df, with the Seatblocks column still present)
df['Seatblocks'].str.split(r'[\s:]+', expand=True)
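A short sketch of what this pattern does on the sample values may help when deciding whether it fits. It assumes the default behaviour in which a multi-character pattern is treated as a regular expression. Note that splitting on both delimiters at once flattens everything into one wide row per record, so the boundary between seat blocks is lost.

```python
import pandas as pd

seatblocks = pd.Series(['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'])

# One pass over both delimiters: whitespace and colons
flat = seatblocks.str.split(r'[\s:]+', expand=True)
print(flat)

# The second record's two seat blocks are now eight adjacent columns
print(flat.iloc[1].tolist())
```

If the block boundaries matter, splitting on spaces into rows first and then on colons into columns keeps them intact.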
Additionally, Pandas' explode() function (available in version 0.25.0 and above) offers a more concise method for row expansion:
# Assumes the original df, with Seatblocks still present as a string column
df['Seatblocks'] = df['Seatblocks'].str.split(' ')
exploded_df = df.explode('Seatblocks')
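A self-contained variant of the explode() approach is sketched below on a reduced version of the sample data. It uses assign() so that the original frame is not modified, and resets the index afterwards, since explode() repeats the original row labels. The variable name exploded is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'CustNum': [32363, 31316],
    'CustomerName': ['McCartney, Paul', 'Lennon, John'],
    'Seatblocks': ['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
    'ItemExt': [60, 300],
})

# assign() returns a new frame, so df keeps its original string column
exploded = df.assign(Seatblocks=df['Seatblocks'].str.split(' ')).explode('Seatblocks')

# explode() repeats the original row labels; reset them if unique labels are needed
exploded = exploded.reset_index(drop=True)
print(exploded)
```

Unlike the join-based approach, explode() leaves the Seatblocks column in its original position.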
In summary, mastering text splitting and row expansion techniques can significantly enhance data preprocessing efficiency, laying a solid foundation for subsequent analysis.