Keywords: Python | CSV file splitting | data processing
Abstract: This article explores practical methods for splitting large CSV files into multiple subfiles by specified row counts in Python. By analyzing common issues in existing code, we focus on an optimized solution that uses csv.reader for line-by-line reading and dynamic output file creation, supporting advanced features like header retention. The article details algorithm logic, code implementation specifics, and compares the pros and cons of different approaches, providing reliable technical reference for data preprocessing tasks.
Problem Background and Requirements Analysis
When handling data science or routine programming tasks, it is often necessary to manipulate large CSV (Comma-Separated Values) files. For example, a user might have a CSV file with 5000 rows of data and wish to evenly split it into five smaller files, each containing 1000 rows. This need is common in scenarios such as data chunking, parallel computation preparation, or file size limitations.
Common Errors and Code Analysis
The initial code provided by the user attempts to implement file splitting via a recursive function but contains logical flaws:
import codecs
import csv
NO_OF_LINES_PER_FILE = 1000
def again(count_file_header, count):
    f3 = open('write_' + count_file_header + '.csv', 'at')
    with open('import_1458922827.csv', 'rb') as csvfile:
        candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
        co = 0
        for row in candidate_info_reader:
            co = co + 1
            count = count + 1
            if count <= count:
                pass
            elif count >= NO_OF_LINES_PER_FILE:
                count_file_header = count + NO_OF_LINES_PER_FILE
                again(count_file_header, count)
            else:
                writer = csv.writer(f3, delimiter=',', lineterminator='\n', quoting=csv.QUOTE_ALL)
                writer.writerow(row)
Main issues include: recursive calls that repeatedly reopen the input file, confused conditional logic (`if count <= count` compares a variable with itself and is always true, so the `writerow` branch is never reached), and improper file modes (opening the input as 'rb', which fails with `csv.reader` in Python 3, and the output in 'at' append mode instead of 'w' write mode). These errors result in multiple empty files or corrupted data.
Optimized Solution
Inspired by the best answer (Answer 2), we implement a robust splitting function. This approach avoids reinventing the wheel and adopts a proven code structure:
import os
import csv
def split_csv(filehandler, delimiter=',', row_limit=1000,
              output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    # newline='' prevents blank lines between rows on Windows (Python 3)
    current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
Usage example (using a context manager so the input file is closed automatically):

with open('/path/to/input.csv', 'r', newline='') as f:
    split_csv(f, row_limit=1000)
Algorithm Logic Detailed Explanation
The core logic of this function is based on iterator line-by-line processing:
- Initialization: create a CSV reader and set the current piece number and row limit.
- Header Handling: if `keep_headers=True`, read the first row and write it as the header of every output file.
- Row Iteration: use `enumerate` to track the row index; when it exceeds the current limit, create a new output file and raise the limit by `row_limit`.
- File Switching: build each output path with `os.path.join` to ensure standardized file naming.
Key optimizations: Avoid loading the entire file into memory at once (suitable for large files), and precisely control file switching through conditional checks.
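The same chunking logic can be sketched independently of file I/O. The following illustration uses `itertools.islice` over an in-memory CSV; `chunked_rows` is a hypothetical helper introduced here for demonstration, not part of the reference solution:

```python
import csv
import io
import itertools

def chunked_rows(reader, size):
    """Yield successive lists of at most `size` rows from a csv.reader."""
    while True:
        chunk = list(itertools.islice(reader, size))
        if not chunk:
            return
        yield chunk

# Five data rows split into pieces of two, mirroring row_limit=2
data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")
reader = csv.reader(data)
headers = next(reader)           # header handling, as with keep_headers=True
pieces = list(chunked_rows(reader, 2))
print([len(p) for p in pieces])  # [2, 2, 1]
```

Because `islice` pulls rows lazily from the reader, at most one chunk is held in memory at a time, which is the same property that makes the main solution suitable for large files.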
Comparison of Alternative Methods
Referring to other answers, we briefly analyze two alternative approaches:
- Using `readlines()` and `writelines()`: as shown in Answer 1, this method is simple but memory-inefficient, since it reads the entire file into a list, e.g. `csvfile = open('file.csv', 'r').readlines()`.
- The command-line tool `split`: on Unix-like systems one can run `split -l 1000 input.csv` directly. The advantage is that no programming is required, but it lacks the flexibility and header-retention features of the Python solution.
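For completeness, the `readlines()`-based approach can be sketched as follows. This is a minimal illustration: the in-memory list of lines stands in for the result of `open('file.csv', 'r').readlines()`, and the output names follow the same `output_%d.csv` pattern as the main solution:

```python
import os
import tempfile

# Stand-in for open('file.csv', 'r').readlines(): a header plus five data rows
lines = ["a,b\n", "1,2\n", "3,4\n", "5,6\n", "7,8\n", "9,10\n"]
header, body = lines[0], lines[1:]
row_limit = 2

outdir = tempfile.mkdtemp()
for piece, start in enumerate(range(0, len(body), row_limit), start=1):
    out_path = os.path.join(outdir, 'output_%d.csv' % piece)
    with open(out_path, 'w') as f:
        # every piece keeps the header line, like keep_headers=True
        f.writelines([header] + body[start:start + row_limit])

print(sorted(os.listdir(outdir)))  # ['output_1.csv', 'output_2.csv', 'output_3.csv']
```

Note that the whole file sits in `lines` before any splitting begins, which is exactly the memory cost the iterator-based solution avoids.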
In contrast, the main reference solution strikes a balance between memory efficiency, functional completeness, and code maintainability.
Practical Recommendations and Extensions
In practical applications, consider the following improvements:
- Error Handling: wrap file opening in `try`/`except` to keep the program robust against missing or unreadable inputs.
- Performance Optimization: for very large files, use the `chunksize` parameter of the `pandas` library's `read_csv` to process the data in chunks.
- Custom Splitting Logic: extend the function to support splitting by column values, random sampling, timestamps, or other criteria.
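The pandas route mentioned above can be sketched as follows (assuming pandas is installed; the in-memory `StringIO` input is a stand-in for a real file path):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once
pieces = [chunk for chunk in pd.read_csv(data, chunksize=2)]
for i, chunk in enumerate(pieces, start=1):
    # each chunk could be written out with:
    # chunk.to_csv('output_%d.csv' % i, index=False)
    pass
print([len(p) for p in pieces])  # [2, 2, 1]
```

Each chunk carries the column names, so header retention comes for free when writing with `to_csv`.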
By understanding the core algorithm and adjusting parameters appropriately, this solution can be widely applied in scenarios such as data engineering and machine learning preprocessing.