Keywords: Python | CSV file splitting | data processing
Abstract: This article explores practical methods for splitting large CSV files into multiple subfiles by specified row counts in Python. By analyzing common issues in existing code, we focus on an optimized solution that uses csv.reader for line-by-line reading and dynamic output file creation, supporting advanced features like header retention. The article details algorithm logic, code implementation specifics, and compares the pros and cons of different approaches, providing reliable technical reference for data preprocessing tasks.
Problem Background and Requirements Analysis
When handling data science or routine programming tasks, it is often necessary to manipulate large CSV (Comma-Separated Values) files. For example, a user might have a CSV file with 5000 rows of data and wish to evenly split it into five smaller files, each containing 1000 rows. This need is common in scenarios such as data chunking, parallel computation preparation, or file size limitations.
Common Errors and Code Analysis
The initial code provided by the user attempts to implement file splitting via a recursive function but contains logical flaws:
import codecs
import csv
NO_OF_LINES_PER_FILE = 1000
def again(count_file_header, count):
    f3 = open('write_' + count_file_header + '.csv', 'at')
    with open('import_1458922827.csv', 'rb') as csvfile:
        candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
        co = 0
        for row in candidate_info_reader:
            co = co + 1
            count = count + 1
            if count <= count:
                pass
            elif count >= NO_OF_LINES_PER_FILE:
                count_file_header = count + NO_OF_LINES_PER_FILE
                again(count_file_header, count)
            else:
                writer = csv.writer(f3, delimiter=',', lineterminator='\n', quoting=csv.QUOTE_ALL)
                writer.writerow(row)
Main issues include: recursive calls that repeatedly reopen the input file, confused conditional logic (`if count <= count` compares a variable with itself and is always true, so the `writerow` branch is never reached), and improper file modes (opening the input as 'rb', which fails with `csv.reader` in Python 3, and the output in 'at' append mode instead of 'w' write mode). These errors result in multiple empty files or corrupted data.
Optimized Solution
Inspired by the best answer (Answer 2), we implement a robust splitting function. This approach avoids reinventing the wheel and adopts a proven code structure:
import os
import csv
def split_csv(filehandler, delimiter=',', row_limit=1000,
              output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    # newline='' prevents blank lines between rows on Windows (Python 3)
    current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
Usage example (using a context manager so the input file is closed automatically):

with open('/path/to/input.csv', 'r', newline='') as f:
    split_csv(f, row_limit=1000)
Algorithm Logic Detailed Explanation
The core logic of this function is based on iterator line-by-line processing:
- Initialization: create a CSV reader and set the current piece number and row limit.
- Header Handling: if `keep_headers=True`, read the first row and write it as the header of every output file.
- Row Iteration: use `enumerate` to track the row index; when it exceeds the current limit, create a new output file and raise the limit by `row_limit`.
- File Switching: build each output path with `os.path.join` to ensure standardized file naming.
Key optimizations: Avoid loading the entire file into memory at once (suitable for large files), and precisely control file switching through conditional checks.
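The same chunking logic can be sketched independently of file I/O. The following illustration uses `itertools.islice` over an in-memory CSV; `chunked_rows` is a hypothetical helper introduced here for demonstration, not part of the reference solution:

```python
import csv
import io
import itertools

def chunked_rows(reader, size):
    """Yield successive lists of at most `size` rows from a csv.reader."""
    while True:
        chunk = list(itertools.islice(reader, size))
        if not chunk:
            return
        yield chunk

# Five data rows split into pieces of two, mirroring row_limit=2
data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")
reader = csv.reader(data)
headers = next(reader)           # header handling, as with keep_headers=True
pieces = list(chunked_rows(reader, 2))
print([len(p) for p in pieces])  # [2, 2, 1]
```

Because `islice` pulls rows lazily from the reader, at most one chunk is held in memory at a time, which is the same property that makes the main solution suitable for large files.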
Comparison of Alternative Methods
Referring to other answers, we briefly analyze two alternative approaches:
- Using `readlines()` and `writelines()`: as shown in Answer 1, this method is simple but memory-inefficient, since it reads the entire file into a list, e.g. `csvfile = open('file.csv', 'r').readlines()`.
- The command-line tool `split`: on Unix-like systems one can run `split -l 1000 input.csv` directly. The advantage is that no programming is required, but it lacks the flexibility and header-retention features of the Python solution.
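For completeness, the `readlines()`-based approach can be sketched as follows. This is a minimal illustration: the in-memory list of lines stands in for the result of `open('file.csv', 'r').readlines()`, and the output names follow the same `output_%d.csv` pattern as the main solution:

```python
import os
import tempfile

# Stand-in for open('file.csv', 'r').readlines(): a header plus five data rows
lines = ["a,b\n", "1,2\n", "3,4\n", "5,6\n", "7,8\n", "9,10\n"]
header, body = lines[0], lines[1:]
row_limit = 2

outdir = tempfile.mkdtemp()
for piece, start in enumerate(range(0, len(body), row_limit), start=1):
    out_path = os.path.join(outdir, 'output_%d.csv' % piece)
    with open(out_path, 'w') as f:
        # every piece keeps the header line, like keep_headers=True
        f.writelines([header] + body[start:start + row_limit])

print(sorted(os.listdir(outdir)))  # ['output_1.csv', 'output_2.csv', 'output_3.csv']
```

Note that the whole file sits in `lines` before any splitting begins, which is exactly the memory cost the iterator-based solution avoids.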
In contrast, the main reference solution strikes a balance between memory efficiency, functional completeness, and code maintainability.
Practical Recommendations and Extensions
In practical applications, consider the following improvements:
- Error Handling: wrap file opening in `try`/`except` to keep the program robust against missing or unreadable inputs.
- Performance Optimization: for very large files, use the `chunksize` parameter of the `pandas` library's `read_csv` to process the data in chunks.
- Custom Splitting Logic: extend the function to support splitting by column values, random sampling, timestamps, or other criteria.
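The pandas route mentioned above can be sketched as follows (assuming pandas is installed; the in-memory `StringIO` input is a stand-in for a real file path):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once
pieces = [chunk for chunk in pd.read_csv(data, chunksize=2)]
for i, chunk in enumerate(pieces, start=1):
    # each chunk could be written out with:
    # chunk.to_csv('output_%d.csv' % i, index=False)
    pass
print([len(p) for p in pieces])  # [2, 2, 1]
```

Each chunk carries the column names, so header retention comes for free when writing with `to_csv`.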
By understanding the core algorithm and adjusting parameters appropriately, this solution can be widely applied in scenarios such as data engineering and machine learning preprocessing.