Keywords: C Programming | CSV Parsing | File Handling | strtok Function | Memory Management
Abstract: This article provides an in-depth exploration of CSV file parsing techniques in C programming, focusing on the usage and considerations of the strtok function. Through comprehensive code examples, it demonstrates how to read CSV files with semicolon delimiters and extract specific field data. The discussion also covers critical programming concepts such as memory management and error handling, offering practical solutions for CSV file processing.
Fundamentals of CSV File Parsing
In C programming, CSV (Comma-Separated Values) files serve as a common data storage format widely used for data exchange and persistence. Despite the name suggesting comma separation, various delimiters including semicolons and tabs can be employed. This article uses semicolon-delimited CSV files as examples to provide detailed parsing and processing methodologies.
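As a concrete reference point, the examples in this article read a small semicolon-delimited file. A minimal sketch that generates such a file (the name data.csv and the columns are illustrative assumptions, not fixed by the article) might look like:

```c
#include <stdio.h>

/* Write a small semicolon-delimited sample file.
   Returns 0 on success, -1 on failure. */
int write_sample_csv(const char* path)
{
    FILE* fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    fputs("name;city;score\n", fp);   /* header line */
    fputs("Alice;Berlin;42\n", fp);
    fputs("Bob;Paris;17\n", fp);
    return fclose(fp) == 0 ? 0 : -1;
}
```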
Core Parsing Function Implementation
The essence of CSV file parsing lies in string splitting techniques. The C standard library offers the strtok function specifically designed for string tokenization. Below is an optimized field extraction function implementation:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Return a freshly allocated copy of the Nth semicolon-separated field
   (1-based), or NULL if the line has fewer fields. The caller must free
   the returned string. */
char* extract_field(const char* line, int field_number)
{
    char* line_copy = strdup(line);   /* strtok modifies its input */
    if (line_copy == NULL)
        return NULL;
    for (char* token = strtok(line_copy, ";\n");
         token != NULL;
         token = strtok(NULL, ";\n"))
    {
        if (--field_number == 0)
        {
            char* result = strdup(token);
            free(line_copy);
            return result;
        }
    }
    free(line_copy);
    return NULL;
}
Complete File Reading Process
The following code demonstrates the complete procedure for reading CSV files and processing each data line:
int main(void)
{
    FILE* file_handle = fopen("data.csv", "r");
    if (file_handle == NULL)
    {
        fprintf(stderr, "Error: Unable to open file\n");
        return 1;
    }

    char buffer[1024];
    int line_count = 0;
    while (fgets(buffer, sizeof(buffer), file_handle))
    {
        line_count++;
        // Skip the header line
        if (line_count == 1)
            continue;

        /* extract_field copies the line internally, so the buffer
           can be passed directly without duplicating it here. */
        const char* third_field = extract_field(buffer, 3);
        if (third_field != NULL)
        {
            printf("Third field of line %d: %s\n", line_count, third_field);
            free((void*)third_field);
        }
    }
    fclose(file_handle);
    return 0;
}
Key Technical Analysis
Working Mechanism of strtok Function
The strtok function performs tokenization by modifying the original string, inserting null characters '\0' at delimiter positions to split the string into multiple substrings. This design implies:
- The function alters the original string content
- Protection via strdup-created copies is necessary to preserve the original
- Subsequent calls require NULL as the first parameter to continue tokenizing the same string
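The mutation described above is easy to demonstrate directly. The following sketch (not part of the article's parser) confirms that after one strtok call the original buffer has been truncated at the first delimiter:

```c
#include <string.h>

/* strtok writes '\0' over the first delimiter, so the buffer itself
   afterwards reads as just the first field. Returns 1 if the mutation
   occurred as described. */
int strtok_mutates(void)
{
    char buf[] = "a;b;c";
    char* tok = strtok(buf, ";");
    /* tok points into buf, and buf now compares equal to "a" */
    return tok == buf && strcmp(buf, "a") == 0;
}
```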
Memory Management Considerations
Proper memory management is crucial during CSV parsing:
// Example of correct memory management
const char* original_line = "field1;field2;field3";
char* working_copy = strdup(original_line);  // Create a modifiable copy
if (working_copy != NULL)
{
    // Tokenize the copy, leaving the original untouched
    char* token = strtok(working_copy, ";");
    while (token != NULL)
    {
        printf("Field: %s\n", token);
        token = strtok(NULL, ";");
    }
    free(working_copy);  // Release the copy's memory
}
Error Handling and Edge Cases
Robust CSV parsers must account for various edge cases:
int parse_csv_file(const char* filename)
{
    FILE* fp = fopen(filename, "r");
    if (!fp)
    {
        perror("File opening failed");
        return -1;
    }

    char line[1024];
    int record_count = 0;
    while (fgets(line, sizeof(line), fp))
    {
        // Skip empty lines (a bare "\n" has length 1)
        if (strlen(line) <= 1)
            continue;

        char* tmp = strdup(line);
        if (!tmp)
        {
            fprintf(stderr, "Memory allocation failed\n");
            fclose(fp);
            return -1;
        }
        // Parsing logic...
        free(tmp);
        record_count++;
    }
    fclose(fp);
    return record_count;
}
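Another edge case worth handling: fgets stores at most sizeof(buffer) - 1 characters, so a line longer than the buffer arrives without its trailing '\n'. A small helper (a sketch, not part of the article's parser) can flag such truncated reads:

```c
#include <string.h>

/* Returns 1 if the buffer filled by fgets holds a truncated line,
   i.e. it is non-empty but lacks the trailing newline. */
int line_was_truncated(const char* buf)
{
    size_t len = strlen(buf);
    return len > 0 && buf[len - 1] != '\n';
}
```

When this returns 1, the parser can either report an error or keep reading to discard the remainder of the overlong line.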
Performance Optimization Recommendations
For large CSV files, consider the following optimization strategies:
- Use fixed-size buffers to reduce memory allocations
- Process multiple lines in batches
- Avoid unnecessary string copying
- Employ more efficient splitting algorithms as alternatives to strtok
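As one example of the last point, a strchr-based walk avoids both the copy and the in-place mutation that strtok requires. The sketch below (the get_field name and the 1-based convention are assumptions for illustration) copies the requested field into a caller-supplied buffer:

```c
#include <stddef.h>
#include <string.h>

/* Copy the Nth semicolon-separated field (1-based) of line into out.
   The input is never modified, so no working copy is needed.
   Returns 0 on success, -1 if the field is missing or too long. */
int get_field(const char* line, int field_number, char* out, size_t out_size)
{
    const char* start = line;
    for (int i = 1; i < field_number; i++)
    {
        const char* sep = strchr(start, ';');
        if (sep == NULL)
            return -1;            /* fewer fields than requested */
        start = sep + 1;
    }
    const char* end = strchr(start, ';');
    /* Last field: stop at the newline instead of a delimiter */
    size_t len = end ? (size_t)(end - start) : strcspn(start, "\n");
    if (len >= out_size)
        return -1;                /* field too long for caller's buffer */
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Because nothing is written to the input, this variant also works on string literals and read-only buffers, which strtok cannot handle.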
Practical Application Scenarios
CSV file parsing finds extensive applications in data processing, database imports, log analysis, and more. Mastering these core techniques enables developers to build efficient and reliable data processing systems.