Keywords: CSV Processing | Comma Escaping | C# Implementation | RFC 4180 | Regular Expressions
Abstract: This article provides an in-depth exploration of standardized methods for handling commas in CSV files, based on RFC 4180 specifications. It thoroughly analyzes common issues in practical applications and offers complete C# implementation solutions, including CSV reader and escape utility classes. The content systematically explains core principles and implementation details of CSV format parsing through multiple real-world case studies.
CSV Format Specifications and Comma Handling Fundamentals
When processing CSV files, handling fields that contain commas is a common technical challenge. According to RFC 4180, any field containing a comma, a double quote, or a line break must be enclosed in double quotes. For example, when a field value is bar,baz, its correct CSV representation is "bar,baz".
Detailed Escape Mechanism
When a field itself contains double quotes, escape processing is required. RFC 4180 mandates that each double quote inside a quoted field be represented by two consecutive double quotes. For instance, the field value b"bb is written as "b""bb" in CSV. This escape mechanism ensures data integrity and parsing accuracy.
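For example, a two-column file whose second field holds the value b"bb would look like this on disk (a minimal illustration):

```
id,name
1,"b""bb"
```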
C# Implementation Solution
The following is a complete C# CSV reader implementation that supports quoted values and escape character processing:
using System;
using System.IO;
using System.Text.RegularExpressions;

public sealed class CsvReader : System.IDisposable
{
    public CsvReader(string fileName)
        : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
    }

    public CsvReader(Stream stream)
    {
        __reader = new StreamReader(stream);
    }

    public System.Collections.IEnumerable RowEnumerator
    {
        get
        {
            if (null == __reader)
                throw new System.ApplicationException("Cannot start reading without CSV input.");

            __rowno = 0;
            string sLine;
            string sNextLine;

            while (null != (sLine = __reader.ReadLine()))
            {
                // A line with an odd number of quotes means a quoted field
                // continues on the next physical line; keep appending lines
                // until the quotes balance.
                while (rexRunOnLine.IsMatch(sLine) && null != (sNextLine = __reader.ReadLine()))
                    sLine += "\n" + sNextLine;

                __rowno++;
                string[] values = rexCsvSplitter.Split(sLine);
                for (int i = 0; i < values.Length; i++)
                    values[i] = Csv.Unescape(values[i]);

                yield return values;
            }

            __reader.Close();
        }
    }

    public long RowIndex { get { return __rowno; } }

    public void Dispose()
    {
        if (null != __reader) __reader.Dispose();
    }

    private long __rowno = 0;
    private TextReader __reader;

    // Matches only commas that lie outside quoted fields
    // (an even number of quotes remains ahead of the comma).
    private static Regex rexCsvSplitter = new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

    // Matches a line containing an odd number of quotes,
    // i.e. an unclosed quoted field that runs onto the next line.
    private static Regex rexRunOnLine = new Regex(@"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$");
}
public static class Csv
{
    public static string Escape(string s)
    {
        // Double up any embedded quotes, then wrap the whole field in
        // quotes if it contains a comma, a quote, or a line break.
        if (s.Contains(QUOTE))
            s = s.Replace(QUOTE, ESCAPED_QUOTE);

        if (s.IndexOfAny(CHARACTERS_THAT_MUST_BE_QUOTED) > -1)
            s = QUOTE + s + QUOTE;

        return s;
    }

    public static string Unescape(string s)
    {
        // Strip the surrounding quotes, then collapse doubled quotes
        // back into single ones.
        if (s.StartsWith(QUOTE) && s.EndsWith(QUOTE))
        {
            s = s.Substring(1, s.Length - 2);

            if (s.Contains(ESCAPED_QUOTE))
                s = s.Replace(ESCAPED_QUOTE, QUOTE);
        }

        return s;
    }

    private const string QUOTE = "\"";
    private const string ESCAPED_QUOTE = "\"\"";
    private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
Practical Application Cases
In real business scenarios, a company name such as ABC, Inc. requires proper handling. The Csv.Escape method above wraps the value in double quotes, producing "ABC, Inc.", so the embedded comma no longer acts as a separator and the name survives as a single field in the CSV file.
Regular Expression Parsing Principles
The key regular expression rexCsvSplitter uses a positive lookahead to match only commas that lie outside quoted fields. Its core logic: a comma matches only if the remainder of the line contains an even number of double quotes (including zero). A comma inside a quoted field is followed by an odd number of quotes, so it fails the lookahead and is not treated as a separator.
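This behavior can be checked by applying the same pattern as rexCsvSplitter to a sample line (a standalone sketch):

```csharp
using System;
using System.Text.RegularExpressions;

class SplitDemo
{
    static void Main()
    {
        var splitter = new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

        // Only the two commas outside the quotes act as separators.
        foreach (string field in splitter.Split("1,\"bar,baz\",3"))
            Console.WriteLine(field);
        // Prints:
        // 1
        // "bar,baz"
        // 3
    }
}
```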
Error Handling and Edge Cases
The code also handles the trickiest edge case: fields that span multiple physical lines. The rexRunOnLine regular expression matches any line containing an odd number of double quotes, which indicates an unclosed quoted field; the reader then appends subsequent lines (rejoined with \n) until the quotes balance, so multi-line data parses correctly.
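The run-on detection can likewise be exercised on its own, using the same pattern as rexRunOnLine:

```csharp
using System;
using System.Text.RegularExpressions;

class RunOnDemo
{
    static void Main()
    {
        var runOn = new Regex(@"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$");

        // Odd number of quotes: the quoted field continues on the next line.
        Console.WriteLine(runOn.IsMatch("1,\"line one"));  // True

        // Balanced quotes: the row is complete.
        Console.WriteLine(runOn.IsMatch("1,\"a,b\",2"));   // False
    }
}
```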
Performance Optimization Considerations
This implementation uses streaming reads and deferred execution, enabling efficient processing of large CSV files. Because the row enumerator is built with yield return, each row is read and parsed only when the caller requests it, keeping memory usage low regardless of file size.
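A typical read loop looks like the following sketch (data.csv is a hypothetical input file; the CsvReader class above is assumed):

```csharp
using System;

class ReadDemo
{
    static void Main()
    {
        // Dispose the reader deterministically via using.
        using (var reader = new CsvReader("data.csv")) // hypothetical file name
        {
            // Rows are parsed lazily, one at a time, as the loop advances.
            foreach (string[] row in reader.RowEnumerator)
                Console.WriteLine("Row " + reader.RowIndex + ": " + string.Join(" | ", row));
        }
    }
}
```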