Keywords: JavaScript | CSV parsing | regular expressions | state machine | RFC 4180
Abstract: This article explores two core methods for parsing CSV strings in JavaScript: a regex-based parser for non-standard formats and a state machine implementation adhering to RFC 4180. It analyzes differences between non-standard CSV (supporting single quotes, double quotes, and escape characters) and standard RFC formats, detailing how to correctly handle fields containing commas. Complete code examples are provided, including validation regex, parsing logic, edge case handling, and a comparison of applicability and limitations of both methods.
When processing CSV (Comma-Separated Values) data in JavaScript, a common challenge is correctly parsing field values that contain commas. A naive split such as text.split(/,/) treats commas inside fields as delimiters and breaks those fields apart. For example, given the string "'string, duppi, du', 23, lala", the desired output is ["string, duppi, du", "23", "lala"], but direct splitting yields ["'string", " duppi", " du'", " 23", " lala"]. This article examines two solutions: a regex-based parser for non-standard CSV and a state machine method compliant with RFC 4180.
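The failure of the naive split can be reproduced in a few lines. This is only a minimal sketch; the variable names input and naive are chosen for illustration.

```javascript
// Naive approach: split on every comma, regardless of quoting.
const input = "'string, duppi, du', 23, lala";
const naive = input.split(/,/);

// The quoted field is cut at its inner commas, so five pieces
// come back instead of the three intended values.
console.log(naive.length);
console.log(naive);
```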
Regex-Based Approach for Non-Standard CSV Parsing
First, we define a non-standard CSV format that supports single-quoted strings, double-quoted strings, and unquoted strings. Key rules include: quoted values may contain commas; escape characters within quotes (e.g., \' or \") are processed; unquoted strings must not contain quotes, commas, or backslashes; whitespace around comma separators is ignored. Based on this, we design a validation regex to ensure input strings conform:
var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
This regex checks if the entire string consists of valid values separated by commas. If validation passes, we use another regex re_value to match and parse values individually:
var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;
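Both patterns are built from the same three alternatives (single-quoted, double-quoted, unquoted), which is easy to miss in the one-line form. As a reading aid, the validation pattern can be composed from named parts. This is a sketch, not the article's code: the names singleQuoted, doubleQuoted, unquoted, value, and reValid are assumptions introduced here.

```javascript
// The three kinds of value accepted by the non-standard format.
const singleQuoted = /'[^'\\]*(?:\\[\S\s][^'\\]*)*'/.source;
const doubleQuoted = /"[^"\\]*(?:\\[\S\s][^"\\]*)*"/.source;
const unquoted = /[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*/.source;

// Any one value (non-capturing, since validation needs no captures).
const value = `(?:${singleQuoted}|${doubleQuoted}|${unquoted})`;

// One value, then zero or more ",value" repetitions,
// with optional whitespace around each value.
const reValid = new RegExp(`^\\s*${value}\\s*(?:,\\s*${value}\\s*)*$`);

console.log(reValid.test("'string, duppi, du', 23, lala")); // well-formed
console.log(reValid.test("a, b\\c"));                       // backslash in unquoted value
console.log(reValid.test("'unterminated, 1"));              // missing closing quote
```

Composing the pattern this way keeps the repeated value sub-pattern in one place, at the cost of an extra layer of string escaping in the RegExp constructor call.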
In a JavaScript function, the two regexes are combined. The input is first validated with re_valid.test(text); if it is valid, text.replace(re_value, callback) is used purely to iterate over the matches (its return value is discarded), and the callback unescapes and collects each value. For instance, \' in single-quoted values is replaced with ', and \" in double-quoted values with ". Finally, the special case of an empty last value is handled: if the string ends with a comma, optionally followed by whitespace, an empty string is appended to the result array.
Here is a complete implementation example:
function CSVtoArray(text) {
  var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
  var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;
  // Return null if the input is not a well-formed CSV string.
  if (!re_valid.test(text)) return null;
  var a = []; // Collected values.
  // replace() is used only to walk the matches; its result is discarded.
  text.replace(re_value, function(m0, m1, m2, m3) {
    // m1: single-quoted value; unescape \' to '.
    if (m1 !== undefined) a.push(m1.replace(/\\'/g, "'"));
    // m2: double-quoted value; unescape \" to ".
    else if (m2 !== undefined) a.push(m2.replace(/\\"/g, '"'));
    // m3: unquoted value, taken as-is.
    else if (m3 !== undefined) a.push(m3);
    return '';
  });
  // A trailing comma means the last value is empty.
  if (/,\s*$/.test(text)) a.push('');
  return a;
}
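Using replace() only for its side effects can be surprising to readers. As an alternative sketch of the extraction step alone, the same pattern can be walked with String.prototype.matchAll. The names reValue and extractValues are hypothetical, and the helper assumes the input has already passed validation against re_valid.

```javascript
// Same value-matching pattern as re_value above.
const reValue = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;

// Hypothetical helper: extraction only, no validation.
function extractValues(text) {
  const values = [];
  // Each match exposes the three capture groups: single-quoted,
  // double-quoted, and unquoted. Exactly one of them is defined.
  for (const [, sq, dq, bare] of text.matchAll(reValue)) {
    if (sq !== undefined) values.push(sq.replace(/\\'/g, "'"));
    else if (dq !== undefined) values.push(dq.replace(/\\"/g, '"'));
    else if (bare !== undefined) values.push(bare);
  }
  // A trailing comma means the last value is empty.
  if (/,\s*$/.test(text)) values.push('');
  return values;
}

console.log(extractValues("'string, duppi, du', 23, lala"));
console.log(extractValues("a, b,"));
```

The first call should yield the three values from the introduction, and the second should end with an empty string for the value after the trailing comma.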
This method works well for well-defined non-standard CSV but has limitations: it requires strict input adherence, e.g., unquoted values cannot contain backslashes or quotes. For more general CSV parsing, standard formats must be considered.
State Machine Method Adhering to RFC 4180
RFC 4180 defines a standard CSV format in which fields may be enclosed in double quotes, and a double quote inside a quoted field is escaped by doubling it (e.g., "" represents one literal " character). Quoted fields can also contain commas and line breaks (CRLF), so splitting the input line by line is not sufficient. A state machine-based parser processes characters sequentially, tracking whether it is inside quotes and handling escapes and delimiters accordingly.
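The quoting rule is easiest to see from the writing side. The following is a minimal sketch of how one field could be encoded under these rules; quoteField is a hypothetical helper introduced here, not part of the article's parser.

```javascript
// Hypothetical helper: encode one field for an RFC 4180 style record.
function quoteField(value) {
  const s = String(value);
  // Quoting is needed only when the field contains a comma,
  // a double quote, or a line break.
  if (!/[",\r\n]/.test(s)) return s;
  // Double every embedded quote, then wrap the field in quotes.
  return '"' + s.replace(/"/g, '""') + '"';
}

const line = ['say "hi"', 'a,b', 'plain'].map(quoteField).join(',');
console.log(line); // "say ""hi""","a,b",plain
```

A parser for this format has to undo exactly these two steps: strip the enclosing quotes and collapse each doubled quote back to one.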
Below is a JavaScript function implementing RFC 4180 parsing:
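function csvToArray(text) {
  // p: previous character, row: current row, ret: all rows,
  // i: field index, r: row index, s: true while outside quotes, l: current character
  let p = '', row = [''], ret = [row], i = 0, r = 0, s = true, l;
  for (l of text) {
    if ('"' === l) {
      // A quote directly after a closing quote is an escaped "" pair: keep one quote.
      if (s && l === p) row[i] += l;
      s = !s;
    } else if (',' === l && s) l = row[++i] = ''; // start a new field
    else if ('\n' === l && s) {
      if ('\r' === p) row[i] = row[i].slice(0, -1); // drop the CR of a CRLF
      row = ret[++r] = [l = '']; i = 0; // start a new row
    } else row[i] += l;
    p = l;
  }
  return ret;
}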
This function uses a state variable s that is true while the parser is outside quotes (its initial value). It iterates over each character. Every double quote toggles the state; in addition, if the quote arrives while outside quotes and the previous character was also a double quote, the pair is an escaped quote and one " is appended to the current field. When a comma is met outside quotes, a new field starts; when a newline is met outside quotes, a preceding carriage return is stripped and a new row starts. Commas and line breaks inside quotes are added to the field unchanged. This approach handles multi-line CSV and escaped quotes, but it does not support the non-standard single-quote format, and input that ends with a line break yields a final row containing one empty field.
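For readers who find the compact version hard to follow, the same idea can be written with an explicit inQuotes flag and one-character lookahead. This is a sketch under stated assumptions, not the article's function: parseCsv and its variable names are hypothetical, and unlike csvToArray it does not emit a trailing empty row when the input ends with a line break.

```javascript
// Hypothetical explicit-state variant of the parser.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  let pending = false; // true once the current record has consumed any character
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    pending = true;
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') { field += '"'; i++; } // doubled quote: literal "
        else inQuotes = false;                           // closing quote
      } else {
        field += ch; // commas and line breaks are data inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ',') {
      row.push(field); field = ''; // end of field
    } else if (ch === '\n' || (ch === '\r' && text[i + 1] === '\n')) {
      if (ch === '\r') i++; // consume the LF of a CRLF pair
      row.push(field); field = '';
      rows.push(row); row = []; // end of record
      pending = false;
    } else {
      field += ch;
    }
  }
  // Flush the last record unless the input ended with a line break.
  if (pending) { row.push(field); rows.push(row); }
  return rows;
}

const sample = 'name,quote\r\n"Smith, J","said ""hi"""\r\n"multi\nline",x';
console.log(parseCsv(sample));
```

The lookahead replaces the previous-character bookkeeping of the compact version; which style is preferable is largely a matter of taste.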
Comparison and Selection Guidelines
The regex method is suitable for parsing simple, well-defined non-standard CSV strings, with relatively concise code, but the regexes are complex and hard to maintain, and input must strictly adhere to the format. The state machine method is more flexible, compliant with RFC 4180, capable of handling multi-line data and complex escapes, but implementation is slightly more involved. In practice, if data sources are controlled and formats are simple, the regex method may suffice; for compatibility with standard CSV (e.g., files exported from Excel) or large file processing, the state machine method is more reliable. Developers should choose based on specific needs and consider using existing libraries (e.g., Papa Parse) for more comprehensive functionality.
In summary, when parsing CSV strings, the key is correctly handling delimiters and escape characters. By deeply understanding regex and state machine principles, robust parsers can be built to avoid common pitfalls like mis-splitting fields containing commas.