Keywords: JavaScript | UTF-8 encoding | byte array conversion
Abstract: This article provides an in-depth exploration of converting UTF-8 strings to byte arrays in JavaScript. It begins by explaining the fundamental principles of UTF-8 encoding, including rules for single-byte and multi-byte characters. Three main implementation approaches are then detailed: a manual encoding function using bitwise operations, a combination technique utilizing encodeURIComponent and unescape, and the modern Encoding API. Through comparative analysis of each method's strengths and weaknesses, complete code examples and performance considerations are provided to help developers choose the most appropriate solution for their specific needs.
Fundamental Principles of UTF-8 Encoding
In JavaScript, strings are stored internally as UTF-16, where each code unit occupies 2 bytes (characters outside the Basic Multilingual Plane are represented by two code units, i.e., a surrogate pair). However, for network transmission and file storage, UTF-8 is more commonly used due to its ASCII compatibility and space efficiency. UTF-8 is a variable-length encoding scheme with the following core rules:
- ASCII characters (U+0000 to U+007F) are encoded as a single byte with the highest bit set to 0
- For multi-byte characters, the number of leading 1 bits in the first byte indicates the total byte count, and every subsequent (continuation) byte starts with the bits 10
- Specific encoding ranges:
  - 2 bytes: U+0080 to U+07FF
  - 3 bytes: U+0800 to U+FFFF (excluding the surrogate range U+D800 to U+DFFF)
  - 4 bytes: U+10000 to U+10FFFF
Understanding these rules is essential for manual encoding implementation. For example, the character "©" (U+00A9) requires 2 bytes in UTF-8: first byte 0xC2 (11000010), second byte 0xA9 (10101001).
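This worked example can be reproduced directly with bitwise operations; the following sketch (variable names are illustrative) builds the two bytes for "©" by hand:

```javascript
// Hand-encode the 2-byte code point U+00A9 ("©") following the rules above.
var cp = "©".charCodeAt(0);      // 0x00A9 = 169
var byte1 = 0xC0 | (cp >> 6);    // leading byte 110xxxxx -> 0xC2
var byte2 = 0x80 | (cp & 0x3F);  // continuation byte 10xxxxxx -> 0xA9
console.log(byte1.toString(16), byte2.toString(16)); // c2 a9
```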
Manual Implementation of UTF-8 Encoding Function
Based on these principles, we can implement a complete toUTF8Array function. The core logic involves converting UTF-16 code points to UTF-8 byte sequences using bitwise operations:
function toUTF8Array(str) {
  var utf8 = [];
  for (var i = 0; i < str.length; i++) {
    var charcode = str.charCodeAt(i);
    // Single-byte characters (ASCII)
    if (charcode < 0x80) {
      utf8.push(charcode);
    }
    // 2-byte UTF-8 encoding
    else if (charcode < 0x800) {
      utf8.push(0xc0 | (charcode >> 6));
      utf8.push(0x80 | (charcode & 0x3f));
    }
    // 3-byte UTF-8 encoding (non-surrogate pairs)
    else if (charcode < 0xd800 || charcode >= 0xe000) {
      utf8.push(0xe0 | (charcode >> 12));
      utf8.push(0x80 | ((charcode >> 6) & 0x3f));
      utf8.push(0x80 | (charcode & 0x3f));
    }
    // 4-byte UTF-8 encoding (surrogate pairs)
    else {
      i++;
      // Calculate full Unicode code point
      charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (str.charCodeAt(i) & 0x3ff));
      utf8.push(0xf0 | (charcode >> 18));
      utf8.push(0x80 | ((charcode >> 12) & 0x3f));
      utf8.push(0x80 | ((charcode >> 6) & 0x3f));
      utf8.push(0x80 | (charcode & 0x3f));
    }
  }
  return utf8;
}
The key to this implementation lies in correctly handling surrogate pairs. JavaScript's UTF-16 represents characters from U+10000 to U+10FFFF as a pair of 16-bit code units; the low 10 bits of each unit together form a 20-bit offset, which is added to 0x10000 to recover the full code point. The bitwise operators >> (right shift) and & (AND) extract specific bit fields, while | (OR) combines the UTF-8 marker bits with the data bits.
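The surrogate-pair arithmetic can be traced for a single character; this sketch decodes the pair for "😀" (U+1F600) and produces its four UTF-8 bytes, mirroring the 4-byte branch of the function above:

```javascript
// Surrogate pair -> code point -> 4-byte UTF-8, for "😀" (U+1F600).
var s = "\uD83D\uDE00";                       // the two UTF-16 code units
var hi = s.charCodeAt(0), lo = s.charCodeAt(1);
// Low 10 bits of each unit form a 20-bit offset; add 0x10000.
var cp = 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF)); // 0x1F600
var bytes = [
  0xF0 | (cp >> 18),          // 11110xxx
  0x80 | ((cp >> 12) & 0x3F), // 10xxxxxx
  0x80 | ((cp >> 6) & 0x3F),  // 10xxxxxx
  0x80 | (cp & 0x3F)          // 10xxxxxx
];
console.log(cp.toString(16), bytes); // 1f600 [240, 159, 152, 128]
```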
Alternative Implementation Approaches
Beyond manual encoding, two practical alternative approaches exist:
Utilizing Built-in Function Combinations
The encodeURIComponent function converts non-ASCII characters to percent-encoded UTF-8 bytes, which can be combined with unescape to obtain raw bytes:
function toUTF8ArrayViaURI(str) {
  var utf8Str = unescape(encodeURIComponent(str));
  var arr = [];
  for (var i = 0; i < utf8Str.length; i++) {
    arr.push(utf8Str.charCodeAt(i));
  }
  return arr;
}
This method is concise, but it depends on the precise interplay of these two functions, and unescape is deprecated, so it should be avoided in new code.
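To see why the combination works, it helps to trace a single character through the two steps; the following sketch (using the deprecated unescape purely for illustration) shows the intermediate values:

```javascript
// Trace "©" through the encodeURIComponent + unescape pipeline.
var pct = encodeURIComponent("©"); // "%C2%A9" — percent-encoded UTF-8 bytes
var bin = unescape(pct);           // 2-char string, one char per raw byte
var bytes = [];
for (var i = 0; i < bin.length; i++) {
  bytes.push(bin.charCodeAt(i));   // each char code is one UTF-8 byte
}
console.log(pct, bytes);           // %C2%A9 [194, 169]
```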
Using the Modern Encoding API
Modern browsers provide the TextEncoder API specifically designed for string encoding:
function toUTF8ArrayModern(str) {
  var encoder = new TextEncoder();
  return Array.from(encoder.encode(str));
}
// Example usage
var encoded = new TextEncoder().encode("Hello 世界");
console.log(encoded); // Uint8Array [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
This approach is the most concise and efficient, but browser compatibility must be considered: in unsupported environments, a polyfill or a fallback to the manual implementation is necessary.
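One way to combine the approaches is a feature-detected wrapper; this is only a sketch, and it assumes the manual toUTF8Array function from earlier in this article is in scope as the fallback:

```javascript
// Prefer TextEncoder when available; otherwise fall back to the
// manual encoder (toUTF8Array, defined earlier, assumed in scope).
function encodeUTF8(str) {
  if (typeof TextEncoder !== "undefined") {
    return Array.from(new TextEncoder().encode(str));
  }
  return toUTF8Array(str); // manual fallback for old environments
}
console.log(encodeUTF8("Hi ©")); // [72, 105, 32, 194, 169]
```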
Performance and Selection Recommendations
Each method has distinct advantages and disadvantages:
- Manual Implementation: Best compatibility and control, but higher code complexity
- Function Combination: Concise code, but relies on deprecated functions; not recommended for new projects
- Encoding API: Optimal performance, clear semantics; the preferred choice for modern projects
In practical applications, prioritize using TextEncoder with polyfills for older browsers. For scenarios requiring fine-grained control or special encoding needs, manual implementation remains a reliable option.
Application Scenarios and Considerations
UTF-8 byte arrays are particularly useful in the following scenarios:
- Network communication (e.g., WebSocket, Fetch API)
- File reading/writing (via Blob or ArrayBuffer)
- Cryptographic hash calculations
- Interaction with backend systems
It's important to note that numbers in JavaScript are double-precision floating-point values, so a plain array stores each byte inefficiently; for pure byte operations, the Uint8Array typed array is more appropriate. Additionally, when processing large strings, attention should be paid to memory usage and performance.
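As a brief sketch of this point: TextEncoder already returns a Uint8Array, ready for byte-oriented APIs, and a plain number array can be wrapped in a Uint8Array when one is required:

```javascript
// TextEncoder yields a Uint8Array directly (6 bytes for "世界",
// since each character needs 3 UTF-8 bytes).
var bytes = new TextEncoder().encode("世界");
// A plain number array can be wrapped when a typed array is required.
var wrapped = new Uint8Array([228, 184, 150]); // bytes of "世"
console.log(bytes instanceof Uint8Array, bytes.byteLength); // true 6
```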