Keywords: JavaScript | UTF-8 encoding | byte array conversion
Abstract: This article provides an in-depth exploration of converting UTF-8 strings to byte arrays in JavaScript. It begins by explaining the fundamental principles of UTF-8 encoding, including rules for single-byte and multi-byte characters. Three main implementation approaches are then detailed: a manual encoding function using bitwise operations, a combination technique utilizing encodeURIComponent and unescape, and the modern Encoding API. Through comparative analysis of each method's strengths and weaknesses, complete code examples and performance considerations are provided to help developers choose the most appropriate solution for their specific needs.
Fundamental Principles of UTF-8 Encoding
In JavaScript, strings are stored internally as UTF-16, where each code unit occupies 2 bytes (characters outside the Basic Multilingual Plane are represented by two code units, i.e., a surrogate pair). However, for network transmission and file storage, UTF-8 is more commonly used due to its ASCII compatibility and space efficiency. UTF-8 is a variable-length encoding scheme with the following core rules:
- ASCII characters (U+0000 to U+007F) are encoded as a single byte with the highest bit set to 0
- For multi-byte characters, the number of leading 1 bits in the first byte indicates the total byte count, and every subsequent (continuation) byte starts with the bits 10
- Specific encoding ranges:
  - 2 bytes: U+0080 to U+07FF
  - 3 bytes: U+0800 to U+FFFF (excluding the surrogate range U+D800 to U+DFFF)
  - 4 bytes: U+10000 to U+10FFFF
Understanding these rules is essential for manual encoding implementation. For example, the character "©" (U+00A9) requires 2 bytes in UTF-8: first byte 0xC2 (11000010), second byte 0xA9 (10101001).
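This worked example can be reproduced directly with bitwise operations; the following sketch (variable names are illustrative) builds the two bytes for "©" by hand:

```javascript
// Hand-encode the 2-byte code point U+00A9 ("©") following the rules above.
var cp = "©".charCodeAt(0);      // 0x00A9 = 169
var byte1 = 0xC0 | (cp >> 6);    // leading byte 110xxxxx -> 0xC2
var byte2 = 0x80 | (cp & 0x3F);  // continuation byte 10xxxxxx -> 0xA9
console.log(byte1.toString(16), byte2.toString(16)); // c2 a9
```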
Manual Implementation of UTF-8 Encoding Function
Based on these principles, we can implement a complete toUTF8Array function. The core logic involves converting UTF-16 code points to UTF-8 byte sequences using bitwise operations:
function toUTF8Array(str) {
  var utf8 = [];
  for (var i = 0; i < str.length; i++) {
    var charcode = str.charCodeAt(i);
    // Single-byte characters (ASCII)
    if (charcode < 0x80) {
      utf8.push(charcode);
    }
    // 2-byte UTF-8 encoding
    else if (charcode < 0x800) {
      utf8.push(0xc0 | (charcode >> 6));
      utf8.push(0x80 | (charcode & 0x3f));
    }
    // 3-byte UTF-8 encoding (non-surrogate pairs)
    else if (charcode < 0xd800 || charcode >= 0xe000) {
      utf8.push(0xe0 | (charcode >> 12));
      utf8.push(0x80 | ((charcode >> 6) & 0x3f));
      utf8.push(0x80 | (charcode & 0x3f));
    }
    // 4-byte UTF-8 encoding (surrogate pairs)
    else {
      i++;
      // Calculate full Unicode code point
      charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (str.charCodeAt(i) & 0x3ff));
      utf8.push(0xf0 | (charcode >> 18));
      utf8.push(0x80 | ((charcode >> 12) & 0x3f));
      utf8.push(0x80 | ((charcode >> 6) & 0x3f));
      utf8.push(0x80 | (charcode & 0x3f));
    }
  }
  return utf8;
}
The key to this implementation lies in correctly handling surrogate pairs. JavaScript's UTF-16 represents characters from U+10000 to U+10FFFF as a pair of 16-bit code units; the low 10 bits of each unit together form a 20-bit offset, which is added to 0x10000 to recover the full code point. The bitwise operators >> (right shift) and & (AND) extract specific bit fields, while | (OR) combines the UTF-8 marker bits with the data bits.
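The surrogate-pair arithmetic can be traced for a single character; this sketch decodes the pair for "😀" (U+1F600) and produces its four UTF-8 bytes, mirroring the 4-byte branch of the function above:

```javascript
// Surrogate pair -> code point -> 4-byte UTF-8, for "😀" (U+1F600).
var s = "\uD83D\uDE00";                       // the two UTF-16 code units
var hi = s.charCodeAt(0), lo = s.charCodeAt(1);
// Low 10 bits of each unit form a 20-bit offset; add 0x10000.
var cp = 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF)); // 0x1F600
var bytes = [
  0xF0 | (cp >> 18),          // 11110xxx
  0x80 | ((cp >> 12) & 0x3F), // 10xxxxxx
  0x80 | ((cp >> 6) & 0x3F),  // 10xxxxxx
  0x80 | (cp & 0x3F)          // 10xxxxxx
];
console.log(cp.toString(16), bytes); // 1f600 [240, 159, 152, 128]
```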
Alternative Implementation Approaches
Beyond manual encoding, two practical alternative approaches exist:
Utilizing Built-in Function Combinations
The encodeURIComponent function converts non-ASCII characters to percent-encoded UTF-8 bytes, which can be combined with unescape to obtain raw bytes:
function toUTF8ArrayViaURI(str) {
  var utf8Str = unescape(encodeURIComponent(str));
  var arr = [];
  for (var i = 0; i < utf8Str.length; i++) {
    arr.push(utf8Str.charCodeAt(i));
  }
  return arr;
}
This method is concise, but it depends on the precise interplay of these two functions, and unescape is deprecated, so it should be avoided in new code.
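To see why the combination works, it helps to trace a single character through the two steps; the following sketch (using the deprecated unescape purely for illustration) shows the intermediate values:

```javascript
// Trace "©" through the encodeURIComponent + unescape pipeline.
var pct = encodeURIComponent("©"); // "%C2%A9" — percent-encoded UTF-8 bytes
var bin = unescape(pct);           // 2-char string, one char per raw byte
var bytes = [];
for (var i = 0; i < bin.length; i++) {
  bytes.push(bin.charCodeAt(i));   // each char code is one UTF-8 byte
}
console.log(pct, bytes);           // %C2%A9 [194, 169]
```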
Using the Modern Encoding API
Modern browsers provide the TextEncoder API specifically designed for string encoding:
function toUTF8ArrayModern(str) {
  var encoder = new TextEncoder();
  return Array.from(encoder.encode(str));
}
// Example usage
var encoded = new TextEncoder().encode("Hello 世界");
console.log(encoded); // Uint8Array [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
This approach is the most concise and efficient, but browser compatibility must be considered: in unsupported environments, a polyfill or a fallback to the manual implementation is necessary.
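One way to combine the approaches is a feature-detected wrapper; this is only a sketch, and it assumes the manual toUTF8Array function from earlier in this article is in scope as the fallback:

```javascript
// Prefer TextEncoder when available; otherwise fall back to the
// manual encoder (toUTF8Array, defined earlier, assumed in scope).
function encodeUTF8(str) {
  if (typeof TextEncoder !== "undefined") {
    return Array.from(new TextEncoder().encode(str));
  }
  return toUTF8Array(str); // manual fallback for old environments
}
console.log(encodeUTF8("Hi ©")); // [72, 105, 32, 194, 169]
```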
Performance and Selection Recommendations
Each method has distinct advantages and disadvantages:
- Manual Implementation: Best compatibility and control, but higher code complexity
- Function Combination: Concise code, but relies on deprecated functions; not recommended for new projects
- Encoding API: Optimal performance, clear semantics; the preferred choice for modern projects
In practical applications, prioritize using TextEncoder with polyfills for older browsers. For scenarios requiring fine-grained control or special encoding needs, manual implementation remains a reliable option.
Application Scenarios and Considerations
UTF-8 byte arrays are particularly useful in the following scenarios:
- Network communication (e.g., WebSocket, Fetch API)
- File reading/writing (via Blob or ArrayBuffer)
- Cryptographic hash calculations
- Interaction with backend systems
It's important to note that numbers in JavaScript are double-precision floating-point values, so a plain array stores each byte inefficiently; for pure byte operations, the Uint8Array typed array is more appropriate. Additionally, when processing large strings, attention should be paid to memory usage and performance.
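As a brief sketch of this point: TextEncoder already returns a Uint8Array, ready for byte-oriented APIs, and a plain number array can be wrapped in a Uint8Array when one is required:

```javascript
// TextEncoder yields a Uint8Array directly (6 bytes for "世界",
// since each character needs 3 UTF-8 bytes).
var bytes = new TextEncoder().encode("世界");
// A plain number array can be wrapped when a typed array is required.
var wrapped = new Uint8Array([228, 184, 150]); // bytes of "世"
console.log(bytes instanceof Uint8Array, bytes.byteLength); // true 6
```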