Keywords: JavaScript | String Conversion | Byte Array | UTF-16 | Encoding
Abstract: This article details how to convert a string to a UTF-16 Little-Endian byte array in JavaScript, matching the output of C#'s UnicodeEncoding.GetBytes method. It covers UTF-16 encoding basics, implementation using charCodeAt(), code examples, and considerations for handling special characters, aiding developers in cross-language data interoperability.
In web development, there is often a need to convert string data from client-side JavaScript into byte arrays for interoperability with server-side code, such as in C#. C#'s UnicodeEncoding class uses UTF-16 Little-Endian encoding by default to convert strings to byte arrays, and since JavaScript internally represents strings in UTF-16, compatible conversion can be achieved through appropriate methods. This article explains the process step by step and provides detailed code examples.
UTF-16 Encoding Basics
UTF-16 is a character encoding standard that uses 16-bit code units to represent characters. For characters within the Basic Multilingual Plane (BMP), one code unit suffices; for characters outside the BMP (such as some emojis or special symbols), surrogate pairs—two code units—are used. In Little-Endian byte order, the low byte of each code unit is stored first, followed by the high byte. For example, the character 'H' with a Unicode code point of 72 is represented as the byte sequence [72, 0] in UTF-16 Little-Endian.
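The byte-splitting described above can be sketched directly in JavaScript. This is a minimal illustration of the 'H' example, using only standard string and bitwise operations:

```javascript
// Splitting the code unit for 'H' (72, i.e. 0x0048) into little-endian bytes.
var code = "H".charCodeAt(0);      // 72
var lowByte = code & 0xFF;         // 72, stored first in little-endian order
var highByte = (code >> 8) & 0xFF; // 0, stored second
console.log([lowByte, highByte]);  // [72, 0]
```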
JavaScript String Representation
JavaScript strings are internally encoded in UTF-16, with each character consisting of one or more 16-bit code units. The charCodeAt(index) method can be used to retrieve the code unit value at a specified index, returning an integer between 0 and 65535. This forms the basis for conversion to a byte array, as each code unit can be split into two separate bytes.
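A quick sketch shows that charCodeAt() returns individual code units rather than code points, which matters for non-BMP characters:

```javascript
// charCodeAt() returns 16-bit code units, not full code points.
var s = "A\uD801\uDC37"; // "A" plus "𐐷" (U+10437), which needs a surrogate pair
console.log(s.length);                     // 3 code units, not 2 characters
console.log(s.charCodeAt(0));              // 65 ('A')
console.log(s.charCodeAt(1).toString(16)); // "d801" (high surrogate)
console.log(s.charCodeAt(2).toString(16)); // "dc37" (low surrogate)
```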
Implementing String to Byte Array Conversion
To convert a string to a UTF-16 Little-Endian byte array, iterate through each character in the string, use charCodeAt() to get the code unit, and then extract the low and high bytes. Here is a simple function implementation:
function stringToUtf16Bytes(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        bytes.push(code & 0xFF);        // Low byte first (little-endian)
        bytes.push((code >> 8) & 0xFF); // High byte second
    }
    return bytes;
}

For example, calling stringToUtf16Bytes("Hello") returns the array [72, 0, 101, 0, 108, 0, 108, 0, 111, 0], which matches the output of C#'s UnicodeEncoding.GetBytes("Hello"). This approach works for most common characters, but attention should be paid to edge cases in character encoding.
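One way to verify the conversion is to decode the byte array back into a string and compare. The helper below, utf16BytesToString, is a hypothetical inverse written for this article (not part of C#'s API), assuming well-formed UTF-16 LE input of even length:

```javascript
// Hypothetical inverse: rebuild a string from UTF-16 LE bytes for verification.
function utf16BytesToString(bytes) {
    var result = "";
    for (var i = 0; i < bytes.length; i += 2) {
        // Recombine low byte and high byte into one 16-bit code unit.
        result += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
    }
    return result;
}

console.log(utf16BytesToString([72, 0, 101, 0, 108, 0, 108, 0, 111, 0])); // "Hello"
```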
Handling Surrogate Pairs and Special Characters
For characters outside the BMP (with code points greater than 0xFFFF), JavaScript stores a surrogate pair, and charCodeAt() returns each surrogate code unit individually. For instance, the character "𐐷" (U+10437) is represented by the two code units 0xD801 and 0xDC37. In such cases it is useful to check whether a code unit is part of a surrogate pair, so that malformed input such as an unpaired surrogate can be handled deliberately. Here is an enhanced function that handles surrogate pairs:
function stringToUtf16BytesWithSurrogates(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
            var low = str.charCodeAt(i + 1);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Valid surrogate pair: emit both code units, low byte first.
                bytes.push(code & 0xFF, (code >> 8) & 0xFF);
                bytes.push(low & 0xFF, (low >> 8) & 0xFF);
                i++; // Skip the low surrogate we just consumed
                continue;
            }
            // Unpaired high surrogate: fall through and emit it as-is.
        }
        bytes.push(code & 0xFF);
        bytes.push((code >> 8) & 0xFF);
    }
    return bytes;
}

This version correctly handles surrogate pairs, and unlike a naive pairing loop it never reads past the end of the string when an unpaired high surrogate appears. In practical applications, if the string may contain non-BMP characters, using this enhanced function is recommended to avoid data loss.
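It is worth noting that, because surrogate halves are themselves ordinary 16-bit code units, simply emitting every code unit in little-endian order already yields a valid UTF-16 LE sequence for well-formed input; the pairing logic mainly matters for detecting malformed strings. The sketch below confirms the expected bytes for "𐐷" using plain code-unit iteration:

```javascript
// Expected UTF-16 LE bytes for "𐐷" (U+10437): surrogate pair D801 DC37.
var s = "\uD801\uDC37"; // "𐐷"
var bytes = [];
for (var i = 0; i < s.length; i++) {
    var code = s.charCodeAt(i);
    bytes.push(code & 0xFF, (code >> 8) & 0xFF);
}
console.log(bytes.map(function (b) { return b.toString(16); }));
// logs ["1", "d8", "37", "dc"], i.e. D801 then DC37, each low byte first
```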
Alternative Methods and Considerations
Beyond manual implementation, modern browsers support the TextEncoder API, but it only produces UTF-8 output and is therefore not suitable for scenarios requiring exact UTF-16 matching. For example:
let encoder = new TextEncoder();   // Always encodes to UTF-8
let bytes = encoder.encode("abc"); // Returns Uint8Array [97, 98, 99]

If the target environment is Node.js, the Buffer class can be used:
var str = "Hello";
var buffer = Buffer.from(str, 'utf16le');
var bytes = Array.from(buffer); // Convert to a plain array

Buffer.from(str, 'utf16le') produces the same little-endian byte sequence as C#'s UnicodeEncoding, but it is only available in Node.js; in the browser, the manual charCodeAt() approach remains the most portable option. Developers should also consider byte order marks (BOM): C#'s UnicodeEncoding.GetBytes does not prepend a BOM, so the JavaScript implementations shown here need no extra handling.
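For browser code that prefers typed arrays over plain arrays, the same little-endian layout can be produced with DataView, whose setUint16 method takes an explicit endianness flag. This is a sketch, not part of the article's original code; stringToUtf16LE is an assumed name:

```javascript
// Sketch: produce UTF-16 LE bytes as a Uint8Array using DataView.
function stringToUtf16LE(str) {
    var buf = new ArrayBuffer(str.length * 2);
    var view = new DataView(buf);
    for (var i = 0; i < str.length; i++) {
        // The third argument (true) selects little-endian byte order.
        view.setUint16(i * 2, str.charCodeAt(i), true);
    }
    return new Uint8Array(buf);
}

console.log(Array.from(stringToUtf16LE("Hi"))); // [72, 0, 105, 0]
```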
Summary and Best Practices
Using the charCodeAt() method, JavaScript can efficiently convert strings to UTF-16 Little-Endian byte arrays, ensuring compatibility with server-side languages like C#. It is advisable to test various character inputs during development to verify conversion accuracy. For cross-browser compatibility, if using newer APIs like TextEncoder, check browser support. Overall, understanding encoding principles and manual implementation offers the greatest flexibility and control.