In-depth Analysis of MySQL LENGTH() vs CHAR_LENGTH(): Fundamental Differences Between Byte Length and Character Length

Keywords: MySQL | String Functions | Character Encoding

Abstract: This article provides a comprehensive examination of the essential differences between MySQL's LENGTH() and CHAR_LENGTH() string functions. Through detailed code examples and theoretical analysis, it explains the core mechanism where LENGTH() calculates length in bytes while CHAR_LENGTH() calculates in characters. The focus is on understanding how multi-byte characters in Unicode encoding and UTF-8 character sets affect length calculations, with practical guidance for real-world application scenarios. Complete MySQL code implementations are included to help developers grasp the underlying principles of string storage and processing.

Function Definitions and Core Differences

In MySQL, LENGTH() and CHAR_LENGTH() both report the length of a string and look interchangeable, but they measure in different units, and that difference determines where each one is appropriate.

The LENGTH() function returns the length of a string in bytes, that is, the number of bytes the string occupies in its character set. The result therefore depends on the encoding: the same text can have different byte lengths in different character sets. ASCII characters occupy 1 byte each in common character sets such as latin1 and UTF-8, so LENGTH('MySQL') returns 5.

The CHAR_LENGTH() function returns the character length of a string, representing the actual number of characters contained in the string. This function is multi-byte safe and can correctly identify and process multi-byte encoded characters. In the same example, CHAR_LENGTH('MySQL') also returns 5, as the string contains 5 characters.
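As a minimal sketch of the two definitions, the query below runs both functions and their standard synonyms on the same ASCII literal. The column aliases are illustrative names only, and the results assume an ordinary connection character set such as latin1 or utf8mb4.

```sql
-- ASCII-only input: one byte per character, so both counts agree.
SELECT LENGTH('MySQL')           AS byte_len,    -- 5
       OCTET_LENGTH('MySQL')     AS octet_len,   -- 5 (synonym for LENGTH)
       CHAR_LENGTH('MySQL')      AS char_len,    -- 5
       CHARACTER_LENGTH('MySQL') AS char_len2;   -- 5 (synonym for CHAR_LENGTH)
```

With ASCII input the two functions cannot be told apart, which is why the difference often goes unnoticed until multi-byte text arrives.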

Differences in Encoding Environments

In single-byte character sets such as latin1, both functions return the same value because every character is stored in exactly one byte. In modern web applications and multilingual environments, where Unicode (particularly UTF-8) has become standard, the two results diverge as soon as a string contains a multi-byte character.

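UTF-8 is a variable-length encoding in which each character occupies 1 to 4 bytes. ASCII characters, including unaccented English letters, occupy 1 byte, while accented letters, symbols, and non-Latin characters require 2 to 4 bytes. Note that MySQL's utf8 character set is an alias for utf8mb3, which stores at most 3 bytes per character; 4-byte characters such as emoji require utf8mb4. For example, the Euro symbol "€" occupies 3 bytes in UTF-8 encoding: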

SELECT LENGTH(_utf8 '€'), CHAR_LENGTH(_utf8 '€');
-- Returns: 3, 1

This example clearly demonstrates the difference: LENGTH() returns 3 (bytes), while CHAR_LENGTH() returns 1 (characters). The Euro symbol's UTF-8 encoding is 0xE282AC, which indeed requires 3 bytes for storage.
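The result of a literal such as '€' depends on how the client encodes it before sending it to the server. One way to remove that variable, sketched here on the assumption that the server supports utf8mb4, is to supply the bytes as a hex literal with a character set introducer. The second query applies the same idea to a 4-byte character (U+1F600, an emoji), which is not from the original example.

```sql
-- Euro sign built from its UTF-8 bytes, independent of the client encoding.
SELECT HEX(_utf8mb4 X'E282AC')         AS euro_hex,     -- E282AC
       LENGTH(_utf8mb4 X'E282AC')      AS euro_bytes,   -- 3
       CHAR_LENGTH(_utf8mb4 X'E282AC') AS euro_chars;   -- 1

-- A 4-byte UTF-8 character (U+1F600); requires utf8mb4 rather than utf8/utf8mb3.
SELECT LENGTH(_utf8mb4 X'F09F9880')      AS emoji_bytes,  -- 4
       CHAR_LENGTH(_utf8mb4 X'F09F9880') AS emoji_chars;  -- 1
```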

In-depth Analysis with Unicode Character Sets

To gain deeper understanding of this difference, we can test with the UCS2 character set. UCS2 is a fixed-length Unicode encoding where each character occupies 2 bytes:

SET @A = CONVERT('Gaurav' USING ucs2);
SELECT CHAR_LENGTH(@A), LENGTH(@A);
-- Returns: 6, 12

In this example, the string "Gaurav" contains 6 characters, so CHAR_LENGTH() returns 6. However, since each character occupies 2 bytes in UCS2 encoding, LENGTH() returns 12 (6 characters × 2 bytes/character).
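To see where the 12 bytes come from, the following sketch prints the stored bytes of the same string in two character sets with HEX(). The comparison with utf8mb4 is an addition for illustration and assumes that character set is available.

```sql
-- Same six characters, two encodings: ucs2 pads every character to 2 bytes.
SELECT HEX(CONVERT('Gaurav' USING utf8mb4))    AS utf8mb4_hex,    -- 476175726176
       LENGTH(CONVERT('Gaurav' USING utf8mb4)) AS utf8mb4_bytes,  -- 6
       HEX(CONVERT('Gaurav' USING ucs2))       AS ucs2_hex,       -- 004700610075007200610076
       LENGTH(CONVERT('Gaurav' USING ucs2))    AS ucs2_bytes,     -- 12
       CHAR_LENGTH(CONVERT('Gaurav' USING ucs2)) AS ucs2_chars;   -- 6
```

Each ucs2 character carries a leading 00 byte here, which doubles the byte length while the character count stays at 6.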

Another example involves the copyright symbol "©":

SET @B = CONVERT('©' USING utf8);
SELECT CHAR_LENGTH(@B), LENGTH(@B);
-- Returns: 1, 2

The copyright symbol occupies 2 bytes in UTF-8 encoding, so LENGTH() returns 2, while CHAR_LENGTH() correctly identifies it as 1 character.
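The byte length of a single character is a property of the character set, not of the character itself. As a rough illustration, the query below converts the copyright symbol into several character sets; it assumes latin1, ucs2, utf32, and utf8mb4 are all available on the server, and it builds the symbol from its UTF-8 bytes (C2 A9) so the result does not depend on the client encoding.

```sql
-- One character, four encodings: only the byte count changes.
SELECT LENGTH(CONVERT(_utf8mb4 X'C2A9' USING latin1))     AS latin1_bytes,   -- 1
       LENGTH(_utf8mb4 X'C2A9')                           AS utf8mb4_bytes,  -- 2
       LENGTH(CONVERT(_utf8mb4 X'C2A9' USING ucs2))       AS ucs2_bytes,     -- 2
       LENGTH(CONVERT(_utf8mb4 X'C2A9' USING utf32))      AS utf32_bytes,    -- 4
       CHAR_LENGTH(CONVERT(_utf8mb4 X'C2A9' USING utf32)) AS char_count;     -- 1
```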

Practical Considerations for Binary String Storage

Binary string types have their own use cases. They are appropriate when the exact byte sequence of the data must be preserved, or when handling non-text data such as images or encrypted data. Binary strings have no character set, so storing them involves no character set conversion, and their length is measured in bytes by both functions.
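A short sketch of how binary strings change the picture: once a value is cast to BINARY it has no character set, so CHAR_LENGTH() has only bytes to count. The example reuses the Euro sign bytes from earlier and assumes utf8mb4 is available; the aliases are illustrative.

```sql
-- The same three bytes, counted as text and as a binary string.
SELECT CHAR_LENGTH(_utf8mb4 X'E282AC')                 AS text_chars,     -- 1
       CHAR_LENGTH(CAST(_utf8mb4 X'E282AC' AS BINARY)) AS binary_chars,   -- 3
       LENGTH(CAST(_utf8mb4 X'E282AC' AS BINARY))      AS binary_bytes;   -- 3
```

This is one reason to keep text in character-based columns: on BINARY, VARBINARY, or BLOB data, CHAR_LENGTH() no longer reflects the number of characters a user would see.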

However, for most text processing scenarios, character-based string types are recommended as they provide better internationalization support and text processing capabilities. The CHAR_LENGTH() function is more reliable in such contexts as it accurately reflects the character count as perceived by users.

Best Practices for Function Selection

In practical development, the choice between these functions depends on specific requirements:

Use LENGTH() when you need to understand the storage size of a string
Use CHAR_LENGTH() when you need to count the number of characters in a string
In applications involving multilingual content, prefer CHAR_LENGTH() to ensure accurate character counting
In performance-sensitive scenarios where strings are known to contain only single-byte characters, LENGTH() returns the same result and may be marginally cheaper, although the difference is rarely significant in practice
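One practical use of these guidelines, sketched below, is to compare the two functions to find rows that contain multi-byte characters. The table demo_names and its columns are hypothetical and exist only for this illustration; the second row is inserted from its UTF-8 bytes so the example does not depend on the client encoding.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE demo_names (
    id   INT PRIMARY KEY,
    name VARCHAR(20) CHARACTER SET utf8mb4   -- the limit of 20 is in characters, not bytes
);

INSERT INTO demo_names (id, name) VALUES
    (1, 'MySQL'),                      -- 5 characters, 5 bytes
    (2, _utf8mb4 X'C3A9636F6C65');     -- "école": 5 characters, 6 bytes

-- Rows where the byte count exceeds the character count contain multi-byte characters.
SELECT id,
       name,
       CHAR_LENGTH(name) AS char_len,
       LENGTH(name)      AS byte_len
FROM demo_names
WHERE LENGTH(name) <> CHAR_LENGTH(name);   -- returns only the row with id = 2
```

The same comparison can serve as a quick check before relying on LENGTH() as a stand-in for the character count.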

Understanding the fundamental differences between these functions helps developers make correct technical decisions across different character set environments, avoiding logical errors caused by encoding differences.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
