Keywords: MySQL | Character Set | Collation | UTF8mb4 | Database Repair
Abstract: This article provides a comprehensive examination of character set and collation repair in MySQL databases. Addressing the issue of Chinese and Japanese characters displaying as ??? due to Latin character set configuration, it offers complete conversion solutions from database, table to column levels. Detailed analysis of utf8mb4_0900_ai_ci meaning and advantages, combined with practical cases demonstrating safe and efficient character set migration to ensure proper storage and display of multilingual data.
Problem Background and Impact Analysis
In MySQL database management, incorrect configuration of character sets and collations can lead to serious data display issues. As described by users, when a database uses Latin character set to store data containing Chinese and Japanese characters, these characters are incorrectly converted to "???" symbols. This situation typically occurs in improperly configured initial database setups or legacy systems.
Fundamental Concepts of Character Sets and Collations
Character sets define the collection of characters that can be stored in a database, while collations determine the sorting and comparison rules for these characters. MySQL supports multiple character sets, with the UTF-8 series being the preferred choice for modern applications due to its excellent support for multiple languages.
The UTF-8 character set has two important versions: traditional utf8 (3-byte encoding) and the newer utf8mb4 (4-byte encoding). utf8mb4 fully supports the Unicode standard, including emojis and other supplementary characters, and is currently the recommended character set.
Complete Repair Solution
To address character set misconfiguration issues, repairs need to be performed at multiple levels:
Database Level Repair
ALTER DATABASE <database_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
This command modifies the default character set and collation for the entire database to utf8mb4. Note that this only affects newly created tables and columns going forward, and does not automatically apply to existing data tables.
Table Level Repair
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
This command converts the specified table and all its columns to the target character set and collation. This is the most effective method for repairing existing data tables, but should be performed with caution, and data backup is recommended before execution.
Column Level Repair
ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
For specific columns that require individual adjustment, this command can be used. Complete data type definition must be specified to ensure compatibility with the original definition.
Detailed Collation Explanation
Each part of the collation name has specific meanings:
Character Set Version Identification:
- _unicode_: Corresponds to Unicode 4.0 standard
- _unicode_520_: Corresponds to Unicode 5.20 standard
- _0900_: Corresponds to Unicode 9.0 standard, with significant performance optimization
Comparison Rules:
- _bin: Binary comparison, case-sensitive and accent-sensitive
- _ci: Explicitly case-insensitive, implicitly accent-insensitive
- _ai_ci: Explicitly case-insensitive and accent-insensitive
- _as: Accent-sensitive
Performance Characteristics:
- _bin: Simple and fast
- _general_ci: Medium speed, supports basic multi-letter comparison
- _0900_: Significantly optimized through rewrite, best performance
Practical Case Analysis
Based on actual project experience, mixed collations can cause MySQL to throw "Illegal mix of collations" errors. This situation is common in legacy systems that have been maintained by multiple development teams, leading to configuration inconsistencies.
By querying the INFORMATION_SCHEMA system tables, batch repair scripts can be generated:
SELECT
CONCAT('ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;') AS stmt
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'database_name'
AND TABLE_COLLATION = 'old_collation';
Data Recovery Possibility
For characters that have already been corrupted and display as "???", if the original data lost information during storage due to character set mismatch, simply modifying the character set may not recover the original characters. In such cases, data needs to be restored from backups or re-entered with correct encoding.
Best Practice Recommendations
1. Prevention Over Repair: Standardize character set and collation configuration during project initialization
2. Regular Inspection: Monitor character set consistency across database objects through system tables
3. Backup First: Complete database backup must be performed before any character set modification operations
4. Testing Verification: Thorough testing in development environment before applying to production
5. Documentation: Record character set configuration decisions and change history
Tool Support
Modern development tools such as TablePlus, DDEV provide convenient database management functions that can assist in character set inspection and migration. Some frameworks (like Craft CMS) also offer dedicated CLI commands to simplify the character set conversion process.
Conclusion
Proper handling of MySQL character sets and collations is crucial for ensuring the normal operation of multilingual applications. Through systematic repair solutions and preventive measures, character display issues can be effectively avoided, ensuring data integrity and consistency. utf8mb4_0900_ai_ci, as the currently recommended configuration combination, achieves a good balance between functional completeness and performance.