MySQL utf8 4-byte truncation
Description
MySQL's utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character, whereas the UTF-8 standard encodes characters in up to 4 bytes. When a 4-byte UTF-8 character (such as many emoji and other characters outside the Basic Multilingual Plane) is inserted into a MySQL utf8 column and strict SQL mode is not enabled, MySQL issues only a warning and truncates the string at that point, discarding the 4-byte character and all subsequent data. Because the stored value then differs from the value the application validated, this behavior can be exploited to bypass input validation and filtering mechanisms, potentially leading to security vulnerabilities such as stored cross-site scripting (XSS) attacks. Testing confirmed that your application truncates strings containing 4-byte UTF-8 characters while preserving 3-byte characters, which indicates potential exposure to this MySQL behavior.
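The effect described above can be illustrated with a short Python sketch. It does not talk to MySQL; the helper truncate_like_utf8mb3 is a hypothetical stand-in that mimics the non-strict behavior by cutting a string at the first character above U+FFFF, and the sample payload is invented for illustration only.

```python
def truncate_like_utf8mb3(text: str) -> str:
    """Mimic a non-strict insert into a utf8 (utf8mb3) column.

    Hypothetical helper: cuts the string at the first code point above
    U+FFFF, i.e. the first character that needs 4 bytes in UTF-8.
    """
    for index, char in enumerate(text):
        if ord(char) > 0xFFFF:
            return text[:index]
    return text


# Illustrative input that a filter might accept because the tag is well formed.
submitted = "<abbr title='x onmouseover=alert(1) \U0001F600'>text</abbr>"
stored = truncate_like_utf8mb3(submitted)

# The stored value ends before the emoji, so the closing quote and tag are lost.
print(repr(stored))
print(stored == submitted)
```

The point of the sketch is only that the value written to the database is no longer the value that was validated; whether a given truncated fragment is exploitable depends on how the application later renders it.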
Remediation
Implement the following measures to prevent UTF-8 truncation vulnerabilities:
1. Migrate to the utf8mb4 character set: Convert your MySQL database tables and columns from utf8 to utf8mb4, which supports the full UTF-8 character range, including 4-byte characters.
ALTER TABLE your_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
2. Enable MySQL strict mode: Configure MySQL to use strict SQL mode, which rejects invalid data with an error instead of silently truncating it. Add the following under the [mysqld] section of your MySQL configuration file (my.cnf or my.ini):
sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
3. Validate input at the application layer: Before inserting data, check for 4-byte UTF-8 characters and either reject the input or handle it appropriately:
// PHP example
if (preg_match('/[\x{10000}-\x{10FFFF}]/u', $input)) {
    // Handle 4-byte characters: reject, strip, or encode
    throw new Exception('Input contains unsupported characters');
}
4. Update database connection settings: Ensure your application's database connection uses the utf8mb4 charset:
// PHP PDO example
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', $user, $pass);
5. Test thoroughly: After implementing changes, verify that 4-byte characters (such as emoji: 😀, 🎉) are properly stored and retrieved without truncation.
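One way to structure such a verification is sketched below in Python. The round_trip argument is an assumption: it stands for whatever function in your own test suite stores a string through the application and reads it back. The sample set and the name find_truncated are likewise hypothetical, chosen to cover one character of each UTF-8 byte length.

```python
from typing import Callable, Dict, List

# One sample character per UTF-8 encoded length.
SAMPLES: Dict[str, str] = {
    "1-byte": "A",
    "2-byte": "\u00e9",      # e with acute accent
    "3-byte": "\u20ac",      # euro sign
    "4-byte": "\U0001F600",  # grinning face emoji
}


def find_truncated(round_trip: Callable[[str], str]) -> List[str]:
    """Return the labels of samples that do not survive store-and-fetch.

    round_trip is supplied by the caller (hypothetical): it should write
    the string to the database and return what is read back.
    """
    failures = []
    for label, char in SAMPLES.items():
        # Text after the sample character reveals truncation, not just loss
        # of the character itself.
        payload = f"before{char}after"
        if round_trip(payload) != payload:
            failures.append(label)
    return failures


# With an identity function standing in for a correctly configured stack,
# no sample is reported.
print(find_truncated(lambda value: value))
```

An empty result suggests every byte length round-trips intact; a result containing only "4-byte" matches the behavior this finding describes.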