One of the most confusing topics in web development is character sets. MySQL has a lot of features to help you with them, but when things go wrong, it can be a real pain to sort out. Read this and fear no more.
A little background
Most MySQL installations (still) default to the latin1 character set. This character set is also called ISO-8859-1 and is pretty much sufficient for most Western languages. It is in fact a superset of the original ASCII table, and is akin to Windows-1252, which is used as the default encoding on Microsoft Windows systems. Typically, content created on Windows machines (e.g. in Microsoft Office) will use that encoding rather than ISO-8859-1.
All these character sets are based on the principle of code pages: each character takes exactly one byte (no more, no less), and each byte value maps to a certain character (or “glyph”). The number of characters available in such a character set is therefore limited by the number of bytes used per character: one-byte code pages offer no more than 256 characters.
The UTF-8 character set, however, provides an extension system in which certain lead bytes define how many bytes (or “octets”) are used for the character in question. The first set of characters, in the range 0–127 (compatible with the ASCII table), take up one byte of space; the next set, covering most extra characters used in Western languages, takes 2 bytes; and so on, up to 4 bytes for the least frequently used characters. Being backward compatible with ASCII, while able to represent nearly all languages, makes UTF-8 the most practical and popular character set on the web.
What does MySQL do with character sets?
MySQL has a number of features when it comes to handling different character sets. To understand how MySQL handles them, you need to be aware of two types of settings.
1. The character set of a column
The character set of a column defines how MySQL should interpret string operations on that column. For example, counting the number of characters in a string using CHAR_LENGTH() requires knowing what character set the content is in, because UTF-8 can use multiple bytes per character:
SELECT CHAR_LENGTH(title) FROM item;
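The difference with the byte-oriented LENGTH() function makes this concrete. As a sketch, assuming the `item` table from the example above has a utf8 `title` column containing the single two-byte character ‘á’:

```sql
-- Assuming `item.title` is a utf8 column containing 'á':
SELECT LENGTH(title),       -- 2: the number of bytes
       CHAR_LENGTH(title)   -- 1: the number of characters
FROM item;
```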
2. The character set of the current connection
The character set the client is “speaking” is made known to the server by setting the connection character set, since MySQL needs to know what character set the client’s strings are in. A good example, again, is the CHAR_LENGTH() function, which can be used to determine the length of a client-supplied value too:
SELECT CHAR_LENGTH('my string') FROM item;
Silent encoding conversion
If there is a discrepancy between the connection and the storage character set, MySQL silently converts the content from one encoding to the other. Say you’re connected to a server with a latin1 connection character set, while the column specification states that the content in the table is utf8: MySQL will convert any write to that column from latin1 to utf8, and any read vice versa.
However, when you’re actually writing UTF-8 characters to the database over a latin1 connection, each of the UTF-8 byte sequences will be interpreted as separate latin1 characters. For the lower 128 code points this poses no problem, because they are identical ASCII in both latin1 and utf8. However, each byte of a multi-byte UTF-8 character is interpreted as a separate latin1 character and converted to UTF-8 individually when writing to the table, causing the content stored in the storage engine to be wrong.
Example 1: UTF-8 data, latin1 connection and latin1 column
Since latin1 is the default character set for most MySQL installations, this is probably the most common situation.
Consider the string ‘á’. This is a lowercase ‘a’ with an acute accent. In UTF-8, the character is represented as two bytes, hexadecimally denoted as C3 A1.
Consider the following table definition:
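The original definition is not reproduced here; a definition consistent with the session output below (table `t`, a latin1 column `s` of 10 characters) would look like this:

```sql
-- Assumed to match the example session below
CREATE TABLE t (
  s VARCHAR(10)
) CHARACTER SET latin1;
```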
As we insert UTF-8 data into this table, MySQL does nothing special. It just treats the two bytes sent as two latin1 characters, does no conversion, and saves them as such into the table. When reading back, still nothing special happens: the application gets the two bytes, treats them as one two-byte UTF-8 character, and nothing problematic occurs.
However, to demonstrate that MySQL does not actually know you only wished to store one character per byte pair, try to insert a string of 10 ‘á’ characters into the table. That will produce a warning, because only 5 of them will fit. The problem becomes even more apparent when we change the column to an odd length of 5:
mysql> insert into t(s) values('ááááá'); Query OK, 1 row affected, 1 warning (0.00 sec) mysql> show warnings; +---------+------+----------------------------------------+ | Level | Code | Message | +---------+------+----------------------------------------+ | Warning | 1265 | Data truncated for column 's' at row 1 | +---------+------+----------------------------------------+ 1 row in set (0.00 sec) mysql> select * from t; +-------+ | s | +-------+ | áá? | +-------+ 1 row in set (0.00 sec)
The first 4 bytes are consumed by the first two characters (each consisting of two bytes), and the last byte is filled with the first byte of the third character because it is truncated by MySQL. This actually renders an invalid UTF-8 sequence, hence the display of the ‘?’ as the third character above. The following demonstrates that:
mysql> select HEX(s) FROM t; +------------+ | HEX(s) | +------------+ | C3A1C3A1C3 | +------------+ 1 row in set (0.00 sec)
The last byte (C3) is actually the first byte of the third ‘á’, whose second byte was cut off by the truncation.
Example 2: UTF-8 data, latin1 connection and utf8 column
Consider the same string ‘á’ in UTF-8 represented as two bytes:
Consider the following table definition:
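Again, the original definition is not reproduced here; a definition consistent with the description below (the same table `t`, but with a utf8 column) would be:

```sql
-- Assumed to match the example below
CREATE TABLE t (
  s VARCHAR(10)
) CHARACTER SET utf8;
```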
When inserting UTF-8 data into the table using a latin1 connection, the following actually happens:
The string inserted into the table is sent as the two bytes C3 and A1, which are in turn the latin1 characters ‘Ã’1 and ‘¡’ (an inverted exclamation mark). MySQL interprets these characters separately and converts them into their UTF-8 counterparts, C383 and C2A1 respectively, taking up a total of 4 bytes of space.
The binary equivalent, which shows how MySQL will interpret this query, is as follows:
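For the ‘á’ insert above, the query is equivalent to the following (using MySQL’s character set introducer and hexadecimal literal syntax):

```sql
INSERT INTO t (s) VALUES (_latin1 x'C3A1');
```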
_latin1 x'C3A1' means: “interpret the hexadecimal string ‘C3A1’ as latin1”. It would have been interpreted as UTF-8 if the connection had been utf8.
Now, when you read the data from the database, MySQL sees the two utf8 characters in the table (represented by the four bytes mentioned above) and needs to convert them to latin1, since your client connection is set to latin1. The conversion described above is inverted, so the first character is converted back to the latin1 ‘Ã’ and the second to ‘¡’. As long as no one interferes, these two bytes are sent to your application, which in turn sends them to the browser with a charset=utf-8 content type, and you’ll never notice the problem.
However, the problem does become horribly apparent as soon as you ask MySQL to tell you the length of the value inserted:
mysql> SELECT LENGTH(s) FROM t; +-----------+ | LENGTH(s) | +-----------+ | 4 | +-----------+ 1 row in set (0.00 sec)
Here you can see that MySQL actually stored 4 bytes of data in the table, telling you the string that is in the table is not 2 bytes in length, as you might expect, but 4. Again, displaying the HEX value of the stored data demonstrates that:
mysql> select hex(s) from t; +----------+ | hex(s) | +----------+ | C383C2A1 | +----------+ 1 row in set (0.00 sec)
Setting the right connection encoding
As you might have gathered from the examples above, the culprit isn’t MySQL. It is your application telling MySQL that it’s about to send ISO-8859-1 encoded data while actually sending UTF-8. You need to tell MySQL that the data you’re sending is already UTF-8, so no conversion needs to take place when inserting it into the table.
You can do that by using the SET NAMES query:
SET NAMES utf8;
Now MySQL will interpret all string data sent over the connection as UTF-8, and no conversion overhead is incurred internally. You can also be confident that any data originally stored as latin1 will be converted to utf8 on the way out, which is the character set your application expects. So the decision is not about what you’re planning to store; it’s about what character set the application feeds to, and retrieves from, the connection.
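You can verify what the connection is currently set to by inspecting the character set variables; after SET NAMES utf8, the client, connection and results variables should all read utf8:

```sql
SHOW VARIABLES LIKE 'character_set_%';
-- character_set_client, character_set_connection and
-- character_set_results are the ones affected by SET NAMES
```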
The other character set configurations
There are more places in MySQL where you can tell MySQL something about character sets. However, they all serve as defaults. So, you don’t really need them but it’s good to know they’re there.
- Server level: Defines the default character set for newly created databases
- Database level: Defines the default character set for newly created tables
- Table level: Defines the default character set for columns added to that table
As long as you keep in mind that they merely provide defaults, ultimately for determining column character sets, there’s really nothing more to know.
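For example, being explicit at each level avoids depending on those defaults altogether (the database and table names here are purely illustrative):

```sql
-- Explicit character sets at every level; no reliance on server defaults
CREATE DATABASE app CHARACTER SET utf8;

CREATE TABLE app.item (
  title VARCHAR(100)
) CHARACTER SET utf8;
```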
Figuring out which character set is defined for a column
Typical MySQL clients provide options to show what encoding is used for columns. Whenever you want to ask MySQL directly, you can use the following query:
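One such query, consistent with the description below (the `item` table is the running example), is:

```sql
SHOW FULL COLUMNS FROM item;
```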
The third column displayed is the collation in use. The first part of a collation name always indicates the character set: latin1_swedish_ci is a collation of the latin1 character set, and similarly, for utf8_general_ci the character set is utf8.
Converting UTF-8 data in a latin1 column to utf8
Whenever you alter a column specification from one charset to another, MySQL assumes that the data in that column was encoded correctly, and will therefore convert the original data to the new encoding. This makes sense, because you shouldn’t have been inserting UTF-8 data into a latin1 column in the first place 😉
With the following little recipe you can change single columns. First alter the column and change its definition to a binary intermediate; then alter it again to the desired character set. This recipe applies to Example 1 above.
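A sketch of that recipe for the Example 1 table (adjust the table, column names and sizes to your own schema):

```sql
-- Step 1: convert to a binary intermediate; the raw bytes are kept as-is
ALTER TABLE t MODIFY s VARBINARY(10);

-- Step 2: reinterpret those same bytes as utf8
ALTER TABLE t MODIFY s VARCHAR(10) CHARACTER SET utf8;
```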
This works for TEXT fields (with the corresponding BLOB intermediates, using the appropriate size prefixes whenever necessary) and VARCHAR fields (with VARBINARY). Remember to include the full column specification (including DEFAULT and such); it won’t be remembered otherwise. Also, be careful not to truncate values by assuming that VARCHAR(10) is the same as VARBINARY(10): VARCHAR(10) means at most 10 characters long, VARBINARY(10) means at most 10 bytes long, and multi-byte characters need more than one byte each. When unsure, use a BLOB as the intermediate type.
Also, be aware of keys: whenever you have indexes on (VAR)CHAR fields, you need to drop them first when using a BLOB intermediate format.
Converting doubly encoded data
In Example 2 above, the data is in fact encoded twice. First, you need to convert the UTF-8 characters back to latin1. That seems like an unnatural solution, but it is the right one: you have been telling MySQL you were speaking latin1, and MySQL returned the favor by converting that “latin1” to UTF-8. Reverting that process means converting from UTF-8 back to latin1, which can be accomplished by the following query.
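Assuming the table matches Example 2, that conversion reads as follows; ALTER TABLE … CONVERT TO converts every character column in the table at once:

```sql
ALTER TABLE t CONVERT TO CHARACTER SET latin1;
```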
Note that, with this query, all character columns in the table are converted to the best of MySQL’s knowledge of its contents.
Now you’re back to “square one”, as in Example 1: you have utf8-encoded data inside latin1 columns, so follow the steps above to convert to utf8 with the intermediate binary format.

To wrap up:
- Know how UTF-8 is different from codepage-based encodings
- Get comfortable inspecting and identifying encoded data, either in PHP using iconv, or in MySQL using the HEX() and LENGTH() functions
- Use that knowledge to find out whether column contents actually match their specification. If they don’t, convert them accordingly.
- Always explicitly define the connection character set in your application.
- Be explicit in your CREATE and ALTER statements. If your development server has a different default character set than the staging one, you’re screwed otherwise.
- Choose one encoding type and stick to it. Choose wisely. Prefer UTF-8.
- Whenever a headache does come up, rethink the problem in binary form. You’ll get a better understanding of the issue and the way to solve it.
Happy unicoding 🙂
I am pretty sure that Ã looks familiar to you, by the way, as will probably Â, which is C2 in ISO-8859-1. These two characters are the number one indication that UTF-8 content is being displayed as ISO-8859-1, as they are both lead bytes of two-byte UTF-8 sequences. ↩