UTF-8 and MySQL

MySQL seems to operate in latin1 encoding by default, not in UTF-8, even though in many cases (such as most Linux distributions), the system uses UTF-8 for everything else by default. Today, this situation caused a confusing problem for me.

From 30,000 feet, the application is a Java web application that reads data from a MySQL database. The database is populated by a bunch of sql scripts (run via MySQL batch mode). Simple stuff. And all was good until we started using UTF-8 symbols, like ¢, ®, and ™ in the text in the scripts. These symbols came out all weird: instead of “¢” the page shows “¢”.

The first thing I checked was the encoding the .sql script files themselves were using. Turns out, they were stored as cp1251 (yay for Windows users). I used gedit to convert them to UTF-8 (there were only 4… if there were more, it would have been worth using iconv). Then I had to manually fix all the encoding errors in the files, which was time consuming.

Next, I checked the database itself. The tables were all stored in the latin1 character set with latin1_swedish_ci collation… so a quick alter table, and I was using the utf8 character set with utf8_general_ci collation.

I re-ran the scripts like this: mysql -hhost -uuser -ppassword < script.sql

And still… the site displayed the same, messed up characters.

I checked the documentation for MySQL Connector/J (the JDBC driver), and it says that the connection is UTF-8, and all should be fine. According to the browser, the server was returning UTF-8 data. Plus, I was seeing the actual UTF-8 symbols (¢ for example), just with extra garbage (Â) in front of them.

After much searching, I ran across the “default-character-set” parameter to the mysql command line client. It turns out that the command line client doesn’t detect the source file’s encoding using the byte order marker (BOM) or anything else, it just uses a default, which is usually latin1. So I changed my command, and tried again with: mysql -hhost -uuser -ppassword –default-character-set=utf8 < script.sql

And finally, the page showed everything correctly.

CC BY-SA 4.0 UTF-8 and MySQL by Craig Andrews is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

2 thoughts on “UTF-8 and MySQL

  1. > I checked the documentation for MySQL Connector/J (the JDBC driver), and it says that the connection is UTF-8

    I just surfed in and noticed your DB problem here. Actually, your MySQL Connector encoding was the original cause. If you had ensured the text in the scripts was UTF-8 and forced the connection to match the single-byte (latin1) encoding of the DB table, then you would have had a binary-safe storage layer.

    What threw you off track here was changing the table encoding to UTF-8. As soon as you did that, it became necessary to guarantee everything between the script and the table was UTF-8 as well, to prevent implicit latin1-to-UTF8 conversion. (That conversion is bad for binary data)

    On the bright side, now all of your string columns will use chars instead of bytes 😛

  2. Thanks, this is really helpful!

    One note – “default-character-set” needs to be preceded by two dashes though in your post this seems to have been converted to a hyphen so I can’t be copied and pasted.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.