Archive

Archive for January 22nd, 2009

UTF-8 and MySQL

January 22nd, 2009 2 comments

MySQL seems to operate in latin1 encoding by default, not in UTF-8, even though in many cases (such as most Linux distributions), the system uses UTF-8 for everything else by default. Today, this situation caused a confusing problem for me.

From 30,000 feet, the application is a Java web application that reads data from a MySQL database. The database is populated by a bunch of sql scripts (run via MySQL batch mode). Simple stuff. And all was good until we started using UTF-8 symbols, like ¢, ®, and ™ in the text in the scripts. These symbols came out all weird: instead of “¢” the page shows “¢”.

The first thing I checked was the encoding the .sql script files themselves were using. Turns out, they were stored as cp1251 (yay for Windows users). I used gedit to convert them to UTF-8 (there were only 4… if there were more, it would have been worth using iconv). Then I had to manually fix all the encoding errors in the files, which was time consuming.

Next, I checked the database itself. The tables were all stored in the latin1 character set with latin1_swedish_ci collation… so a quick alter table, and I was using the utf8 character set with utf8_general_ci collation.

I re-ran the scripts like this: mysql -hhost -uuser -ppassword < script.sql

And still… the site displayed the same, messed up characters.

I checked the documentation for MySQL Connector/J (the JDBC driver), and it says that the connection is UTF-8, and all should be fine. According to the browser, the server was returning UTF-8 data. Plus, I was seeing the actual UTF-8 symbols (¢ for example), just with extra garbage (Â) in front of them.

After much searching, I ran across the “default-character-set” parameter to the mysql command line client. It turns out that the command line client doesn’t detect the source file’s encoding using the byte order marker (BOM) or anything else, it just uses a default, which is usually latin1. So I changed my command, and tried again with: mysql -hhost -uuser -ppassword –default-character-set=utf8 < script.sql

And finally, the page showed everything correctly.

Categories: Uncategorized Tags: