The project I’m currently working on is a Java project, powered by Spring and built with Maven. I build and run it locally on Ubuntu Linux, but everyone else on the team uses Windows XP. We recently got back a set of translations that included pages in Chinese, Japanese, Korean, and Russian – all languages with characters not included in ISO 8859-1.
The first issue we encountered was that Java will not read properties files (the standard key=value .properties format) in any character encoding except ISO 8859-1. So we converted all the .properties files into XML properties files with the “<?xml version="1.0" encoding="UTF-8" ?>” declaration for the encoding. And all was well in the world (from my perspective at least).
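As a rough illustration of the XML properties format, here is a minimal sketch that loads one with `java.util.Properties.loadFromXML`. The `XmlPropertiesDemo` class, the `greeting` key, and the sample content are made up for this example and are not taken from the project. The Russian text is written with Unicode escapes so the Java source file itself stays pure ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class XmlPropertiesDemo {
    // Hypothetical file content; the "greeting" key is an assumption for this sketch.
    // The value is the Russian word for "hello", spelled with Unicode escapes.
    static final String SAMPLE_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"greeting\">\u041f\u0440\u0438\u0432\u0435\u0442</entry>\n"
        + "</properties>\n";

    // Parses XML properties from raw bytes; the XML declaration tells the
    // parser which encoding the bytes are in.
    static Properties loadXml(byte[] xmlBytes) {
        Properties props = new Properties();
        try (ByteArrayInputStream in = new ByteArrayInputStream(xmlBytes)) {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Simulate a file saved as UTF-8, matching the declaration.
        byte[] onDisk = SAMPLE_XML.getBytes(StandardCharsets.UTF_8);
        Properties props = loadXml(onDisk);
        // Compare in memory rather than trusting the console's own encoding.
        System.out.println("\u041f\u0440\u0438\u0432\u0435\u0442".equals(props.getProperty("greeting")));
    }
}
```

The comparison is done in memory on purpose: what a terminal prints depends on its own encoding, which is a separate problem from whether the file was read correctly.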
I built and ran the project, and checked out some of the (what I call, at least) exotic language pages. I saw Cyrillic characters in Russian, so I was happy. But when someone built and ran the project on a Windows computer, they saw boxes and question marks.
The problem is that Java picks its default character encoding from whatever the environment specifies. On my Ubuntu computer, the environment is set to UTF-8, but on Windows, it’s set to cp1252 (an MS proprietary extension of ISO 8859-1). So when Maven copies files during the build process, Java re-encodes the files to cp1252, which cannot represent Cyrillic or CJK characters, and that results in lots of question marks and boxes and other such problems.
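The question marks are easy to reproduce outside of Maven. The sketch below is only an illustration of the encoding step, not a reproduction of what Maven does internally; the `EncodingRoundTrip` class and its `roundTrip` helper are hypothetical names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Encodes text to bytes with one charset, then decodes with another,
    // the way a "read, then write back out" copy step would.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Russian for "hello", written with Unicode escapes to keep the source ASCII.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // What this JVM picked up from the environment (or from -Dfile.encoding).
        System.out.println("default charset: " + Charset.defaultCharset());

        // cp1252 has no Cyrillic, so every letter becomes the replacement byte '?'.
        System.out.println(roundTrip(russian, cp1252, cp1252)); // prints "??????"

        // UTF-8 can represent the text, so it survives unchanged.
        System.out.println(roundTrip(russian, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(russian)); // prints "true"
    }
}
```

Once the characters have been replaced with `?` the original text is gone for good, which is why the fix has to happen at build time rather than when the pages are displayed.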
The solution is to add MAVEN_OPTS="-Dfile.encoding=UTF-8" to the environment before you run mvn. That overrides Java’s detection of Windows’ cp1252 encoding, and makes everything work.
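To check whether a `-Dfile.encoding` override actually reached a JVM, a small probe like the following can help. It is a sketch with a made-up class name (`ShowDefaultEncoding`), and the values it reports depend on the Java version and platform it runs on.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    // Reports both the raw system property and the charset the JVM settled on.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as `java ShowDefaultEncoding` and once as `java -Dfile.encoding=UTF-8 ShowDefaultEncoding` on a Windows machine should show whether the flag changes the default there.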
Now can someone tell me why Windows doesn’t do UTF-8 natively, especially in a world with more speakers of these exotic languages than of languages that can be expressed in cp1252?
Windows, Java, and an Internationalization Mess by Craig Andrews is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.