MySQL latin1 → utf8 (Wordpress upgrade)

Spurred on by mass hacking, I’ve updated my old (version 2.0.3) Wordpress install to something a little newer.

I decided to err on the side of caution and upgrade a copy of the live DB first - and I’m glad I did. I saw lots of problems with accents and symbols getting munged in the upgrade - well, that’s what I thought.

In reality, the symbols were being screwed in the backup-and-restore process. Words such as “naïve”, £-signs and various typographic quotes were obviously getting messed up through character-encoding issues.

After a bit of sleuthing, I found the output from mysqldump wasn’t valid. The problem was caused by a combination of default connection settings in a .my.cnf, and the fact that the old Wordpress install was storing utf-8 characters inside a latin1 database.

My MySQL default settings are:

[client]
default-character-set=utf8

Because we’re pulling latin1 data over a utf-8 connection, MySQL starts doing character-set conversions, and screws up a bunch of text that it thinks is latin1, but in reality is actually already in utf-8.

The fix for the backup process was to override my default settings with latin1:

mysqldump --default-character-set=latin1 --opt -h db.example.com -u user -ppassword schema > db-backup-latin1-20080707.sql

That worked fine, and the next step was reloading it into the DB as utf-8. This required a little bit of string replacement using a command-line utility bundled with MySQL. If you’re going to do this yourself, watch what you type: I somehow typed in “lastin1” halfway through and lost an hour or so trying to figure out what went wrong. Anyhow, here’s the command-line:

sed \
  -e 's/CHARSET=latin1/CHARSET=utf8/g' \
  -e 's/SET NAMES latin1/SET NAMES utf8/g' \
  < db-backup-latin1-20080707.sql > db-backup-utf8-20080707.sql

You should now be able to blat / restore / overwrite your DB and ensure all tables are in the appropriate character set, ready for a smooth wordpress upgrade.

mysql --opt -h db.example.com -u user -ppassword schema < db-backup-utf8-20080707.sql