MySQL 8.0 Collations: The devil is in the details.

One of the challenges of language specific collations, is making sure they are accurate in the edge-cases of sometimes lesser-used language features. Since I am Norwegian, let me use the Danish collation (which is identical to Norwegian collation) as an example:

Most Scandinavian people know that in Danish (and Norwegian), we have three extra letters: ‘Æ’, ‘Ø’ and ‘Å’ and they follow after ‘Z’ in that order. Some also know that ‘Ö’ is treated like ‘Ø’ and ‘Ü’ is treated like ‘Y’. In addition, we have an obsolete alternative to ‘Å’ which is ‘AA’ (well, not that obsolete, since the official name of Århus in Denmark has been Aarhus since 2010).

And the utf8_danish_ci collation in MySQL adheres to the above mentioned rules. And that is mostly OK. But there’s more to it.

ANGSTROM SIGN

There is an distance unit called Angstrom (0.1 nm) which has the sign ‘Å’. This letter has Unicode value U+212B, as opposed to the letter ‘å’ which is U+00E5. These two letters are supposed to be treated as the same letter in ai/ci collations:

So we get, for various collations, when comparing ‘a’, ‘å’, and ‘Å’ (Angstrom sign, not the Scandinavian uppercase letter ‘Å’):

As shown above, the old utf8 collations are correct only for the common letters, while the new utf8mb4 collations for Unicode 9.0.0 are correct for all the letters in the example.

LATIN CAPITAL LIGATURE OE

And there’s even more: The letters ‘ ű’, ‘ð’, ‘đ’, ‘œ’ and ‘þ’ (and their upper case counterparts) also need special treatment in Norwegian and Danish.

Let’s have a look at ‘Œ’ (U+0152). In Danish (and Norwegian) ‘Œ’ is supposed to collate as the Scandinavian letter ‘Ø’, but with utf8_danish_ci we get the order:

while with utf8mb4_da_0900_ai_ci we get the correct order:

For most applications these are subtleties which does not matter much, but for a few,  these details may be important.

 

Conclusion

As you can see, we have sought to make the new collations in MySQL 8.0 not just usable for the 99% of cases, but instead make them complete for all use cases.

Try out some of your favorite language-specific collations and see if they meet your expectations. And please report any bugs you might find. To find the complete list of new collations in MySQL 8.0, do the command:

About Bernt Marius Johnsen

Bernt has been working with software for more than three decades. As a developer he has worked with expert systems, compilers, distributed systems, banking software and several databases. He has been a Senior member of MySQL's Server QA team since 2010.

3 thoughts on “MySQL 8.0 Collations: The devil is in the details.

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.