MySQL 8.0 Regular expressions and character properties

In MySQL 8.0 we have replaced the old regular expression library with the ICU regex library.  See Martin’s blog on the topic. The main goal is to get full Unicode support for regular expressions, but in addition we get a lot of neat features. The documentation of regular expressions in MySQL 8.0 is found here, and the documentation of ICU regular expressions is found here. This blog will highlight some of the Unicode related features that are new to MySQL 8.0.

A sample table

We will in this blog work with a table called demo with the following content:

Character properties

Character properties (or character classes as they are called in the MySQL documentation) have been greatly extended in 8.0.

The main properties you can check for are 1) what Unicode calls general properties and 2) script names.

General properties are L (Letter), N (Numerals), C (Controls), Z (Separators), P (punctuation), S (Symbols) and M (Marking). All these have subcategories, like e.g. Lu (Uppercase letters), Ll (Lowercase letters) etc. See “General category” in “Unicode File Format” for an exhaustive list.

Properties can be checked with the patterns \p{property}  or  [:property:]  while negative properties (not having the property) can be checked with the pattern \P{property} .

Example 1: List the letters in the demo table.

Example 2: List all the non-letters in the demo table.

Scripts may be specified by the ISO-15924 code or a property value alias. Use the code or the alias as a property or more specific script=code or script=alias .

Example 3: List all Devanagari characters:

Example 4: The Japanese use Kanji, Katakana and Hiragana as well as Romaji (Latin characters), let’s list those:

Example 5: That’s all fine, but we want to exclude Roman numerals from the list. That can be done with the  &&  operator like this (Numerals that are letters have the character category Nl):

Example 6: And finally, while talking about numerals, we can also search for numerals with a certain numeric value:

 

Character codepoint values.

Instead of characters, you may use Unicode codepoint values in patterns. Use \uxxxx for 16-bit values and \U00xxxxxx for 32-bit values:

Those codepoint values may be used in ranges too, so in order to find all characters that have a codepoint value outside BMP (have a code point value larger than 0xFFFF):

 

There are vast possibilities, so check out the MySQL and ICU/Unicode documentation on the topic and play with it.

Thanks for using MySQL!

About Bernt Marius Johnsen

Bernt has been working with software for more than three decades. As a developer he has worked with expert systems, compilers, distributed systems, banking software and several databases. He has been a Senior member of MySQL's Server QA team since 2010.

One thought on “MySQL 8.0 Regular expressions and character properties

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.