InnoDB Full-Text: MeCab Parser

In addition to our general CJK support, as detailed in this blog post, we’ve also added a MeCab parser. MeCab is a Japanese morphological analyzer, and we now have a full-text plugin parser based on it!

How Would I Use It?

  1. Set the mecab_rc_file option: mecab_rc_file is a read-only system variable pertaining to the MeCab parser. The mecabrc file that it points to is a configuration file required by MeCab, and it should have at least one entry of the form dicdir=/path/to/ipadic, which tells MeCab where to load the dictionary from. Once MySQL is installed, we have a default bundled mecabrc file in /path/to/mysql/install/lib/mecab/etc/mecabrc, and we have three dictionaries within the /path/to/mysql/install/lib/mecab/dic directory: ipadic_euc-jp, ipadic_sjis, and ipadic_utf-8. We’ll need to modify the mecabrc file to specify which one of these three dictionaries we want to use.

    Note: If you have your own dictionary, you can use that instead. Many additional options can also be specified within the mecabrc file. For more information about that configuration file, please see the documentation here.

    For our testing purposes here, let’s load ipadic_utf-8 using these steps (in my case, MySQL 5.7.7 is installed in /usr/local/mysql):

    1. Add an entry in the mecabrc file like this:
    2. Add an entry in the [mysqld] section of /etc/my.cnf like this:
  2. Set innodb_ft_min_token_size — The recommended value is 1 or 2 with the MeCab parser (the default value is 3). We will use 1 for the following examples.
  3. Install the MeCab Plugin:
  4. Create a Full-Text Index with MeCab (NOTE: With 5.7.6, you will need to use utf8 instead of utf8mb4. As of 5.7.7, the MeCab parser plugin supports the eucjpms, cp932, and utf8mb4 character sets; see Bug#20534096):
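
The four steps above can be sketched roughly as follows, assuming MySQL is installed in /usr/local/mysql as in this walkthrough. The two configuration entries appear as comments because they belong in mecabrc and /etc/my.cnf rather than in SQL. The library name libpluginmecab.so (typical for a Linux install) and the table name articles are assumptions for illustration, so adjust them to your setup.

```sql
-- Step 1: in /usr/local/mysql/lib/mecab/etc/mecabrc, point MeCab at the
-- UTF-8 dictionary:
--   dicdir=/usr/local/mysql/lib/mecab/dic/ipadic_utf-8
--
-- Steps 1 and 2: in the [mysqld] section of /etc/my.cnf, then restart mysqld.
-- The loose- prefix lets the server start even before the plugin is installed.
--   loose-mecab-rc-file=/usr/local/mysql/lib/mecab/etc/mecabrc
--   innodb_ft_min_token_size=1

-- Step 3: install the MeCab parser plugin (library name assumed for Linux).
INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';

-- Optional check that the plugin is registered and active.
SELECT PLUGIN_NAME, PLUGIN_STATUS
  FROM INFORMATION_SCHEMA.PLUGINS
 WHERE PLUGIN_NAME = 'mecab';

-- Step 4: create a table whose full-text index uses the MeCab parser.
-- The table and column names are hypothetical. On 5.7.6, use utf8 here
-- instead of utf8mb4.
CREATE TABLE articles (
  id    INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT (title, body) WITH PARSER mecab
) ENGINE=InnoDB CHARACTER SET utf8mb4;
```

Because mecab_rc_file and innodb_ft_min_token_size are both read at startup, the configuration edits need a server restart before the plugin is installed and the index is created.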

More on MeCab Tokenization

Let’s look at an example that demonstrates how the word tokenization is done:
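
One way to inspect the tokens, sketched here under the assumption that a table named articles with a MeCab full-text index on (title, body) lives in a database named test (both names are hypothetical), is to insert a row and then read the full-text index cache:

```sql
-- Make sure the connection character set matches the table's.
SET NAMES utf8mb4;

-- Insert a sample row; the text means "database management".
INSERT INTO articles (title, body)
VALUES ('データベース管理', 'データベース管理');

-- Point the INFORMATION_SCHEMA full-text tables at our table
-- (format: 'database_name/table_name'; requires sufficient privileges).
SET GLOBAL innodb_ft_aux_table = 'test/articles';

-- Each row returned is one token produced by the parser for the
-- recently inserted, not yet flushed, rows.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
```

The WORD column of the result shows how MeCab split the inserted text into tokens.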

More on Full-Text Searches with MeCab

Text Searches

  • In NATURAL LANGUAGE MODE, the text searched for is converted to a union of search tokens. For example, '日本の首都' is converted to '日本 の 首都'. Here’s a working example:
  • In BOOLEAN MODE searches, the text searched for is converted to a phrase search. For example, '日本の首都' is converted to '"日本 の 首都"'. Here’s a working example:
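
The two modes can be compared with a sketch like the one below. It assumes a table named articles (hypothetical) with a MeCab full-text index on (title, body); the sample rows are made up for illustration.

```sql
SET NAMES utf8mb4;

-- Hypothetical sample rows: "The capital of Japan is Tokyo" and
-- "Kyoto is an old capital of Japan".
INSERT INTO articles (title, body) VALUES
  ('東京', '日本の首都は東京です'),
  ('京都', '京都は日本の古い都です');

-- Natural language mode: the search text is tokenized and the tokens are
-- combined as a union, so a row matching any of them can be returned.
SELECT id, title, body
  FROM articles
 WHERE MATCH(title, body) AGAINST('日本の首都' IN NATURAL LANGUAGE MODE);

-- Boolean mode: the same text is treated as a phrase search, so the
-- tokens must appear together and in order.
SELECT id, title, body
  FROM articles
 WHERE MATCH(title, body) AGAINST('日本の首都' IN BOOLEAN MODE);
```

Running both queries against the same data is a quick way to see how much narrower the phrase interpretation is than the union of tokens.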

Wildcard Searches

  • We don’t tokenize the text of a wildcard search. For example, for '日本の首都*' we search for terms with the prefix '日本の首都' as a whole, which may not produce any matches because the index holds the individual tokens that MeCab produced. Here are two working examples:
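
A sketch of two such wildcard searches, assuming a table named articles (hypothetical) with a MeCab full-text index on (title, body) and rows containing text such as '日本の首都は東京です':

```sql
SET NAMES utf8mb4;

-- The prefix '日本' is likely to coincide with an indexed token,
-- so this search can find rows.
SELECT id, title, body
  FROM articles
 WHERE MATCH(title, body) AGAINST('日本*' IN BOOLEAN MODE);

-- The prefix '日本の首都' is not tokenized, so it is compared as a whole
-- against the indexed tokens and may match nothing.
SELECT id, title, body
  FROM articles
 WHERE MATCH(title, body) AGAINST('日本の首都*' IN BOOLEAN MODE);
```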

Phrase Searches

  • A phrase search is tokenized by MeCab. For example, "日本の首都" is converted to "日本 の 首都". Here’s a working example:
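
A phrase search might look like the sketch below, again assuming a table named articles (hypothetical) with a MeCab full-text index on (title, body):

```sql
SET NAMES utf8mb4;

-- The double quotes inside the search string mark a phrase; MeCab
-- tokenizes it and the tokens must then match in sequence.
SELECT id, title, body
  FROM articles
 WHERE MATCH(title, body) AGAINST('"日本の首都"' IN BOOLEAN MODE);
```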

MeCab Limitations

The MeCab parser only supports three specific character sets: eucjpms (ujis), cp932 (sjis), and utf8 (utf8mb4). If there is a mismatch between what MeCab is using and what the InnoDB table is using (for example, the MeCab character set is ujis but the full-text index is utf8/utf8mb4), then you will get a character set mismatch error when attempting the search.
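
To check for such a mismatch up front, you can compare the character set MeCab reports with the one your indexed columns use. This is a sketch that assumes the plugin is installed and that the indexed table is test.articles (hypothetical names):

```sql
-- Character set of the dictionary MeCab loaded, as reported by the plugin.
SHOW STATUS LIKE 'mecab_charset';

-- Character set of each character column in the indexed table.
SELECT COLUMN_NAME, CHARACTER_SET_NAME
  FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = 'test'
   AND TABLE_NAME   = 'articles'
   AND CHARACTER_SET_NAME IS NOT NULL;
```

If the two disagree, either switch the dicdir entry in mecabrc to the matching ipadic dictionary or recreate the table with a matching character set.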

If you would like to learn more general details about InnoDB full-text search, please see the InnoDB Full-Text Index section of the user manual and Jimmy’s excellent Dr. Dobb’s article. For more details about the MeCab parser specifically, please see the MeCab parser section in the user manual.

We hope that you find this new feature useful! We’re very happy to have improved CJK support throughout MySQL 5.7, and this is a big part of that. If you have any questions please feel free to post them here on the blog post or in a support ticket. If you feel that you have encountered any related bugs, please let us know via a comment here, a bug report, or a support ticket.

As always, THANK YOU for using MySQL!
