InnoDB Full-Text : N-gram Parser

The default InnoDB full-text parser is ideal for latin based languages where whitespace is the token or word separator, but for languages like Chinese, Japanese, and Korean (CJK)—where there is no fixed separators for individual words, and each word can be compromised of multiple characters—we need a different way to handle the word tokens. I’m now very happy to say that in MySQL 5.7.6 we’ve made use of the new pluggable full-text parser support in order to provide you with an n-gram parser that can be used with CJK!

What Exactly Is N-gram?

In full-text search, an n-gram is a contiguous sequence of n characters from a given sequence of text. For example, with n-gram we tokenize the text “abcd” as follows:

How Can You Use the N-gram Parser with InnoDB?

The new n-gram parser is loaded and enabled by default. To use it, you would simply specify the WITH PARSER ngram clause in your relevant DDL statements. For example, any of the following statements are valid in MySQL 5.7.6 and later:

We also have a new global server variable called ngram_token_size (a token being roughly equivalent to a word, made up of n characters). The default value is 2 (bigram), and the valid range is from 1 to 10. The next natural question then becomes: what token size value should I choose? 2, or bigram, is recommended for general use with CJK, but you can choose any valid value based on this simple rule: set the token size to the largest token you expect to search for. If we only ever want to search for single characters, then we should set ngram_token_size to 1. The smaller the ngram_token_size the smaller the index, and the faster the full-text searches using that index will generally be. The downside, however, is that you’re then limiting the token size and thus what you can search for. For example, ‘Happy Birthday’ in English translates to ‘生日高興’ in traditional Chinese, where ‘Happy’ == ‘高興’ and ‘Birthday’ == ‘生日’. As you can see, each respective word or token is comprised of two individual characters, so in order to search for each token we need to have ngram_token_size set to 2 (the default) or higher.

More on N-gram Tokenization

The n-gram parser differs from the default full-text parser in the following ways:

  1. Token sizes — innodb_ft_min_token_size and innodb_ft_max_token_size do not apply. You instead specify token handling with ngram_token_size.
  2. Stopword handling — we handle stopwords a little differently. Usually if a tokenized word itself (an exact match) is in the stopword table, then it will not be added to the full-text index. With the n-gram parser, however, we check whether the tokenized word contains any word in the stop word table and if so then we do not index that token. The reason for this difference in behavior is that we have so many frequent and meaningless characters, words, and  punctuation with CJK. If we use the contains method to match stopwords, then we can eliminate more useless tokens.
  3. White space — white space is always a hard-coded stopword. As an example, this means that ‘my sql’ is always tokenized to ‘my’, ‘y ‘, ‘ s’, ‘sq’, and ‘ql’, and ‘y ‘ and ‘ s’ will not be indexed.

We can query INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE and INFORMATION_SCHEMA.INNODB_FT_TABLE_TABLE to see exactly what tokens have been indexed within a specific full-text index. This is a very useful tool for debugging purposes. For example, if a word doesn’t show up in a full-text search results as expected, then the word may simply not be indexed for one reason or another (stop words, token sizes, etc.) and you can query these tables to be find out for sure. Here’s a simple example:

More on Search Result Processing with N-gram

Text Searches

  • In NATURAL LANGUAGE MODE, the text searched for is converted to a union of n-gram values. For example, 'sql' is converted to 'sq ql' (with a default token size of 2 or bigram). Here’s an example:
  • In BOOLEAN MODE searches, the text searched for is converted to an n-gram phrase search. For example, 'sql' is converted to '"sq ql"':

Wildcard Searches

  • If the prefix is shorter than ngram_token_size, then the query returns all rows that contain n-gram tokens starting with that prefix. For example:
  • If the prefix length is equal to or greater than ngram_token_size, then the wildcard search is converted to a phrase search and the wildcard is ignored. For example, 'sq*' is converted to '"sq"', and 'sql*' is equivalent to '"sq ql"':

Phrase Searches

  • A phrase search is converted to a phrase search of n-gram tokens. For example, "sql" is converted to "sq ql". For some examples:

If you would like to learn more general details about InnoDB full-text search, please see the InnoDB Full-Text Index section of the user manual and Jimmy’s excellent Dr. Dobb’s article. For more details about the N-gram parser specifically, please see the N-gram parser section in the user manual.

We hope that you find this new feature useful! We’re very happy to have improved CJK support throughout MySQL 5.7, and this is a big part of that. If you have any questions please feel free to post them here on the blog post or in a support ticket. If you feel that you have encountered any related bugs, please let us know via a comment here, a bug report, or a support ticket.

As always, THANK YOU for using MySQL!

4 thoughts on “InnoDB Full-Text : N-gram Parser

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.