Rankings with InnoDB Full-Text Search

Summary
Since MySQL 5.6 went GA—where among many other new features, we introduced full-text indexes for the InnoDB storage engine—questions have occasionally come up about InnoDB full-text search relevancy rankings when doing BOOLEAN MODE searches. Typically these questions revolved around a core issue: why do the results differ from that of MyISAM? In short, the InnoDB document search and relevancy rankings were modeled after the open source Sphinx full-text search engine, and not after the MyISAM storage engine (which sometimes simply produces incorrect results). The algorithms used for the InnoDB implementation are based on the well known BM25 and TF-IDF ranking algorithms.



Variables
Before we get to some examples, we should first talk about the variables involved that can alter the results (the resulting differences are also often the cause of slightly different results between InnoDB and MyISAM). I won’t go through each one again here, as they (innodb_ft_*) are described in the manual page. The ones that directly affect search results—innodb_ft_min_token_size, innodb_ft_max_token_size, innodb_ft_server_stopword_table, innodb_ft_user_stopword_table, innodb_ft_enable_stopword—are also discussed here. These variables affect the core concepts involved:

  • TOKENS — The index actually contains tokens, which you can simply think of as individual words. The common token separators or word boundaries (at least in latin based languages) are whitespace characters and punctuation characters: dashes, underscores, periods, etc. If a word is shorter than the min_token size, then it will not be indexed and thus not be used for the searches. The same is true if the word is longer than the max_token size.
  • STOPWORDS — A stopwords list contains words that will explicitly be skipped when creating the index records. So any word in this list will not be used for searches.



How the Calculation Is Done
The relevancy ranking system we use is a variation of “term frequency / inverse document frequency” (TF-IDF). In short, the more often the word appears in an individual document (“document” in this context is simply a full-text index record), and the less frequently the word appears in all of the documents, the higher the individual document is ranked.

The IDF (inverse document frequency) value of a search word is calculated as:
${IDF} = log10( ${total_records} / ${matching_records} )

Unlike the IDF value, the TF (term frequency) value is calculated for each matching record, and it is simply the number of times that the search word appears in that matching document. If the document contains the search word/token multiple times, then its IDF value is multiplied by the TF value.

The final relevancy ranking for a document is then calculated this way:
${rank} = ${TF} * ${IDF} * ${IDF}



Creating Some Sample Data
Let’s first create some sample data, so that we can walk through a simple example:



A Simple Example
So far, all of this may sound more complicated than it really is, so let’s look at a simple example (using MySQL 5.6 Enterprise):

We have 8 records in total, with 3 of them matching the search term “database”, and the first record (id 6) contains the search term 6 times. Let’s now focus on the first record (id 6), and see how its relevancy ranking was calculated at 1.0886961221694946. I’ll use easy to understand pseudo code to explain it, but you can see the source code and the actual calculations that InnoDB uses here.

First we calculate the IDF value for the search term “database”:
${IDF} = log10( 8 / 3 ) = 0.42596873216370745

Then we factor in the TF value (6) for the first record (id 6) in determining the final relevancy ranking value for it:
${rank} = (6 * 0.42596873216370745 * 0.42596873216370745) = 1.08869616468694

You may notice a slight difference between the two results: 1.08869616468694 versus 1.0886961221694946. That is the result of differences related to floating point precision and rounding decisions, and how they are done internally by InnoDB versus how they’re done in the mysql command-line client or in your chosen calculator (I happen to be using Calculator.app in OS X Mavericks in the calculations above).



Some Simple Multi-Word Search Examples
If you search for multiple words, then the final relevancy ranking value is simply a sum of the previously described calculation, done for each search word. Again, the calculation is:



Further Details
For further information, and a deeper dive into the functionality, please see Jimmy’s Dr. Dobb’s article.



Conclusion
Hopefully this has helped a bit in understanding the document relevancy rankings used for InnoDB full-text searches. If you feel that any specific results are incorrect, please let us know! When checking for correctness, the results should be validated against the BM25 and TF-IDF ranking algorithms, or simply compared against the open source Sphinx full-text search engine. If you have a specific odd case, feel free to mention the details in a comment below, or in a new bug report here.

2 thoughts on “Rankings with InnoDB Full-Text Search

Leave a Reply

Your email address will not be published. Required fields are marked *


9 × nine =

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">