Optimizer Cost Model Improvements in MySQL 5.7.5 DMR

In a previous blog post we presented some of the issues with the current optimizer cost model and listed several ideas for improvements. The new 5.7.5 DMR contains the result of our initial work on improving the optimizer’s cost model:

  • Cost Model for WHERE Conditions. In previous versions of MySQL, the estimated number of rows from a table that will be joined with the next table only took into account the conditions used by the access method. This often led to record estimates that were far too high, and thus to very inaccurate cost estimates for the join. With wrong cost estimates, the join optimizer might not find and choose the best join order. To solve this issue, a cost model that includes the entire query condition on the table when calculating the number of rows produced from a table has been implemented. This model estimates the filtering effect that the table’s conditions have on the number of records read from the table. For more details on how condition filtering works, see two earlier blog posts on this topic: part 1, part 2. With this feature, the join optimizer should in many cases be able to find a better join order and produce a better query plan.
  • More Accurate Index Statistics. When joining tables, index statistics are used for selecting which index to use. In MySQL the index statistics contain an estimate for the number of records that exist per key value within the index. Due to the previous use of integer values for this records-per-key estimate, the index statistics had a precision that was too low for indexes with high selectivity or cardinality. There had previously been no means to weigh the relative value of an index that has an average of 1.9 records per key value against one that has 1.1 records per key value. This problem is now fixed by switching to the use of floating point values for the index statistics within the MySQL server. Since the index statistics are maintained by the storage engines, the more accurate index statistics must be supplied by the storage engine. In the 5.7.5 DMR, InnoDB has been changed to provide the more accurate index statistics using the new floating point format.
  • Configurable Cost Constants. Previously, the basic cost values that the optimizer used when calculating cost estimates for queries were hard-coded within the source code. Examples of such cost constants were the cost of reading a disk page, the cost of evaluating a query condition on a record, or the cost of comparing key values. The hard-coded cost constants have now been replaced by user-configurable costs. Two new tables—server_cost and engine_cost—have been added to the mysql system database. The cost values can now be changed by simply updating the corresponding rows in these two new tables.
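As a sketch of how the new cost tables might be used (the table and column names follow the 5.7 layout; the specific cost name and value below are only an example, and a NULL cost_value means the compiled-in default is used):

```sql
-- Inspect the current cost constants.
SELECT cost_name, cost_value FROM mysql.server_cost;
SELECT cost_name, cost_value FROM mysql.engine_cost;

-- Example: change the per-row evaluation cost (value is illustrative).
UPDATE mysql.server_cost
   SET cost_value = 0.15
 WHERE cost_name = 'row_evaluate_cost';

-- Make the server re-read the cost tables.
FLUSH OPTIMIZER_COSTS;
```

Note that FLUSH OPTIMIZER_COSTS makes the updated values take effect for new sessions; already running sessions keep the cost constants they started with.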
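To make the condition-filtering idea concrete, here is a small sketch (not MySQL source code) of how folding the whole WHERE clause into the row estimate changes the number that feeds the join optimizer. The function name and the selectivity values are hypothetical, chosen only for illustration:

```python
# Illustrative sketch (not the MySQL implementation): how including
# the entire table condition changes the estimated number of rows
# that will be joined with the next table.

def estimate_rows_out(rows_read, selectivities):
    """Rows expected to survive all conditions on the table.

    rows_read:     rows fetched by the chosen access method.
    selectivities: filtering effect of each remaining condition,
                   each in the range (0, 1].
    """
    rows = rows_read
    for s in selectivities:
        rows *= s
    return max(rows, 1.0)  # never estimate fewer than one row

# An index range scan fetches 1000 rows; two additional WHERE
# conditions filter further, with assumed selectivities 0.1 and 0.5.
old_estimate = 1000                                  # access method only
new_estimate = estimate_rows_out(1000, [0.1, 0.5])   # 50.0
```

With the old model, 1000 rows would be assumed to flow into the next join step; with condition filtering, the estimate drops to 50, which can change which join order looks cheapest.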

We have also done some initial refactoring of the cost model code. The main focus here has been to replace the use of hard coded cost constants with function calls into the new cost model API. As part of this work we also introduced a new cost_estimate object that’s used for keeping track of cost estimates within the new cost model code.

If you are attending MySQL Central @ OpenWorld next week, I will present the MySQL Cost Model talk, where I will go into greater detail about all of the new cost model related features that are implemented in the new 5.7.5 DMR. I look forward to seeing you there!

We value your feedback! Please let us know if you encounter any bugs, or if you have more general feedback.

UPDATE: My OpenWorld presentation is now available here!

4 thoughts on “Optimizer Cost Model Improvements in MySQL 5.7.5 DMR”

  1. How does this estimate selectivity for condition filters when there isn’t an index on the columns involved? Is sampling done? Are there hardwired assumptions? The previous blog post mentioned here states that range optimizer analysis is used without explaining what that does.

    1. If there is no index on the columns involved, we use fixed selectivity estimates for different types of conditions. For instance, if the condition is an equality (=), then we assume that 1 out of 200 records will qualify. If the condition is >, >=, <, or <=, we assume that 1 out of 3 records qualifies. For a more complete list, see the description in WL#6635. Note that it is likely that some of these constants will be changed when we get more experience with this feature.

      If we later add statistics for non-indexed columns, it will be trivial to replace these constants with those statistics.
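The default guesses described in this reply can be sketched as a simple calculation (the constants come from the reply above; the function and query shape are illustrative only):

```python
# Default selectivity guesses for conditions on non-indexed columns,
# as described in the reply above: equality 1/200, range 1/3.
# These are illustrative constants, not a MySQL API.
EQUALITY_FILTER = 1 / 200
RANGE_FILTER = 1 / 3

def filtered_rows(table_rows, filters):
    """Estimated rows remaining after applying independent filters."""
    est = table_rows
    for f in filters:
        est *= f
    return est

# A 100000-row table with WHERE col1 = ? AND col2 > ? and no usable
# index: 100000 * (1/200) * (1/3) ~ 166.7 rows expected to qualify.
est = filtered_rows(100_000, [EQUALITY_FILTER, RANGE_FILTER])
```

Because the filters are assumed independent, their effects simply multiply, which is why stacking several conditions can shrink the estimate quickly.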

    1. Both of these optimizations can be really good when the data needs to be read from disk. But if the data is in a memory buffer, using these adds a small CPU overhead. For MRR we did a rewrite of the cost model in 5.6, so the choice of whether to use MRR or not is cost-based and should work for some cases. The reason it is not always selected is that the cost model is very conservative, to avoid using MRR in cases where it is possible that the entire table is already in memory.

      For BKA, there is currently no cost model. As you say, it has to be turned on manually and then it will affect all queries.

      I hope we will be able to implement more statistics in the storage engine about whether the data for a given table is in memory or needs to be read from disk. When we get this, it will be much easier to make a good model for automatically choosing whether MRR and BKA should be used.
