Some Notes on Index Statistics in InnoDB

In MySQL 5.6 we introduced a huge improvement in the way that index and table statistics are gathered by InnoDB and subsequently used by the Optimizer during query optimization: Persistent Statistics. Some aspects of the way that Persistent Statistics work could be improved further though, and we’d really like your input on that.

How much to sample?

The statistics are gathered by picking some pages semi-randomly, analyzing them, and deriving some conclusions about the entire table and/or index from those analyzed pages. The number of pages sampled can be specified on a per-table basis with the STATS_SAMPLE_PAGES clause. For example: ALTER TABLE t STATS_SAMPLE_PAGES=500;
This way you can request that more pages be sampled for larger tables in order to make the estimates more accurate. The more pages that are sampled, the better the estimates will be and thus the better the chance that the Optimizer will be able to choose optimal query plans, with the only downside being that the index statistics gathering operation (the sampling) itself becomes slower.
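Putting the pieces above together, the per-table sample size can be set with a simple ALTER, and the resulting estimates inspected afterwards (the table name `t` is illustrative):

```sql
-- Persistent statistics are on by default in 5.6 (innodb_stats_persistent=ON).
-- Sample 500 leaf pages per index when statistics for this table are rebuilt:
ALTER TABLE t STATS_SAMPLE_PAGES=500;

-- Force an immediate recalculation using the new sample size:
ANALYZE TABLE t;

-- The stored estimates are visible in the persistent statistics tables:
SELECT * FROM mysql.innodb_index_stats WHERE table_name = 't';
```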

Being able to specify the number of pages to sample on a per-table basis is flexible, but what if one could specify the value as a percentage of the table size rather than as an absolute number? Would that make sense? Setting such a value globally would not meet every need and use case, because a given percentage may not be sufficient to get a good estimate on a small table, while the same percentage may make the sampling process too slow on a very large table. Setting the percentage on a per-table basis may also have little advantage over the existing functionality, as one already knows whether the table is huge or tiny and can set an appropriate value as an absolute number of pages to sample. What do you think? We would love to get your input on this! Do you see a lot of value in being able to set the number of pages to sample as a percentage of the table size, or to have it depend on the table size in some other way? Cast your vote:

Make it possible to set the amount of pages to ANALYZE as a % of the table size?


How often to sample?

Currently, if 10% of the table has changed (rows updated, deleted, or inserted) since the previous statistics gathering operation then InnoDB will automatically trigger a recalculation of the statistics (it will analyze some leaf pages as discussed in the previous paragraph, reading them from disk if they are not already cached).

In MySQL 5.5 and older, the auto-recalculation threshold was 6.25%, and the statistics were updated so often that they caused too many execution plan changes. In general, we've found that 10% is a good overall threshold, and we haven't seen a great need to make it configurable. But do you agree? Do you want to be able to change that value globally, or even on a per-table basis? Would you increase it or decrease it if you could? Cast your vote:
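While the 10% threshold itself is fixed, automatic recalculation can already be switched off globally or per table, and statistics can then be refreshed on a schedule of your choosing (the table name `t` is illustrative):

```sql
-- Disable automatic statistics recalculation server-wide:
SET GLOBAL innodb_stats_auto_recalc=OFF;

-- Or opt a single table out, regardless of the global setting:
ALTER TABLE t STATS_AUTO_RECALC=0;

-- Statistics can then be refreshed explicitly, e.g. from a scheduled job:
ANALYZE TABLE t;
```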

Make it possible to change the 10% threshold that triggers an automatic ANALYZE?


We look forward to your feedback! And as always, THANK YOU for using MySQL!

5 thoughts on “Some Notes on Index Statistics in InnoDB”

  1. Using a percentage value to specify the number of pages to analyze has the big advantage that it automatically adapts to a growing or shrinking table; that’s why I advocate it.

    I’m not sure whether a 10% change rate is too high, too low, or a good compromise for triggering the new analysis: if a big data load operation is running, all but the last analysis will be a waste of effort. I can imagine combining that value with some “age” check: after 10% change, the table becomes a candidate for a new analysis, but the analysis is started only if the table has then been unchanged for time X. For that time, I feel 30 to 60 minutes is a reasonable value. If massive changes continue during that time, postpone the start.

    Thanks for sharing your thoughts!
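The “change threshold plus quiet period” idea from the comment above could be sketched as a simple debounce check. This is a toy illustration of the proposal, not server behavior; the function name and threshold values are hypothetical:

```python
import time

CHANGE_THRESHOLD = 0.10   # fraction of rows changed since the last ANALYZE
QUIET_PERIOD = 30 * 60    # seconds the table must be idle before recalculating

def should_recalculate(changed_rows, total_rows, last_change_ts, now=None):
    """Return True only when the table is both 'dirty enough' (>= 10% changed)
    and 'quiet enough' (no modifications for QUIET_PERIOD seconds)."""
    if now is None:
        now = time.time()
    if total_rows == 0:
        return False
    dirty = changed_rows / total_rows >= CHANGE_THRESHOLD
    quiet = (now - last_change_ts) >= QUIET_PERIOD
    return dirty and quiet
```

Under this scheme a bulk load keeps resetting `last_change_ts`, so recalculation is deferred until the load finishes, which is exactly the waste the commenter wants to avoid.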

  2. I would like the system to trigger the update automatically, just like SQL Server does. By the time the stats have changed by 10%, they could already have caused some bad plans. Why not gather the stats as the data is inserted, updated, and deleted, in an async process?

  3. Both are reasonable, depending on table size and row-count volatility.

    I do not remember seeing any consideration of ‘HOW OFTEN’ statistics analysis will be considered, OR of building a ‘STATISTICS ANALYSIS’ queue of table names to be analyzed. It is possible most installations could live with Task Scheduler/cron-type functionality, where the DBA can name the day of the week, hour, and minute to start the process, and limit how long users must tolerate this ‘reasonably heavy’ workload before suspending active statistics building OR the search for candidate tables.

    Of course, this would also lead to the need for the ability to view/manage the StatisticsAnalysisQueue, and then the question becomes how many StatisticsAnalysisQueueThreads can be concurrently active, similar to the INNODB_PURGE_THREADS concept (which is NOT dynamic – unfortunately).

    With the prevalence of multi-CPU servers today, many installations could run many active threads between 10 PM and 7 AM; the daytime users certainly would not care.

    Thanks for asking the questions.
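The queue-plus-worker-threads idea in the comment above could be sketched as follows. This is purely illustrative; the class and method names are hypothetical and nothing here corresponds to an actual MySQL API:

```python
import queue
import threading

class StatsAnalysisQueue:
    """Toy sketch of a 'STATISTICS ANALYSIS' queue: table names are queued
    and drained by a configurable number of concurrent worker threads
    during a maintenance window."""

    def __init__(self, num_threads=2, analyze=None):
        self.pending = queue.Queue()
        self.num_threads = num_threads
        # Stand-in for running ANALYZE TABLE against the server:
        self.analyze = analyze or (lambda table: None)
        self.completed = []
        self.lock = threading.Lock()

    def enqueue(self, table_name):
        self.pending.put(table_name)

    def run_window(self):
        """Drain the queue with num_threads concurrent workers."""
        def worker():
            while True:
                try:
                    table = self.pending.get_nowait()
                except queue.Empty:
                    return
                self.analyze(table)
                with self.lock:
                    self.completed.append(table)

        threads = [threading.Thread(target=worker)
                   for _ in range(self.num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

A real implementation would also need the duration limit and suspend/resume behavior the commenter describes; this sketch only shows the queue-draining core.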
