Improvements to the MySQL `IGNORE` Implementation

In 5.7.5, as a part of the larger effort to improve error handling, we re-implemented the IGNORE clause (WL#6614). The IGNORE clause is a MySQL extension to the SQL standard. It changes how errors that occur for individual rows are handled. The new implementation aims to make the behavior of the IGNORE clause more consistent.

Statements which support the IGNORE clause are:

  • INSERT [ IGNORE ]
  • UPDATE [ IGNORE ]
  • DELETE [ IGNORE ]
  • LOAD DATA [ IGNORE ]
  • LOAD XML [ IGNORE ]
  • CREATE TABLE… [ IGNORE ] SELECT

When the INSERT statement is used to insert a number of rows into a table, an exception during processing would normally abort the statement and return an error message. With the IGNORE keyword, rows that cause certain exceptions are ignored, but the remaining rows are inserted and the execution of the statement is regarded as successful. The same principle applies to UPDATE, DELETE, LOAD, and CREATE…SELECT statements (the IGNORE keyword only affects the DML part of a CREATE…SELECT statement).

The IGNORE clause has two basic functions which we’ll now describe.

Independent of ‘STRICT’ mode

The IGNORE clause downgrades errors to warnings and continues the processing of a statement. The IGNORE keyword affects the following error codes:

  • ER_SUBQUERY_NO_1_ROW : Subquery returned more than 1 row when one is expected
  • ER_ROW_IS_REFERENCED_2 : Foreign key constraint violation in parent table
  • ER_NO_REFERENCED_ROW_2 : Foreign key constraint violation in child table
  • ER_BAD_NULL_ERROR : NOT NULL constraint violation
  • ER_DUP_ENTRY : Unique key constraint violation
  • ER_DUP_ENTRY_WITH_KEY_NAME : Unique key constraint violation
  • ER_DUP_KEY : Unique key constraint violation
  • ER_VIEW_CHECK_FAILED : CHECK constraint for view fails
  • ER_NO_PARTITION_FOR_GIVEN_VALUE : No partition in the table for the given value
  • ER_NO_PARTITION_FOR_GIVEN_VALUE_SILENT : No partition in the table for the existing value
  • ER_ROW_DOES_NOT_MATCH_GIVEN_PARTITION_SET : Row found that does not match the given partition set

The row which caused the error is skipped and the operation continues with the next row. One warning is output for each ignored exception, so if two rows violate a UNIQUE constraint we will get 2 warnings. Constraints are never actually violated inside the database; the offending rows are simply not written.

A Practical Use Case for IGNORE

We have a table ‘t’ created with the following statement:
CREATE TABLE t(a INT) ENGINE = InnoDB;

Let’s look at an example of 3 rows with 1 duplicate among them.

We cannot add a UNIQUE constraint directly here because of the duplicate, but there is a workaround using the IGNORE clause (note that when adding the UNIQUE key this way, we lose the data found in the duplicate rows).
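A sketch of how this can look (the second table and the key name are illustrative):

INSERT INTO t VALUES (1), (2), (2);        -- 3 rows, 1 duplicate
-- ALTER TABLE t ADD UNIQUE KEY(a);        -- would fail with a duplicate-key error

CREATE TABLE t2 LIKE t;
ALTER TABLE t2 ADD UNIQUE KEY(a);
INSERT IGNORE INTO t2 SELECT * FROM t;     -- the duplicate row is skipped with a warning
DROP TABLE t;
RENAME TABLE t2 TO t;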

With only 3 rows there is little reason to use IGNORE, but when the number of rows is very large it can be of much help; the above simply serves as an illustration.

When ‘STRICT’ Mode is ON

STRICT mode controls how MySQL handles invalid or missing values in data changing statements such as INSERT or UPDATE. With STRICT mode ON, MySQL produces an error for invalid values and aborts the statement. Using the IGNORE keyword in a statement cancels the effects of STRICT mode for that statement. The following error codes are not upgraded from warnings to errors by STRICT mode when the IGNORE clause is used:

  • ER_TRUNCATED_WRONG_VALUE : Truncated incorrect value
  • ER_WRONG_VALUE_FOR_TYPE : Incorrect value for function
  • ER_WARN_DATA_OUT_OF_RANGE : Out of range value
  • ER_DIVISION_BY_ZERO : Division by 0
  • ER_TRUNCATED_WRONG_VALUE_FOR_FIELD : Incorrect value for column at row
  • WARN_DATA_TRUNCATED : Data truncated for column
  • ER_DATA_TOO_LONG : Data too long for column
  • ER_BAD_NULL_ERROR : Column cannot be null
  • ER_NO_DEFAULT_FOR_FIELD : Field doesn’t have a default value
  • ER_NO_DEFAULT_FOR_VIEW_FIELD : Field of view underlying table doesn’t have a default value
  • ER_CUT_VALUE_GROUP_CONCAT : Row was cut by GROUP_CONCAT()
  • ER_DATETIME_FUNCTION_OVERFLOW : Datetime function field overflow
  • ER_WARN_TOO_FEW_RECORDS : Row doesn’t contain data for all columns
  • ER_WARN_NULL_TO_NOTNULL : Column set to default value; NULL supplied to NOT NULL column
  • ER_INVALID_ARGUMENT_FOR_LOGARITHM : Invalid argument for logarithm
  • ER_WRONG_ARGUMENTS : Incorrect arguments

In these cases MySQL inserts the adjusted values for invalid or missing values and produces warnings.

An Example Use Case
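A quick sketch of the difference (hypothetical table; TINYINT has a maximum of 127):

SET sql_mode = 'STRICT_TRANS_TABLES';
CREATE TABLE nums (a TINYINT NOT NULL);

INSERT INTO nums VALUES (1000);          -- STRICT mode: the statement aborts with an out-of-range error
INSERT IGNORE INTO nums VALUES (1000);   -- succeeds: the value is adjusted to 127 and a warning is raised
SHOW WARNINGS;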

Bugs Fixed by This Worklog

  1. Bug#6196: INSERT IGNORE should return warnings
    For the error codes listed in the first case above (those handled independently of STRICT mode), there were no warnings before this worklog when a constraint failed. For example, INSERT IGNORE would silently ignore duplicate values.
  2. Bug#43895: Multi-DELETE IGNORE does not report warnings
    Same case as mentioned in #1 above.
  3. Bug#68726: Update trigger invoked when update ignore means that no update is performed
    When IGNORE is used with an UPDATE statement, for all supported error codes there will be no error and the statement will be marked as having successfully executed. For the rows which did not get updated, however, the AFTER UPDATE trigger will not be executed.

InnoDB: Supporting Page Sizes of 32k and 64k

In the new InnoDB lab release we support page sizes of 32k and 64k. This gives users even more choices on the page size, allowing you to further customize InnoDB for your particular workload.

There are some things worthy of note related to this new feature:

  1. The extent size changes when innodb_page_size is set to 32k or 64k.
    The extent size is 2M for 32k page sizes, and 4M for 64k page sizes (the extent size is 1M for 4k, 8k, and 16k page sizes). If we do not enlarge the extent size then we will have too many extent headers on the allocation bitmap page, and the bitmap page will overflow.
  2. The innodb_log_buffer_size default value changes when the innodb_page_size is set to 32k or 64k. We change the innodb_log_buffer_size default value from 8M to 16M in this case. If we don’t increase the innodb_log_buffer_size then we may suffer from a lack of free log pages because we have fewer available log buffer pages.
  3. We do not support compression when innodb_page_size is set to 32k or 64k.
    When innodb_page_size is set to 32k or 64k, you will get the following warning if you include a ROW_FORMAT=COMPRESSED or a KEY_BLOCK_SIZE clause in your table DDL:
    Warning 1478 InnoDB: Cannot create a COMPRESSED table when innodb_page_size > 16k
    If innodb_strict_mode is ON, you will instead get the following error:
    Error 1031 Table storage engine for 't1' doesn't have this option
  4. The maximum record size is 16k when innodb_page_size is set to 64k.
    In theory we can store a record whose size is about 50% of the page size. In ROW_FORMAT=REDUNDANT, the record header contains 14-bit pointers to the end of each field, limiting the maximum record size to 16k. For simplicity, we enforce the 16k record size limit on all row formats when innodb_page_size is 64k.
  5. The innodb_page_size setting is global for all tablespaces within a single MySQL server instance.
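As a quick sketch of how this configuration looks in practice (the values are illustrative; note that innodb_page_size can only be chosen when a new data directory is initialized, it cannot be changed for an existing instance):

[mysqld]
innodb_page_size = 64k
# With 32k or 64k pages the innodb_log_buffer_size default becomes 16M;
# it can of course still be raised explicitly if needed.

With such a configuration, a CREATE TABLE that includes ROW_FORMAT=COMPRESSED or a KEY_BLOCK_SIZE clause will produce the warning or error shown in item 3 above, depending on the innodb_strict_mode setting.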

We look forward to your feedback on this new feature! Please let us know if you encounter any issues, or if you have any more general feedback.

Thank you for using MySQL!

InnoDB General Tablespaces – Preview

The new InnoDB Labs release contains the ability to create and use independent multi-table general tablespaces.

This feature will provide a way to group tables together into tablespaces at a location and filename of your choosing.  Tables using row formats of Redundant, Compact, and Dynamic can be combined together into the same general tablespace. Compressed tables with the same key_block_size can also be combined together.

The SQL syntax for creating an empty general tablespace is:
CREATE TABLESPACE `tblspace_name` ADD DATAFILE 'tablespace.ibd' [FILE_BLOCK_SIZE=n];

The filename can contain an absolute path or a path relative to the data directory where the InnoDB system tablespace (ibdata1) is found. The file in the example above would be created in the same directory as the InnoDB system tablespace. The file name extension ‘.ibd’ is required on all datafile names.

The FILE_BLOCK_SIZE parameter is only necessary if you want to create a general tablespace that holds compressed tables. The value can be specified in bytes or using the “k” shorthand notation, for example: 4096 or 4k.
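For example, a sketch of a general tablespace intended to hold compressed tables (the tablespace and table names are illustrative):

CREATE TABLESPACE `ts_zip` ADD DATAFILE 'ts_zip.ibd' FILE_BLOCK_SIZE=8k;
CREATE TABLE t_zip (c1 INT) TABLESPACE=`ts_zip` ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;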

A table can be added to a general tablespace with:
CREATE TABLE tbl_name TABLESPACE=`tblspace_name`;

Or it can be moved to a general tablespace  with:
ALTER TABLE tbl_name TABLESPACE=`tblspace_name`;

And finally, an empty tablespace can be deleted with:
DROP TABLESPACE `tblspace_name`;

This labs release also demonstrates the ability to move tables between the three types of InnoDB tablespaces — the system tablespace, file-per-table tablespaces, and general tablespaces — using the ALTER TABLE tbl_name TABLESPACE='tblspace_name'; statement.  If the specified tblspace_name value is “innodb_system” then the table will be re-created within the system tablespace.  If the specified tblspace_name value is “innodb_file_per_table”, then the table will be re-created as a single-table tablespace.  Thus you have full flexibility to move tables around within your database.
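A sketch of what these moves look like for a hypothetical table t1:

ALTER TABLE t1 TABLESPACE=`tblspace_name`;          -- move into a general tablespace
ALTER TABLE t1 TABLESPACE=`innodb_system`;          -- re-create it inside the system tablespace
ALTER TABLE t1 TABLESPACE=`innodb_file_per_table`;  -- re-create it as its own file-per-table tablespace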

Note that a general tablespace is not associated with any particular database/schema like a table is.  So it is possible to create tables in multiple database schemas using the same general tablespace, just as it has always been with the system tablespace. And while DROP DATABASE might drop all of the individual tables found within a general tablespace, it will not drop the tablespace itself.

We look forward to getting your input on this new feature! Please let us know if you encounter any issues or have any general feedback. Thank you for using MySQL!

Removing Scalability Bottlenecks in the Metadata Locking and THR_LOCK Subsystems in MySQL 5.7

The MySQL Server 5.7.5 Development Milestone Release, which was published recently, contains some significant changes to the metadata locking (MDL) subsystem and to the usage of the THR_LOCK manager for InnoDB tables. This post provides more information about these changes, which resulted in nice improvements in both scalability and performance.

Sometime during the development cycle of MySQL 5.6 we realized that the locks used in the metadata locking subsystem (MDL) implementation could become a scalability bottleneck when many short statements were executed in @autocommit=1 mode against InnoDB tables.

Bug #66473 is a good example (8-table Sysbench POINT_SELECT/InnoDB test). In this particular case the bottleneck was the mutex protecting the MDL_map hash which contained the MDL_lock objects describing the individual metadata locks. To some extent, we were able to solve the problem for this specific case by partitioning the MDL_map hash and the mutex protecting it in MySQL 5.6.8.

But this solution was not without its own problems — see bug #68487 (fixed in 5.6.15). Moreover, it didn’t solve the problem for the case when most of the statements accessed the same table (e.g. 1-table Sysbench POINT_SELECT/InnoDB test). This problem became even more prominent in 5.7 after some improvements were made in the InnoDB storage engine. After these changes we started to see the rwlock protecting the individual metadata lock (MDL_lock::m_rwlock) as the hottest lock in the server for such workloads.

So at some point during the Summer of 2013 we realized that a more drastic solution was needed.

We started by experimenting with a patch which disabled most of the metadata locking subsystem and thus removed the bottlenecks associated with it (and broke correctness in a big way). We immediately saw that the bottleneck moved to the THR_LOCK table-level lock manager, and the THR_LOCK::mutex in particular. Another patch was drafted that disabled THR_LOCK locking for InnoDB tables (which also broke correctness). With both these patches applied we saw the potential for significant scalability/performance improvements.

After these experiments it became obvious that we needed to address issues both in the metadata locking (MDL) subsystem and the THR_LOCK lock manager in order to improve scalability and performance:

  1. For MDL we decided to use the fact that workloads where this subsystem can become a bottleneck mostly involve DML statements. DDL statements are rare in them. DDL-specific metadata locks are likely to conflict with other metadata locks, while DML-specific metadata locks don’t conflict with each other. So it is possible to improve scalability by making acquisition of DML-specific locks cheap (e.g. it can be as simple as checking a flag and increasing a counter) and making acquisition of DDL-specific locks a bit more expensive. WL#7304 “Improve MDL performance and scalability by implementing ‘fast-path’ for DML locks” implements exactly this idea. Essentially, it decreases the size of the critical section associated with MDL_lock::m_rwlock for DML-specific metadata locks at the expense of DDL-specific locks, and thus makes MDL_lock::m_rwlock less hot.

    Moreover, we found that it was possible to develop this idea further and implement acquisition of DML-specific metadata locks as a lock-free atomic operation by using atomics support and the lock-free hash implementation already present in MySQL Server (and used in the Performance Schema). This resulted in WL#7305 “Improve MDL scalability by using lock-free hash” and WL#7306 “Improve MDL performance and scalability by implementing lock-free lock acquisition for DML”.

    After these changes, acquisition of a DML-specific lock consists of a look-up in a lock-free hash and then an atomic compare-and-swap operation on a single 64-bit value. This value contains counts of the various types of DML-specific locks acquired and flags indicating the presence of pending/active DDL-specific locks. If there are pending or active DDL-specific locks, we have to resort to the old approach using the MDL_lock::m_rwlock lock. Similarly, the release of a DML-specific lock is in the best case a single atomic compare-and-swap operation which decrements one of these counters (for all of the details, please read the above WLs).

    As a result of this change, MDL_lock::m_rwlock became insignificant in our benchmarks — we no longer see it in the list of hot locks, and MDL_map::m_lock/ MDL_map_partition::m_lock was removed completely as we now use a single lock-free hash.

    One of the interesting consequences of WL#7305 was that we had to change the handling of unused MDL_lock objects (these are objects representing metadata locks which are not acquired by anyone). In 5.6 such objects are simply moved to an LRU list of unused MDL_lock objects. If the list gets bigger than --metadata_locks_cache_size elements, the least recently used object is deleted. With the lock-free algorithm, deletion becomes more expensive, so some kind of caching policy was definitely needed, but implementing an LRU list gets complicated. So instead of deleting the least recently used element we delete a random unused element once the unused/used ratio reaches a certain threshold. This meant that the --metadata_locks_cache_size option no longer made sense so it was removed.

    Since MDL_map is now implemented using a single lock-free hash instead of several hash partitions, the --metadata_locks_hash_instances option no longer serves any purpose and was removed as well.

  2. The situation with the THR_LOCK manager looked a bit simpler. InnoDB doesn’t really need/rely on THR_LOCK locks for the execution of most of the statements. Instead InnoDB relies on its own row-level locks (for DML) and on metadata locks (for DDL and DML/DDL conflicts). The only exception was the LOCK TABLES READ statement for which acquisition of a strong THR_LOCK lock was necessary to properly isolate it from any concurrent DML. Once a new type of metadata lock was introduced, to replace THR_LOCK lock in this role, we were able to remove THR_LOCK acquisition for InnoDB tables completely. Doing this was not as easy as it might sound since there were various questions to answer:
    • What priority should the new lock get compared to other DDL and DML locks?
    • How do we prevent DML from starving out LOCK TABLES READ?
    • What should we do about the LOW_PRIORITY clause?

    As a result we ended up adding even more lock types to the metadata locking subsystem. See WL#6671 “Improve scalability by not using thr_lock.c locks for InnoDB tables” for details.

    While doing the above we were not able to completely avoid introducing incompatible changes. One thing which I would like to note in particular is that a multi-table UPDATE is now blocked by a concurrent LOCK TABLES READ on any table in its join, even if that table is only used for reading (see the sketch below).

    On the other hand, the introduction of a new metadata lock type for LOCK TABLES READ allowed us to fix the long-standing bug #57474 “Deadlock when DDL under LOCK TABLES WRITE, READ + PREPARE” as a bonus.
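    A minimal illustration of this incompatible change (hypothetical tables t1 and t2, two client sessions):

    -- Session 1:
    LOCK TABLES t2 READ;

    -- Session 2: t2 is only read by this statement, but starting with 5.7.5
    -- the UPDATE blocks until session 1 issues UNLOCK TABLES:
    UPDATE t1, t2 SET t1.a = t1.a + 1 WHERE t1.id = t2.id;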

The above changes removed the metadata locking subsystem and the THR_LOCK lock manager from the list of scalability bottlenecks for scenarios involving InnoDB tables. They became part of the MySQL 5.7.5 DMR (some of the changes were already available in the MySQL 5.7.4 DMR).

They are also partially responsible for the impressive numbers which Dimitri Kravtchuk got in his benchmarks. See his recent blog post “Indeed, MySQL 5.7 rocks : OLTP_RO Point-Selects 8-tables Benchmark”. Dimitri also presented these numbers at Oracle OpenWorld 2014 in “MySQL 5.7: Performance and Scalability Benchmarks”, and that page contains a link to his presentation.

A big Thank you! goes out to all who participated/helped with this work:

  • Mark Callaghan for reporting bug #66473 and the issues which led to bug #68487, and for urging us to look at scalability bottlenecks in the MDL subsystem
  • Dimitri Kravtchuk for his benchmarks and help in identifying the problems
  • Mikael Ronstrom for discussions about possible solutions to the problem
  • Jon Olav Hauglid for reviewing the design and the code of these worklogs
  • Shipra Jain, Vince Rezula, and Jeb Miller for QAing them
  • And last but not least, to Paul Dubois for documenting changes caused by these WLs

Thank you for using MySQL!

New InnoDB Memcached Functionality in 5.7 Lab Release

InnoDB Memcached’s read-only query performance in 5.7 has already achieved a remarkable 1.1 million QPS record. Now the read-only query bottleneck shifts to the memcached client itself, so anything that can batch queries and shorten the query strings helps.

Multiple Get

In the new 5.7 InnoDB Lab Release, we add support for the “multi-get” option within InnoDB Memcached. This functionality enables users to send multiple keys in a single “get” command. For client-server communication, the packet size is reduced because multiple keys are packaged within a single “get” call. For InnoDB, it no longer needs to start a transaction and open the table for each key when the keys are packaged together: the lookup of the keys and their corresponding values is done with a single open table within a single transaction, similar to a “SELECT … WHERE … IN ()” query.

Here’s a quick example showing how to retrieve 4 key/value pairs in a single get call:
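A sketch of such a session using the memcached text protocol (the keys, values, and flags/bytes fields shown are hypothetical):

get key1 key2 key3 key4
VALUE key1 0 6
value1
VALUE key2 0 6
value2
VALUE key3 0 6
value3
VALUE key4 0 6
value4
END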

Notice that it returns all 4 key/value pairs in a single response.

We have not yet conducted a performance evaluation for this feature; we will blog more about it once the performance results are available.

Range Queries with InnoDB Memcached

Another interesting memcached feature added in this Lab release is that memcached now supports range queries. This is a good example to show how flexible the InnoDB API and InnoDB Memcached can be.

Initially we support a single comparison operator placed directly before a key, with the whole expression prefixed by the special marker “@”.

For example, if we have the following table:
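For illustration, assume a memcached-mapped table whose keys are single letters (the schema, rows, and container mapping in innodb_memcache.containers are hypothetical):

CREATE TABLE city (c1 VARCHAR(32) PRIMARY KEY, c2 VARCHAR(1024)) ENGINE=InnoDB;
INSERT INTO city VALUES ('B', 'Berlin'), ('D', 'Denver'), ('H', 'Helsinki'), ('T', 'Tokyo');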

Then in order to get all results that are less than “H”, the following query can be issued:
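For instance, against the sketch table above this returns the rows with keys B and D:

get @<H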

To include values with “H”, you can use “<=”:
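That is (returning B, D, and H in the sketch above):

get @<=H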

To find rows larger than “B”:
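That is (returning D, H, and T in the sketch above):

get @>B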

In this Lab release, we only support single key point range searches. It can be easily extended moving forward, however, to range searches bounded by both sides (key1 < key < key2 etc.).

Summary

By supporting simple range searches we demonstrate the flexibility and versatility of the InnoDB APIs and InnoDB Memcached. In addition, it shows that InnoDB Memcached can take full advantage of the storage engine’s indexes and support functionalities beyond traditional memcached capabilities, all with superior performance.

Please let us know what you think of this new feature! We’d love to hear your feedback on what else you’d like to see in the InnoDB Memcached plugin moving forward. If you encounter any problems with the new feature, please let us know here in the comments, open a bug report at bugs.mysql.com, or open a support ticket. Thank you for using MySQL!

Back home from MySQL Central @ OpenWorld

Back home after an excellent week at MySQL Central in San Francisco. I want to thank the MySQL Community for all your questions, suggestions, and discussions. Your knowledge about, and passion for, MySQL is stunning. Thank you!

It was also good fun; see the pictures and comments under #MySQLCentral. Presentations are available online, for example my own What’s New in MySQL 5.7? [CON2314] and Alexander Nozdrin’s New Data Dictionary: An Internal Server API That Matters [CON4478].

And don’t miss the Keynote  Oracle MySQL Session 2014 — Screven and Ulin.

See you all next year!


Generated Columns in MySQL 5.7.5

Generated Columns is a new feature available in the latest lab release. This work is based on a contribution by Andrey Zhakov. Thanks, Andrey! The Optimizer team modified it to follow the current MySQL design, and to lift a number of limitations.

The syntax is:

<type>   [ GENERATED ALWAYS ]   AS   ( <expression> )   [ VIRTUAL|STORED ]
[ UNIQUE [KEY] ]   [ [PRIMARY] KEY ]   [ NOT NULL ]   [ COMMENT <text> ]

There are two kinds of Generated Columns: virtual (the default) and stored. Virtual means that the column will be calculated on the fly when a record is read from the table. Stored means that the column will be calculated when a new record is written to the table, and after that it will be treated as a regular field. Both types can have NOT NULL restrictions, but only a stored Generated Column can be part of an index.

There are still a few limitations on Generated Columns:

  • The generation expression used can only call native deterministic functions; stored routines and UDFs are not yet supported.
  • The length of the generation expressions for a table are subject to the limitations of the .frm file, and thus they can’t be longer than 64K in total. So you can have one field with a generation expression length of 64K, or you can have 30 fields with an average expression length of 2K each.
  • A Generated Column can’t refer to itself or to other Generated Columns that are later defined, but it can refer to any previously defined Generated Column. This limitation does not apply to regular columns. A Generated Column can refer to any regular one, no matter where it’s defined.

What can Generated Columns be used for? Many things. To name a few:

  • As a materialized cache for often used expressions:
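    A sketch of what this can look like (the table name and XML path are hypothetical):

    CREATE TABLE users (
      doc TEXT,
      username VARCHAR(64) GENERATED ALWAYS AS (ExtractValue(doc, '/user/username')) VIRTUAL
    );

    -- The expression is evaluated three times per record read:
    SELECT ExtractValue(doc, '/user/username')
      FROM users
     WHERE ExtractValue(doc, '/user/username') LIKE 'a%'
     ORDER BY ExtractValue(doc, '/user/username');

    -- Same result, but the expression is evaluated only once per record read:
    SELECT username FROM users WHERE username LIKE 'a%' ORDER BY username;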

    The result of those two queries is exactly the same, but in the first one ExtractValue(doc,’/user/username’) will be evaluated 3 times per record read, while in the second it is evaluated only once per record read. If the “username” column were defined as STORED, the generation expression would be evaluated only when a record is inserted or updated.

    Another similar case is that Generated Columns (GC) can be used to add flexibility by replacing often used expressions with a GC. For example, if you have a bunch of applications that work on the same database, then it might be practical to have a unified way to access the data without the need to keep all of the apps in sync.

  • Providing indexes for joins with non-relational data:
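    A sketch (hypothetical schema, assuming a regular customers(id, name) table): a STORED generated column extracts the join key from the XML document and is indexed, so the join can use ref access on that index.

    CREATE TABLE xml_orders (
      doc TEXT,
      customer_id INT GENERATED ALWAYS AS (ExtractValue(doc, '/order/customer_id')) STORED,
      KEY (customer_id)
    );

    SELECT c.name, o.doc
      FROM customers c
      JOIN xml_orders o ON o.customer_id = c.id;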

    Here ref access over an index is used to access a table with XML data.

  • Working around a limited set of partitioning functions:
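    A sketch (hypothetical schema), assuming the labs release accepts a STORED generated column as the partitioning column: ASCII() is not among the functions allowed in partitioning expressions, but a generated column can pre-compute it.

    CREATE TABLE people (
      name VARCHAR(64),
      name_key INT GENERATED ALWAYS AS (ASCII(LEFT(name, 1))) STORED
    )
    PARTITION BY RANGE (name_key) (
      PARTITION p_a_m VALUES LESS THAN (78),   -- 78 = ASCII('N')
      PARTITION p_n_z VALUES LESS THAN MAXVALUE
    );

    SELECT name FROM people WHERE name_key = 74;   -- 74 = ASCII('J')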

    Note that only one partition is going to be scanned due to partition pruning.

As always, there is more than one way to do things. Generated Columns add yet another means of solving a variety of interesting challenges, and they make it more convenient to deal with derivatives of relational and non-relational data. We look forward to seeing all of the interesting ways that you apply the feature!

InnoDB Native Partitioning – Early Access

The InnoDB labs release includes a snapshot of the InnoDB Native Partitioning feature.

To better understand why we implemented this, we need to start with some background on tables, storage engines, and handlers. In MySQL an open instance of a table has a handler object as an interface to the table’s storage engine. For a partitioned table there is a main table handler that implements the partitioning feature, but for storage, each partition has its own handler. This worked fairly well, but the more partitions you had, the more overhead you got from the per-partition handlers. So to remove this overhead for partitioned InnoDB tables we’re introducing Native Partitioning support! This means a new InnoDB partitioning-aware handler, so that we have a single handler object for a partitioned table and not one handler object per partition.

Let us create a simple table with 8k partitions:
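For example (a hypothetical table; 8192 is the maximum number of partitions MySQL allows):

CREATE TABLE t1 (
  a INT NOT NULL,
  b VARCHAR(255),
  PRIMARY KEY (a)
) ENGINE=InnoDB
PARTITION BY KEY (a) PARTITIONS 8192;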

If we compare the amount of memory used when opening a single instance of this table, first using the old generic non-native partitioning, and then with InnoDB Native Partitioning we see the following:
One open instance of the table takes 49% less memory (111MB vs 218MB) with the current state of Native Partitioning support. With ten open instances of the table, we take up 90% less memory (113MB vs 1166MB)!

The base 111MB is used internally by the InnoDB data dictionary cache; 86MB for index level info, 21MB for table level info, and 4MB for statistics. This is likely to decrease even further due to de-duplicating info for each partition.

The memory overhead with the old generic partitioning is due to the following factors:

  1. A handler per partition (~6MB per table instance)
  2. The internal caches (prebuilt struct) per partition (~29MB per table instance)
  3. The prefetch row cache per partition (~66MB per table instance)**
  4. Internal row templates (~2MB per table instance)

**The prefetch row cache is allocated for range and scan operations, so if you only use exact key searches it will not be allocated by the old partitioning engine.

By creating a partitioning aware InnoDB handler we also have a better foundation for supporting other features like Foreign Keys and FullText indexes with partitioned InnoDB tables.

We look forward to your feedback on this new feature! Please let us know if you run into any problems or have any other comments.

The Query Rewrite Plugins

Why Query Rewrites?

Now that the cost model project is progressing, most of you are going to notice execution plan changes. In the vast majority of the cases, the changes will be for the better, and some bugs with a long history will finally be closed. In some cases, however, you will notice that your queries run slower. This is inevitable: even if the MySQL optimizer is doing a much better job with the information it has, it may still be the case that the information was incomplete and that the best plan was, in fact, found by not trusting that information! Normally, we would just say “add an optimizer hint” and be over with it. But sometimes you can’t do that. For instance your query could be auto-generated from an application that you have no control over. This is why you want to intervene right before the query reaches the server. And on behalf of Sweden I apologize for Ace of Base.

When Query Rewrites?

Ideally, a query re-writer should be placed right between the client and the server, to take a quick sniff at the queries coming in and dose the foul-smelling ones with fragrant optimizer hints. But in practice it would probably have to have a complete SQL parser built-in. This would cause a large performance overhead if that re-writer was running on the same machine as the server. A common request has therefore been to be able to write plugins that can pick up the queries at the server side.

We now offer two APIs for writing query rewrite plugins. The first one is for when you know exactly – as in character-by-character exactly – what the offending queries look like. This one has a hook to intercept the query string right before it’s parsed. Unsurprisingly, we call it the pre-parse rewrite plugin API. The second one comes in right after parsing and acts on the parse tree. It offers the basic functionality to walk over the parsed query, which is a lot more efficient than dealing with a string. It should not surprise anyone that we call this the post-parse rewrite plugin API. As part of the package there is also a plugin we developed called Rewriter, to which I shall devote the remainder of this post. This plugin is of the latter kind, post-parse.

How Query Rewrites?

The Rewriter plugin uses a pattern matching machinery to identify the queries that should be rewritten and what they should be rewritten to. The pattern uses wildcard symbols to match constants in the query, and those symbols may be referenced in the replacement. A bit like search-and-replace with capture. I feel the easiest way to explain it is with examples, so I’ve put together a little tutorial.

Query Rewrite Tutorial

This will be a tutorial on how to get going with your query rewrites. You shouldn’t need more than a running MySQL server and a client to get going.

I don’t recommend loading the plugin by simply doing INSTALL PLUGIN. Granted, this will install the plugin for you, but it won’t let you communicate with it. In order to get the whole package you will have to run the provided install script. You should be able to just copy-paste it in your sql client. I just pipe it to the command line client:
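Something along these lines (the script name here is an assumption, mirroring the uninstall script mentioned below; use whatever name your package ships with):

shell> mysql -u root -p < install_rewriter_plugin.sql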

This will set everything up for you. If you change your mind about rewriting queries, you can just uninstall the plugin with an UNINSTALL PLUGIN command. If you want to completely cover your tracks, you can run the provided script uninstall_rewriter_plugin.sql, but be aware that this will drop all of your rewrite rules without a trace.

A simple example

Now, let’s rewrite some queries. You do this by defining rewrite rules. A rule consists of a pattern that the queries have to match and a replacement query that will replace all queries that match the pattern. So now that we’re up and running let’s write a rule and hand it to Rewriter. This being a relational database, rules are of course stored in a table. The table is called rewrite_rules and lives in the database query_rewrite. There is nothing magic about this table, it’s a table like any other.

We will start with the simplest possible example: we’ll rewrite the query SELECT constant to SELECT constant + 1. So let’s make this rule:
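The rule is just a row in that table; a sketch, using the pattern and replacement columns discussed below:

INSERT INTO query_rewrite.rewrite_rules (pattern, replacement)
VALUES ('SELECT ?', 'SELECT ? + 1');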

We have now created a rule in this table, and you may take a look at it in case you forget it later:
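For instance:

SELECT pattern, replacement FROM query_rewrite.rewrite_rules;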

There are other columns in the table as well, but let’s focus on these for now. We have now created a rewrite rule, but we haven’t told Rewriter about it yet. In order to execute rewrites with a minimal effect on performance, the rules have to be loaded into memory. This is done by calling the stored procedure flush_rewrite_rules in the same database as before:
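That is:

CALL query_rewrite.flush_rewrite_rules();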

This procedure will commit  your current transaction, just like FLUSH TABLE would, and load the rules up in Rewriter’s memory. We are now ready to rewrite some queries:
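With the rule above loaded, a session looks roughly like this (output abridged):

mysql> select 1;
+-------+
| 1 + 1 |
+-------+
|     2 |
+-------+
1 row in set, 1 warning (0.00 sec)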

Certainly not what you’d expect when writing SELECT 1! Obviously, something just happened here, so let’s note two things: First, the pattern was “SELECT ?”, yet I wrote “select 1” – lowercase – and still it obviously matched the pattern. Rewriter, being a post-parse plugin, acts on parse trees rather than strings, so as long as the query and the pattern have the same parse tree, it is considered a match. The second thing to note is that there is a warning. So what might that be? Let’s find out:
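Roughly like this (the exact warning code may differ; the note text is what matters):

mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1105
Message: Query 'select 1' rewritten to 'SELECT 1 + 1' by a query rewrite plugin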

You didn’t think Rewriter would just rewrite your queries and not tell you, did you? In fact, it will always leave this  note when it rewrites your queries. The really vigilant ones of you will also notice that the original query is spelled out exactly the way I wrote it, in lowercase. This makes the rewrites easy to track in logs.

As you saw in the above example, the rule remembers those constant values matched by the ? symbol and injects them into the replacement. The values are injected strictly left-to-right. The syntax uses the same mechanism for matching parts of queries as the prepared statement syntax does: the question mark symbol may be used to match any single literal. So in a sense patterns are like prepared statements run backwards.

Where Is My Current Database?!

Those of you that really stayed awake during the last section may actually have tried inserting that rewrite rule. You were really apt pupils, trying to be the best in your class. And what did you get for all your hard work? You probably did something like this.
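Probably something like this, a rule whose pattern references an unqualified table name (the table name is hypothetical, and the exact error text may differ):

INSERT INTO query_rewrite.rewrite_rules (pattern, replacement)
VALUES ('SELECT * FROM mytable WHERE id = ?',
        'SELECT * FROM mytable WHERE id = ? LIMIT 1');

CALL query_rewrite.flush_rewrite_rules();
-- ...and the procedure reports that at least one rule failed to load.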

Huh? Failed to load? What’s going on here, I hear you saying. Indeed you followed my example to the letter and still you ran into trouble immediately. In order to fix this, let’s gather all the information we can from Rewriter. It’s time to see the full contents of the rules table:
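For example:

SELECT * FROM query_rewrite.rewrite_rules\G
-- Among the columns you will find enabled = 'N' and a message column
-- explaining why the rule could not be loaded.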

As you can see, there’s three more columns that I didn’t tell you about: enabled, pattern_database, and message. Enabled is straightforward, the default value is Y, meaning that the rule should be loaded when you load the table. You can set it to N if you want to disable a rule but wish to keep it so you don’t forget it. In this case Rewriter disabled the rule for you because there’s a problem with it. What the problem is can be seen in the message column. Now you’re thinking “but I did select a database! I ran use mydb!” In order to explain why this happens, I will need to go into a technical detail of the loading procedure. So far I’ve tried to keep digressions into the inner workings of the plugin to a minimum, but this one you really need to know about. When you tell Rewriter to load your rules (calling flush_rewrite_rules(), remember?), what it actually does is log in with its own session. Since the current database is a session-dependent setting, Rewriter has no idea what your current session is. Besides, you may have multiple clients logging in an running queries with their own current database. So Rewriter does the sensible thing and doesn’t make any assumptions about what should be considered the current database. That’s what we use the pattern_database column for. Rewriter will use that as the current database when trying to match a query to a rule. Armed with this knowledge, let’s rectify the situation:

That’s better. This means that the rule will match when you send the query with the unqualified table name mytable and the current database is mydb. Note that the rule will not match queries referencing the qualified table name. If you want to match the query regardless of qualification (and you usually do), you will have to create two rules, one referencing mydb.mytable and one referencing mytable. This is slightly inconvenient, but it’s a price we have to pay for the performance of the pattern matching. I might cover this in more detail in a later blog post, but for now let’s accept it as a fact of life.

A Real-World Example

In this section, I will cover an actual case where the optimizer lacks complete knowledge about data distribution, and makes an optimistic choice which works best on average but is really slow in the worst case. Those who encounter a worst-case data distribution case will want to nudge the optimizer in the right direction.

In this example we have a health clinic that frequently needs to run the following query:
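The query looks something like this (the schema is a reconstruction for illustration: a visits table with indexed patientId and time columns):

SELECT * FROM visits
WHERE patientId = 3 AND illness = 'Headache'
ORDER BY time LIMIT 1;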

There are existing indexes on patientId and time.

Now, there are two ways we could go when executing this query.

  1. Do a range scan on the patientId index, then sort the result on time using a filesort and finally return the first row.
  2. Do an index scan on the time field until we find a row that matches the where clause.

Intuitively, the second approach seems to be the most efficient one in this case. Since headache is a fairly common gripe, at least among us lowly developers who spend our days hunched in front of terminals, we can assume that we will find a matching row fairly quickly. Note that the optimizer does not have information on the data distribution here, so it has to make an educated guess using general heuristics. Indeed, this is how the query is executed as of MySQL 5.6.
Now, consider if the query were:
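That is, something like:

SELECT * FROM visits
WHERE patientId = 3 AND illness = 'Acrocephalosyndactylia'
ORDER BY time LIMIT 1;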

The illness is different here (I dare you to pronounce it). In this case a matching row will be rare indeed, if there is one at all. MySQL, still going with option 2, will take 20 seconds to execute this query for the user. There are proposals on how to address this, but it’s not a clear-cut case, so for the time being we recommend adding a FORCE INDEX( patientIdIdx ) clause to the query. The problem is that in this particular case the query was automatically generated by a third-party application that the user couldn’t alter. The solution is then to create a rewrite rule for this particular case:
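A sketch of such a rule, continuing with the hypothetical schema above (and assuming the application connects with clinic as its current database):

INSERT INTO query_rewrite.rewrite_rules (pattern, pattern_database, replacement)
VALUES ('SELECT * FROM visits WHERE patientId = ? AND illness = ? ORDER BY time LIMIT 1',
        'clinic',
        'SELECT * FROM visits FORCE INDEX (patientIdIdx) WHERE patientId = ? AND illness = ? ORDER BY time LIMIT 1');
CALL query_rewrite.flush_rewrite_rules();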

This will add the needed hint to each matching query, and we’ll get a major improvement in their performance.

We look forward to your feedback on this new feature! Please let us know if you encounter any problems or have other general input. Thanks!

Acrocephalosyndactylia is a disease that affects one in 100,000 people.

Online Truncate of InnoDB UNDO Tablespaces

We have received a lot of requests from our user community regarding the ability to  truncate UNDO tablespaces (‘truncate’ here means that the size of the tablespace is reset back to what it was when it was first created). We are happy to say that we’ve now been able to implement this anticipated feature.

Introduction

The InnoDB UNDO tablespace(s) host rollback segments that hold rollback information related to database changes. This information is used to rollback a transaction and to retrieve the previous version of a record that has been updated or deleted for multi-version concurrency control (MVCC). Once a transaction is committed, InnoDB will discard the related UNDO log records. UNDO log records for updates or deletes will be kept around as long as there exists an open transaction that may access older versions of the records. When all such open transactions are committed then the associated UNDO log records can be discarded.

Currently the InnoDB purge thread(s) will try to free UNDO log pages containing old UNDO log records so that those pages can be re-used. But depending on the retention window (dictated by active transactions that need those UNDO log records) the UNDO tablespace can continue to grow in size, continuously allocating new pages. Because of that, with multiple active long running transactions the total UNDO tablespace size could continue to increase and grow considerably over time.

In order to re-claim the space and help reduce the disk foot-print of an UNDO tablespace that has grown much larger than is necessary, we have introduced a new option to truncate an UNDO tablespace.

Turning This Feature “ON”

Whether or not to support truncation of UNDO tablespaces is controlled using the innodb_undo_log_truncate option, which is turned “OFF” by default. Users can turn that “ON” and “OFF” dynamically, so as to support the feature on an as needed basis.
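For example:
mysql> SET GLOBAL innodb_undo_log_truncate = ON;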

There are a few important semantics that affect when the actual truncate operation can run:

  • Truncate will happen only if UNDO logs are configured to use UNDO tablespaces. It will not be possible to truncate UNDO log space that is stored within the system tablespace.

Condition: --innodb-undo-tablespaces > 0

  • Since the truncate is being done online we don’t want to halt server operations, so for truncate to progress we need to have at least 2 UNDO tablespaces and at least 2 REDO enabled UNDO logs configured to use these UNDO tablespaces.

Condition: --innodb-undo-tablespaces >= 2 and --innodb-undo-logs >= 35

(The value of 35 comes from how the UNDO logs (rollback segments, or rsegs) are allocated: rseg 0 is a REDO enabled rseg allocated in the system tablespace, rsegs 1-32 are non-REDO enabled rsegs allocated in the temporary tablespace, and rsegs 33 and up are REDO enabled rsegs allocated in the UNDO tablespaces.)
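Putting the conditions together, one possible configuration sketch (the values are illustrative; note that innodb_undo_tablespaces must be chosen when the data directory is initialized):

[mysqld]
innodb_undo_tablespaces = 2
innodb_undo_logs = 128
innodb_undo_log_truncate = ON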

How It Works

Provided all the above mentioned conditions are met, then InnoDB will try to do the following:

  1. Select/Mark an UNDO tablespace for truncation. This is done in a round-robin fashion to avoid biased selection.
  2. Make all the rollback segments (rsegs) residing in the selected UNDO tablespace inactive. Inactive means that these rollback segments will not be allocated to new transactions. Existing transactions that are using these rollback segments will continue to progress without being affected.
  3. The purge system will continue to free rollback segments that are no longer needed. This marks the pages allocated to rollback segments as being free and reduces the logical size of the rollback segments.
  4. Once all rollback segments residing within the marked UNDO tablespace are freed, an actual file system truncate operation is carried out on that tablespace. It is then re-initialized to the default size that was set when it was created new.
  5. The truncated rollback segments are made active again so that they can be allocated to new transactions.

Accelerating the Freeing of Rollback Segments

As mentioned previously, the purge thread will try to free rollback segments that are no longer needed. Truncation of UNDO tablespaces cannot proceed until it has freed all of the rollback segments.

Previously the purge thread used to invoke the truncation of rollback segments every 128th time a purge was run. This was done to preserve performance, since the truncation activity can be resource intensive.

This static value of 128 is now replaced with the innodb_purge_rseg_truncate_frequency option, giving users more flexibility to control how frequently the freeing of rollback segments should occur.

innodb_purge_rseg_truncate_frequency: Default: 128, Min: 1, Max: 128

For example:
mysql> SET GLOBAL innodb_purge_rseg_truncate_frequency = 64;

Users can continue to use the default setting of 128 during normal operations to maintain the overall performance. The freeing/truncation operations can be accelerated as needed by lowering this value, thus increasing the frequency with which they occur.

Minimum Size of UNDO Tablespace to Truncate

We don’t want to truncate each and every UNDO tablespace, nor should this operation result in a continuous cycle in which one tablespace after another qualifies for truncation.

To address this, an UNDO tablespace must first reach a minimum size before it qualifies for truncation, dictated by innodb_max_undo_logs_size:

innodb_max_undo_logs_size: Threshold after which UNDO tablespace will qualify for truncate.  Default: 1 GB, Min: 10 MB, Max: ULONGLONG_MAX

For example, to set the threshold to 100 MB:
mysql> SET GLOBAL innodb_max_undo_logs_size = 104857600;

Only if the UNDO tablespace size exceeds the configured threshold will it be considered for truncation. Users should tune the value according to their operating environment and needs.

Performance Impacts

With the feature turned “OFF” there should not be any performance impact whatsoever. With the feature turned “ON”, however, there can be some performance impact. Given that the truncation operations are being done online while the server continues to accept new requests, there can be a performance impact on the user transactions. Let’s understand where the potential impact comes from.

As stated above, rollback segments residing in UNDO tablespaces are marked for truncation and are made inactive. This of course then limits the related available UNDO log management resources.

For example, if you have 2 UNDO tablespaces and 128 UNDO logs, and 95 of these are redo rsegs residing in those 2 UNDO tablespaces, then taking 1 of the UNDO tablespaces “offline” in order to truncate it means that we’re effectively making 48 UNDO logs unavailable for transaction processing, thus reducing the UNDO log processing resources by half. This will of course have some impact on the transaction processing rate (TPS) while the truncation operation runs.

The exact performance impact depends on a number of factors:

  • The number of UNDO tablespaces
  • The number of UNDO logs
  • The size of the UNDO tablespace
  • The speed of the I/O subsystem
  • Whether or not any long running transactions exist
  • Whether or not the system is otherwise heavily loaded

Based on our internal testing we found that, with innodb_undo_tablespaces=8, innodb_undo_logs=128, and innodb_max_undo_logs_size=100M, at 128 threads the drop in TPS for the sysbench OLTP_RW workload is less than 5% on a server with an average IO subsystem.

Once the truncate operation completes and all of the rollback segments are made active again the performance will naturally improve again as more resources become available. The performance should in fact be slightly improved compared to before the truncation operation because UNDO log record allocation should be faster as the UNDO tablespace is smaller and less fragmented.

Summary

We now have a server option that allows you to reclaim some of the disk space used for UNDO tablespace management dynamically, without having to stop mysqld. This feature can be turned “ON” and “OFF” dynamically and on an as needed basis, for example when the overall server load is expected to be light (e.g. at 4AM every Saturday).

The related configuration options are:

  1. To enable the general feature: --innodb-undo-log-truncate
  2. To configure undo-tablespaces: --innodb-undo-tablespaces
  3. To configure undo-logs: --innodb-undo-logs
  4. To accelerate rsegs free: --innodb_purge_rseg_truncate_frequency
  5. To set size threshold: --innodb_max_undo_logs_size

We look forward to your feedback on this new feature! Please let us know if you encounter any seeming bugs, or if you have more general feedback. Thank you!