Externally Stored Fields in InnoDB

This article discusses the storage (inline and external) of field data in the InnoDB storage engine. All fields of variable length types like VARCHAR, VARBINARY, BLOB and TEXT can be stored inline within the clustered index record, or stored externally in separate BLOB pages outside of the index record (but within the same tablespace). All of these fields can be classified as large objects. These large objects are either binary large objects or character large objects. The binary large objects do not have an associated character set, whereas the character large objects do.

Within the InnoDB storage engine there is no difference between the way character large objects and binary large objects are handled. Throughout this article we will use the term “BLOB field” to refer to any of the aforementioned field types that can be chosen for external storage.

This article will provide information about the following:

  • When a BLOB field will be stored inline and when it will be stored externally, with respect to the clustered index record.
  • The structure of the BLOB reference.
  • The BLOB prefix that is stored in the clustered index record, when the BLOB is stored externally.
  • Utility gdb functions to examine the BLOB reference and the record offsets array.

The BLOB fields are associated with the clustered index records (the primary key) of a table. Only the clustered index can store a BLOB field externally. A secondary index cannot have externally stored fields. For the purposes of this article, we won’t deal with any secondary indexes.

The Schema

The following example table will be used to present the information:
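
A plausible version of the table and sample rows (assuming f2 is a BLOB column and f3 a utf8 TEXT column; the exact original definition may differ, but the field sizes match those discussed below):

    CREATE TABLE t1 (f1 INT PRIMARY KEY, f2 BLOB, f3 TEXT CHARACTER SET utf8) ENGINE=InnoDB;

    -- Each of the Tamil characters ௱ and ௲ occupies 3 bytes in utf8.
    INSERT INTO t1 VALUES (1, REPEAT('a', 3000),  REPEAT('௱', 1000));   -- f2: 3000 bytes,  f3: 3000 bytes
    INSERT INTO t1 VALUES (2, REPEAT('a', 60000), REPEAT('௲', 20000));  -- f2: 60000 bytes, f3: 60000 bytes
    INSERT INTO t1 VALUES (3, REPEAT('a', 60000), REPEAT('௱', 1500));   -- f2: 60000 bytes, f3: 4500 bytes
    INSERT INTO t1 VALUES (4, REPEAT('a', 4500),  REPEAT('௲', 20000));  -- f2: 4500 bytes,  f3: 60000 bytes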

Note: ௱ – Tamil number one hundred (Unicode 0BF1), ௲ – Tamil number one thousand (Unicode 0BF2)

A single clustered index record can have one or more externally stored BLOBs. So for the given table definition of t1, there are 4 possible ways that the BLOB fields f2 and f3 can be stored:

  1. f2 and f3 are both stored inline within the clustered index page
  2. f2 is stored inline, while f3 is stored externally
  3. f3 is stored inline, while f2 is stored externally
  4. both f2 and f3 are stored externally

In the following sections, let us see which of the BLOB columns are stored externally and which are stored inline, for each of the sample rows created above. Note that the row format of table t1 is not explicitly specified; in MySQL 5.6, it defaults to the COMPACT row format. Please keep this in mind as we discuss the example.

Overview of BLOB Storage

The BLOB data can be stored inline in the clustered index record, or it can be stored externally in separate BLOB pages. These external BLOB pages are allocated from the same tablespace in which the clustered index resides. BLOB data is always stored inline whenever possible; only when the record would otherwise exceed the maximum allowed record size is a BLOB field moved to external storage. This is true for all of the current row formats: REDUNDANT, COMPACT, DYNAMIC, and COMPRESSED. Let’s now take a look at the storage details for the BLOB columns in our example table.

In MySQL 5.6, the default row format of a table is COMPACT, so that’s what our t1 table is using. The default page size is 16K, so that’s also what we’re using. The maximum record size that can be stored in a 16K page using the COMPACT row format is 8126 bytes. The function page_get_free_space_of_empty() returns the total free space available in an empty page; dividing that value by 2 gives the maximum allowed record size. The division by 2 is required because an index page must contain a minimum of 2 records. (When calling this function, an argument value of 1 tells it that the row format of the page is COMPACT.)

The following table shows the storage details of the BLOB columns for each row in table t1. Keeping in mind the maximum allowed record size, it is clear as to why the 60000 bytes of BLOB data is always stored externally. It simply will not fit within a single clustered index record.

Primary Key | Length of f2 | Storage of f2 | Length of f3 | Storage of f3
1           | 3000 bytes   | Inline        | 3000 bytes   | Inline
2           | 60000 bytes  | External      | 60000 bytes  | External
3           | 60000 bytes  | External      | 4500 bytes   | Inline
4           | 4500 bytes   | Inline        | 60000 bytes  | External

Inline Storage of BLOBs

As mentioned previously, no BLOB fields will be externally stored if the size of the record is less than the maximum record size allowed in a page. In our example table, for the row with a primary key value of 1, no BLOB fields need to be stored externally because the full record size is less than 8126 bytes. The following tables give the sizing details for each of the rows in our example table:

Primary Key | Clustered Index Record Size (in bytes) | Fields Moved Out | Record Size After Moving BLOBs Outside (in bytes) | Maximum Allowed Record Size (in bytes)
1           | 6027                                   | -                | 6027                                              | 8126
2           | 120027                                 | f2, f3           | 1603                                              | 8126
3           | 64527                                  | f2               | 5315                                              | 8126
4           | 64527                                  | f3               | 5315                                              | 8126

As we can see, the BLOB fields are stored externally until the record size falls below the limit. In the table above, column 2 gives the initial clustered index record size. If this size is greater than the allowed maximum size of the record (shown in column 5), then the function dtuple_convert_big_rec() is invoked to choose the fields destined for external storage. Column 3 lists the fields that have been chosen by this function for external storage. Column 4 shows the clustered index record size after moving the chosen fields to external storage. Again, this value must be less than the maximum record size, shown in column 5 (which is 8126 bytes in our example).

All of the size details provided above are obtained through the debugger from the callers of this function.

For more clarity, let me explain the clustered index record length for the first row with a primary key value of 1  (shown as 6027 bytes). The length of the user fields f1, f2, and f3 are 4 bytes, 3000 bytes, and 3000 bytes respectively. The length of the system fields DB_ROLL_PTR and DB_TRX_ID are 7 bytes and 6 bytes respectively. The record header stores the length of the f2 and f3 fields, taking 4 bytes (2 bytes for each field). The record header also contains a null bit array, which for this record takes up 1 byte. Lastly, the record header contains REC_N_NEW_EXTRA_BYTES—which is 5 bytes for the COMPACT row format—of additional information. The complete storage details for the final record are presented in the following table:

Field                 | Length (in bytes)
Length of f3          | 2
Length of f2          | 2
Null bit array        | 1
REC_N_NEW_EXTRA_BYTES | 5
f1                    | 4
DB_TRX_ID             | 6
DB_ROLL_PTR           | 7
f2                    | 3000
f3                    | 3000
Total                 | 6027

You can refer to the documentation in storage/innobase/rem/rem0rec.cc for more details about the COMPACT and REDUNDANT row formats. The REDUNDANT row format is also explained in the blog article InnoDB REDUNDANT Row Format.

Choosing Fields for External Storage

As discussed above, the function dtuple_convert_big_rec() is invoked to decide which parts of the oversized clustered index record will be chosen for external storage. This function makes use of the following rules to decide this:

  • No fixed length fields can be chosen for external storage.
  • No variable length fields whose size is less than or equal to 40 (2 * BTR_EXTERN_FIELD_REF_SIZE) bytes will be chosen for external storage.
  • No variable length fields whose size is less than the BLOB prefix size will be chosen for external storage.  This means that in the case of REDUNDANT and COMPACT row formats, if the field data length is less than or equal to 768 bytes (DICT_ANTELOPE_MAX_INDEX_COL_LEN), then it will not be chosen for external storage. This rule is not applicable for DYNAMIC and COMPRESSED row formats, because their BLOB prefix is 0 bytes.

In the function dtuple_convert_big_rec(), we examine one BLOB field at a time for potential external storage, and then move it to external storage if it passes the criteria noted above, until the clustered index record size falls within the maximum allowed. Larger fields will be selected for external storage before smaller fields, to ensure that maximum space savings happens in the clustered index page. This ensures that more records can be stored in each clustered index page.

BLOB Reference

When a BLOB field is stored externally, a BLOB reference is stored in the clustered index record. The BLOB reference will be stored after the BLOB prefix, if any. This BLOB reference is 20 bytes, and it contains the following information:

  1. The space identifier (4 bytes)
  2. The page number where the first page of the BLOB is stored (4 bytes)
  3. The offset of the BLOB header within that page (4 bytes)
  4. The total size of the BLOB data (8 bytes)

Even though 8 bytes are available to store the total size of the BLOB data, only the last 4 bytes are actually used. This means that within InnoDB, the maximum size of a single BLOB field is currently 4GB.

Structure of InnoDB BLOB Reference

In the length field, two bits are used to store the ownership and inheritance information, which are not discussed in this article. We will cover that in a subsequent blog article. The most significant bit of the length field is used to store ownership information and the second most significant bit is used to store the inheritance information.

Here is a gdb function to print the contents of a BLOB reference. This function takes a pointer to the external BLOB reference as an argument. The calls to ntohl() are required because all data on disk is stored in network byte order by InnoDB.

BLOB Prefix

When a BLOB field is stored externally, we may also store a prefix of the value in the clustered index record, depending on the row format used. For the REDUNDANT and COMPACT row formats, a BLOB prefix of 768 bytes is stored in the clustered index record. For the DYNAMIC and COMPRESSED row formats, a BLOB prefix is never stored in the clustered index record. The BLOB prefix would be followed by the BLOB reference.

The BLOB prefix, when available, helps to calculate the secondary index key without needing to fetch the externally stored BLOB data (which involves at least one extra page load/fetch). This is possible because the maximum length of a secondary index key is 767 bytes. If we attempt to create a secondary index with a bigger length it will be automatically truncated with a warning. For example, consider the following statement:
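
A plausible form of such a statement, assuming a secondary index named s1 defined on a prefix of f2 longer than 767 bytes:

    CREATE INDEX s1 ON t1 (f2(1000));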

This will generate the following warning to the user: “Specified key was too long; max key length is 767 bytes”. The resulting index s1 will be built on a prefix of f2, covering only the first 767 bytes.

External Storage of BLOBs

A BLOB field is considered to be stored externally if it is stored outside of the clustered index B-tree, and just a reference to the BLOB (along with any BLOB prefix) is stored in the clustered index record. This means that the size and structure of the clustered index B-tree will not be affected by the externally stored BLOB data.

For the row in our sample table with a primary key value of 2, the BLOB fields f2 and f3 will be stored externally from the clustered index record. This is because their size is too big to fit within the clustered index record. Instead, a reference to the externally stored BLOB data will be stored in the clustered index record. A pictorial view of this external storage is shown below:

[Diagram: Externally Stored Fields in InnoDB]

Note: The diagram above shows each clustered index record having exactly one externally stored BLOB. This is just a simplification. Each clustered index record can have any number of externally stored BLOBs.

Let’s verify that the BLOB fields for the row with a primary key value of 2 are indeed externally stored with the help of the debugger. Run the server through a debugger (we used gdb) and issue the query SELECT * FROM t1 WHERE f1 = 2. Put a breakpoint in the function row_search_mvcc(), which searches for rows using a cursor. Within this function, once the offsets are calculated for the selected record, let’s print the offsets to verify which of the columns are stored externally. I’ll use the gdb function ib_print_offsets (provided in the appendix) to examine the offsets array. The (*) in the field output shows that it is in fact stored externally.

The above output shows that the record has 5 fields (3 fields are specified by the user, and 2 are internal system fields, namely DB_TRX_ID and DB_ROLL_PTR). Field 0 begins at offset 0 and ends at offset 4. Field 1 begins at offset 4 and ends at offset 10. Field 2 begins at offset 10 and ends at offset 17. Field 3, whose data is stored externally, begins at offset 17 and ends at offset 805. Field 4, whose data is also stored externally, begins at offset 805 and ends at offset 1593. This information is presented in tabular format below.

Field Number | Description                | External | Begin Offset | End Offset | Length
0            | The field f1 (Primary Key) | No       | 0            | 4          | 4
1            | DB_TRX_ID                  | No       | 4            | 10         | 6
2            | DB_ROLL_PTR                | No       | 10           | 17         | 7
3            | The field f2               | Yes      | 17           | 805        | 788 (768 bytes blob prefix + 20 bytes of blob reference)
4            | The field f3               | Yes      | 805          | 1593       | 788 (768 bytes blob prefix + 20 bytes of blob reference)

The length of the fields f2 and f3 are each 788 bytes. This includes the 768 bytes of the BLOB prefix, and the 20 bytes for the external BLOB reference. The gdb function that I used to examine the offsets array is provided in the appendix.

The BLOB reference can be examined by using the gdb function ib_print_blob_ref (which we provided earlier). It takes the BLOB reference as an argument. The BLOB reference of field f2 is available at rec+805-20, where 805 is the end offset of f2 and 20 is the size of the BLOB reference. The BLOB reference of field f3 is available at rec+1593-20, where 1593 is the end offset of f3 and 20 is the size of the BLOB reference. The output is provided below:

The length of the externally stored part of the BLOB is 59232 bytes. The BLOB prefix is 768 bytes. So the total length of the field is 59232 + 768 = 60000 bytes, which is the size of both the f2 and f3 fields.

Conclusion

In this article we saw when the BLOB data will be stored inline, and when it will be stored externally. We also saw the rules applied in choosing particular field data for external storage. In a subsequent blog article, we will cover further details regarding BLOB handling within InnoDB.


Appendix: A gdb Function to Examine the Offsets Array

Within InnoDB, to manipulate and access the various fields of a record, an offsets array is calculated by the function rec_get_offsets(). This function returns an offsets array of type ulint*. This array contains the following pieces of information:

  1. The number of fields in the index.
  2. Whether the record has any externally stored fields.
  3. The offset at which each of the fields begin and end.
  4. Whether the row format is COMPACT or REDUNDANT. Note that DYNAMIC and COMPRESSED row formats are a special type of the COMPACT row format. In other words, a DYNAMIC row format is a COMPACT row format plus additional rules. And the COMPRESSED row format is a DYNAMIC row format with additional rules.
  5. In a debug build, it also contains the rec pointer and the index pointer.

Here is the gdb function referenced earlier that can be used to examine an offsets array in a debug build of MySQL:

Authors

This article is jointly authored by Annamalai Gurusami (கு அண்ணாமலை), Aditya and Thirunarayanan Balathandayuthapani of the MySQL Server Sustaining Team, Bengaluru.

I would also like to thank Marko Mäkelä and Matt Lord for their reviews and support. Their feedback improved this article to a great extent.

 

Report on MySQL User Camp, Bangalore on 20th June, 2014

The MySQL User Camp in Bangalore, held on 20th June 2014, started as planned. It was a good gathering, with users from companies such as Flipkart, CTS, CGI, Yahoo, Onze Technologies, Webyog, and a few other startups. This time we also had two MySQL customers from the UID Aadhaar project managed by HCL.

The event started on schedule with a welcome speech by Srinivasarao Ravuri (Senior Manager, Software Development at MySQL India). He thanked all the attendees for coming, gave a brief overview of the agenda, and then handed the mic over to the speakers for their presentations.


The welcome was followed by a talk on “MySQL 5.7 New Features and NoSQL Support in MySQL” by Sudipto Sahoo (Senior Engineer at MySQL India). He gave a good summary of the new features in MySQL 5.7 DMR 4.


Some of the key points he spoke about were:

1. Improved InnoDB Online Alter Table
2. Parallel “Dirty Page” Flushing
3. Partitions – support for Transportable Tablespaces (TTS)
4. New Separate tablespace for temporary tables
5. Improved MDL locking

His presentation included good, intuitive graphs of Sysbench benchmarks, clearly showing strong performance in the Point Select test. The results showed that MySQL 5.7 is 2x faster than MySQL 5.6 and over 3x faster than MySQL 5.5.

He finished his presentation with some good examples of the NoSQL functionality in MySQL.

For more details you can also refer to the 5.7.1, 5.7.2, 5.7.3, 5.7.4 blog posts written by Geir Høydalsvik.

Tea Break

After this we had a quick round of introductions from all the attendees, followed by a short break for tea and snacks arranged at the venue. We had a great variety of users, from MySQL enthusiasts new to MySQL all the way to advanced MySQL DBAs. It was nice to see the users having open discussions with the MySQL developers over a cup of tea.



The interactions went a little longer than anticipated, and the attendees were then requested to gather again for the next presentation.

The next presentation was on MySQL Fabric, by Narayanan Venkateswaran (Software Developer in the Fabric team). It was easy to sense, from the number of questions asked, that users both outside and within MySQL were really keen on the topic. All the questions were answered with good examples by the speaker. Users raised many use cases; answers were given for most of them, and where the work is still in progress the best available workarounds were suggested.


Some of the key points he spoke about were:

1. Sharding Architecture
2. Managing a Sharded Database with some good examples.
3. MySQL Fabric Architecture

The event finished a little late. Some MySQL goodies were distributed to the attendees and feedback forms were collected. The discussions among the users continued for a while even after that.


Feedback from attendees:
Looking through the feedback forms, it was very good to see such a positive response. People also suggested topics to include in the next talks, such as replication, optimization, database tuning, and best practices. We will definitely take their suggestions into consideration.

Presentations at

Photos and Videos:

Courtesy:
Sayantan Dutta (Senior Engineer at MySQL India)
Vin Perothas (Senior Engineer at MySQL India)

Follow us for more information on upcoming User Camp events in Bangalore at:

Meetup : MySQL-User-Camp-Bangalore

Facebook Group : MySQL User Camp

Linked-in Group : MySQL India

 

Improving Innochecksum

In WL#6045 we improved the innochecksum tool so that it opens the tablespace file and validates the checksum for each page. It compares the calculated checksum to the stored checksum in the page and reports mismatches, if any. It can also now rewrite page checksums in the tablespace. It is an offline tool, however, so make sure that the MySQL server is not using the tablespace files when running the innochecksum tool.

When InnoDB encounters a page checksum mismatch, it will cause an ASSERT and perform a forced shutdown. So rather than wait for InnoDB to encounter this in production, it is better to use the innochecksum tool to validate and correct any checksum issues ahead of time. You could perform these checks on your MySQL Enterprise Backup files after a backup job completes, as an example.

Again, the innochecksum tool cannot be used on tablespace files that the server already has open. When the MySQL server is running, you can instead use the CHECK TABLE command.
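
For example, assuming a table test.t1:

    CHECK TABLE test.t1;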

Additional new features introduced in MySQL 5.7 for this tool include:

  • Support for file sizes greater than 2GB.
  • You can specify the number of checksum mismatches allowed before terminating the program, using the new --allow-mismatches option.
  • You can specify the checksum algorithm to use with the new --strict-check option.
  • You can tell the tool to rewrite checksums in the tablespace file when mismatches are detected, using the new --write option.
  • You can log the output, including the calculated checksum values for each page, using the new --log option.
  • You can list out page type summaries, using the new --page-type-summary option.
  • You can print the details for each page to standard output (STDOUT) or standard error (STDERR), using the new --page-type-dump option.
  • You can read data from standard input (STDIN), using the new - option.
  • innochecksum can now operate on multiple user-defined tablespace files.
  • innochecksum can now operate on multiple system tablespace files.

Conclusion
This post provided a brief introduction to the new features recently introduced in the innochecksum tool. For more information on the tool, please see the manual here.

The MySQL 5.7.2 Milestone Release

As promised, here is the belated post covering the MySQL 5.7.2 Milestone Release, released on September 21, 2013. You can find the full list of changes and bug fixes in the 5.7.2 Release Notes. Enjoy!

InnoDB Read-Only Scalability

One of the goals of 5.7 is to improve performance for Read-Only (RO) and Read-Mostly workloads. In 5.7.2 Sunny Bains made several improvements (see below) and we reached 500K QPS performance levels in the OLTP_RO Point-Selects 8-tables benchmark.  See also Dimitri Kravtchuk’s blog post.

Do Not Allocate trx id for Read-Only Transactions (WL#6047) — This work by Sunny Bains speeds up READ ONLY transactions under heavy load. InnoDB now auto-detects READ ONLY transactions, and thus removes the need to be running in autocommit mode or using START TRANSACTION READ ONLY to benefit from this optimization.

Reduce lock_sys_t::mutex Contention When Converting Implicit Locks to Explicit Locks (WL#6899) — This work by Sunny Bains significantly increases scalability for RO transactions in 5.7 by avoiding scanning the trx_sys_t::rw_trx_list for trx_id_t.

Improve Locality of Reference by Allocating trx_t in Blocks (WL#6906) — This work by Sunny Bains allocates memory for trx_t instances in configurable sized blocks that are a multiple of sizeof(trx_t). This reduces the cost when iterating over the transactions. This work contributes to the impressive scalability of RO transactions in 5.7.

Optimize Read View Creation (WL#6578) — This work by Sunny Bains reduces the cost of read view creation. The multi-version concurrency control (MVCC) in InnoDB requires that each MVCC-using transaction be assigned a read view. The read view is created by traversing the trx_sys->rw_trx_list, which is a linked list of active read-write transactions. This change is required to improve InnoDB performance both for RO and RW workloads.

InnoDB Read-Write Scalability

One of the goals of 5.7 is to also improve the performance of Read-Write (RW) workloads. In 5.7.2 Yasufumi Kinoshita removed the index->lock contention (see below).

Implement SX-lock (WL#6363) — This work by Yasufumi Kinoshita introduces a new rw-lock called SX-lock (shared exclusive) for accessing internal InnoDB resources.  This new lock paves the way for additional concurrency and scalability improvements.

Fix index->lock Contention  (WL#6326) — This work by Yasufumi Kinoshita removes index->lock contention and improves scalability of RW workloads. The index->lock that was used to protect the entire index tree structure is now replaced by more fine grained block->locks in the tree.

InnoDB Faster Flushing

Optimize Buffer Pool List Scans and Related Batch Processing Code (WL#7047) — This work by Inaam Rana reduces the number of pages scanned when doing flush list batches, speeding up page flushing. The time complexity of a scan is reduced from O(n*n) to O(n).

InnoDB Temporary Table Performance

One of the goals of 5.7 is to optimize InnoDB temp tables for better performance.  With the following work done in 5.7.2, it is about 11x faster for CREATE/DROP temp table statements and about 2-4x faster for INSERT/DELETE/UPDATE on temp tables, as compared with 5.6.

Optimize DML for Temp Tables (WL#6470) — This work by Krunal Bauskar improves InnoDB temp table DML performance (INSERT/UPDATE/DELETE). This is achieved by removing unnecessary UNDO and REDO logging, change buffering, and locks.

UNDO Logs for Temp Tables Should Reside in a Separate Temp Tablespace (WL#6915) — This work by Krunal Bauskar adds an additional type of UNDO log, one that is not REDO logged and resides in a new separate temp tablespace. These non-redo-logged UNDO logs are not required during recovery and are only used for rollback operations.

Speeding up Connection Handling

In some application scenarios (e.g. PHP applications) client connections have very short life spans, perhaps only executing a single query. This means that the time spent processing connects and disconnects can have a large impact on the overall performance. In 5.7.2 (WL#6606) we improved the number of connect/disconnect cycles handled per second from 21K to 35K (+63%), as measured by using mysql-bench with 25 concurrent client threads executing a loop of connect/disconnect cycles for 100,000 iterations each.

Offload THD Initialization and Network Initialization to Worker Thread (WL#6606) — This work by Thayumanavar Sachithanantha maximizes the number of connections that can be handled per unit of time. Initialization of a THD and its vio/net previously happened in the acceptor thread that accepted the connection. THD and network initialization involves acquiring locks, memory allocation of various structures, and system calls which are compute-bound, as well as tasks that may block. The acceptor thread is now basically an event loop that waits for new connection events from clients. To maximize the number of connections handled per unit of time, the acceptor thread spends as much of its time as possible listening for new connections, while THD initialization is offloaded to worker threads that then handle the initialization and the resulting client connections. See also Jon Olav Hauglid’s blog post.

Update_time For InnoDB Tables

Implement Update_time for InnoDB Tables (WL#6658) — This work by Vasil Dimov implements in-memory maintenance of update_time for InnoDB tables. For now, this work has the limitation that the values will be lost when the server is restarted or the table evicted from the InnoDB data dictionary cache. We would like to make update_time persistent as follow-up work, but the timeline for this has not been decided. The column INFORMATION_SCHEMA.TABLES.update_time will now contain the timestamps of the last update (or insert or delete). This functionality was previously missing in InnoDB and people have been asking for it, see Bug#2681 reported by Phil Sladen.

InnoDB TRUNCATE TABLE Statement Becomes Atomic

Make Internal InnoDB TRUNCATE TABLE Statement Atomic for Single Table Tablespaces  (WL#6501) — This work by Krunal Bauskar makes the internal InnoDB TRUNCATE TABLE statement atomic by reinitializing the original tablespace header with the same space id and then physically truncating its .ibd file during the truncation of a single table tablespace.

InnoDB Buffer Pool Dump and Load Enhancements

InnoDB Buffer Pool Dump and Load Enhancements (WL#6504) — This work by Vasil Dimov improves both dump and load scenarios. It is now possible to dump only the hottest N% of the pages from each buffer pool. This is controlled by a new variable innodb_buffer_pool_dump_pct, an integer from 1 to 100 with a default value of 100 (dump everything). The load operation is also made less disruptive to the user workload, because the load now happens in the background while the server keeps serving clients, while also attempting not to be too aggressive and not to take too much IO capacity away from servicing new client requests.

Innochecksum Tool

Improve Innochecksum (WL#6045) — This work by Anil Toshniwal significantly extends the innochecksum utility’s functionality. It is now possible to specify the checksum algorithm (innodb/crc32/none), rewrite the current checksum using the specified algorithm, rewrite the checksum even if the current checksum is invalid, and specify the maximum checksum mismatch allowed before terminating the program. Innochecksum can also now operate on multiple tablespace files and on multiple files in the same tablespace.

MySQL Client

Client Side Protocol Tracing (WL#6226) — This work by Rafal Somla creates hooks inside the client library code which allow tracing of protocol events such as sending packets to the server, receiving server replies, and authentication handshakes. This provides a mechanism for collecting performance data about the client-server connections, from the client’s perspective.

Error Reporting

Most Statements Should Clear the Diagnostic Area (WL#5928) — This work by Tatjana Nurnberg makes MySQL follow the SQL standard with respect to clearing the diagnostic area. This means that 1) We clear the diagnostic area for non-diagnostic statements that do not use any tables 2) Clearing the diagnostic area is done at a single well-defined point during statement execution. This work fixes bugs like Bug#35296, Bug#43012, and Bug#49634.

Error Logging

Allow Control of Verbosity (WL#6661) — This work by Tatjana Nurnberg gives the DBA control of how “verbose” the MySQL Server should be (error/warning/note) when writing to the error log. This work also changes the timestamp format printed to a more standard format (syslog) and converts the existing fprintf(stderr, …) in the server layer to use the new internal error logging API.

Triggers

Multiple Triggers Per Table (WL#3253) — This work by Dimitry Shulga provides the ability to have more than one trigger for every action (INSERT, UPDATE, DELETE) and timing (BEFORE or AFTER). This is in line with the SQL standard. See also Dimitry Shulga’s blog post.

Semi-sync Replication

Externalize Transactions Only after ACK is Received (WL#6355) — This work by Libing Song allows for true lossless failovers when using semi-sync replication. If the master crashes, the slave is still ensured to be up to date. The implementation makes the master wait for the ACK after preparing within the storage engine and writing to the binary log, but before committing to the storage engine. See also Libing Song’s blog post.

Multi-threaded Slaves (MTS)

Intra-schema Parallel Slave Execution (WL#6314) — This work by Rohit Kalhans implements intra-schema multi-threaded slaves. With this implementation the slave will be able to apply transactions in parallel, even within a single database or schema, as long as they have a disjoint read and write set. See also Rohit’s blog post.

Performance Schema

Instrument Memory Usage (WL#3249) — This work by Marc Alff instruments memory usage in the server layer. Instrumentation is added for more than 200 memory types. Memory usage statistics are aggregated by type of memory used (caches, internal buffers, etc.) and by the thread/account/user/host indirectly performing the memory operation. Attributes include memory used (bytes), operations counts, and high/low water marks.

Stored Program Instrumentation (WL#5766) — This work by Mayank Prasad extends the statement instrumentation to cover Stored Programs which completes the picture with COM Commands and SQL Queries done in 5.6. New instruments are added for “statement/sp” cclose/cfetch/copen and for “statement/scheduler” event. The new object types are EVENT, FUNCTION, PROCEDURE, and TRIGGER. A new summary table “events_statements_summary_by_program” has been added.

Performance Schema table for SHOW SLAVE STATUS (WL#3656) — This work by Shivji Jha implements Performance Schema tables for the information returned by SHOW SLAVE STATUS. This approach has the following benefits: 1) The information on slave status can be searched with ad hoc queries 2) Pieces of information can be assigned to variables, and thus used in stored procedures 3) We can get rid of a few old fields to present only relevant information and avoid exposing too many internals 4) We can structure the information better, to align it with multi-source replication and multi-threaded slaves. See also Bug#50316, Bug#56414, and Bug#35994.

Security

Convert mysql_secure_installation Script to C (WL#6441) — This work by Vamsikrishna Bhagi converts the script to a C++ program, so that the program can connect to the server directly and execute the specified commands using the C API (libmysql). This removes the need for storing the user supplied password in a temporary option file.

Make Plugin Column in mysql.user Non-NULL (WL#6982) — This work by Vamsikrishna Bhagi ensures that the same rules apply to the native authentication methods as to the other authentication methods. In 5.5 we introduced a backward compatible layer to handle pre-4.1 password hashing. The pre-4.1 password hash has since been deprecated in 5.6, and this work simplifies the authentication code by removing special cases.

Optimizer

Explain For Other Thread (WL#6369) — This work by Evgeny Potemkin implements EXPLAIN for running queries. For example, if you are running a statement in one session that is taking a long time to complete, using EXPLAIN FOR CONNECTION in another session may yield useful information about the cause of the delay and help you optimize the statement. connection_id is the connection identifier, as obtained from the INFORMATION_SCHEMA PROCESSLIST table or the SHOW PROCESSLIST statement. If you have the PROCESS privilege, you can specify the identifier for any connection. Otherwise, you can specify the identifier only for your own connections. (Syntax: EXPLAIN [options] FOR CONNECTION connection_id;).
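
For example (the connection id 123 here is hypothetical; obtain the real one from SHOW PROCESSLIST or the INFORMATION_SCHEMA PROCESSLIST table):

    SHOW PROCESSLIST;                        -- note the Id of the long-running connection
    EXPLAIN FORMAT=JSON FOR CONNECTION 123;  -- explain the statement that connection is currently executing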

Print Costs and Amount of Data Processed in JSON based EXPLAIN (WL#6510) — This work by Evgeny Potemkin enhances the JSON EXPLAIN output by printing the total query cost, the cost per table, and the amount of data processed. This will make it easier for a user to see the difference between good and bad plans.

Refactoring

InnoDB: Policy-based Mutex (WL#6044) — This work by Sunny Bains refactors the InnoDB mutex implementation and makes it easy to add new policies and to customize mutex usage.

InnoDB: Use DBUG_PRINT for Tracing (WL#6947) — This work by Marko Mäkelä replaces compilation time flags with DBUG labels. This enables the use of the DBUG facility within InnoDB, so that the printouts from various InnoDB subsystems can be enabled by the --debug command line parameter.

InnoDB: Get Rid of the IB_OFFSETOF() Macro and Allow Non-POD Types in UT_LIST_* (WL#6456) — This work by Sunny Bains is pure internal code refactoring, adapting related InnoDB code to C++.

New Source File Structure for Authentication and Authorization Code (WL#6960) — This work by Harin Vadodaria simplifies existing code maintenance by splitting the large sql_acl.cc file into smaller files, grouped by their logical operations. No functional changes are introduced.

Replication: Refactor the Replication Dump Thread (WL#5721) — This work by Libing Song simplifies the design of the dump thread. The dump thread now only holds a lock for a very short time, while reading the position up to which the binary log has been written.

Optimizer: Improve Building of Classes That Represent DML Statements (WL#4772) — This work by Roy Lyseng refactors objects in the Item class hierarchy. DML statements (SELECT/UPDATE/DELETE/INSERT) are represented by objects from the Item class hierarchy and by st_select_lex and st_select_lex_unit objects. This WL fixes a problem regarding how these object hierarchies were built.

Optimizer: Move LooseScan Out of best_access_path() (WL#6739) — This work by Guilhem Bichot removes the cost estimation for LooseScan from best_access_path() and calls it only when we know we have a potential LooseScan for the first table. The benefit of this is a clean separation of concerns, a lighter best_access_path() implementation (less code and less structures on its stack, which is important as this function is called frequently), and code that’s easier to read and maintain.

Test Suite Improvements

Additional Concurrency Tests for BGC (WL#6969) — This work by Daogang Qu adds concurrency tests to verify that there are no deadlocks in group commit, non-group commit, and semi-sync areas.

Migrate Perfschema MTR Test Suite to Run with the InnoDB Storage Engine (WL#6733) — This work by Akhila Maddukuri adapts the performance schema test suite to be run with InnoDB as the default storage engine.

Migrate Main Test Suite to Run with the InnoDB Storage Engine (WL#6934) — This work by Vinay Fisrekar makes it possible to suppress result differences due to engine value alone, thus allowing us to run with a default engine of InnoDB or MyISAM.

Migrate funcs_2 Suite to Run with InnoDB (WL#6983) — This work by Viswanatham Gudipati makes all the testcases in the suite “funcs_2” run with the InnoDB engine.

Migrate jp,large_tests and Stress Suite Run with InnoDB (WL#6992) — This work by Viswanatham Gudipati makes all the testcases in the suites “jp”, “large_test”, and “stress” run with InnoDB engine.

Migrate funcs_1 Suite to Run with the InnoDB Engine (WL#6924) — This work by Matthias Leich makes the “funcs_1” suite run with the InnoDB engine instead of the MyISAM engine in the cases where MyISAM specific properties are not the focus of the test.

Migrate opt_trace Suite to Run with InnoDB (WL#6925) — This work by Viswanatham Gudipati migrates the “opt_trace” suite to run with the InnoDB engine.

Fix ctype_* Tests to Run with a Default Engine of InnoDB (WL#7122) — This work by Vinay Fisrekar migrates tests which can be fixed easily as they have a result difference and/or warning which are expected (and can be suppressed by adding commands) due to engine differences.

Migrate Federated Test Suite to Run with InnoDB (WL#7112) — This work by Akhila Maddukuri migrates the tests in the “federated” suite to run with InnoDB.

For the migration of MTR test suites to InnoDB, see also Anitha’s blog post.

Deprecation & Removal

Deprecate ENCODE()/DECODE() in 5.7 (WL#6984) — This work by Georgi Kodinov triggers a warning (ER_WARN_DEPRECATED_SYNTAX) upon the execution of the ENCODE() or DECODE() functions.

Remove Unused –basedir and –datadir Options from mysql_upgrade (WL#7010) — This work by Georgi Kodinov removes these unused options.

Deprecate Unique Option Prefixes (WL#6978) — This work by Georgi Kodinov deprecates unique option prefixes and only accepts full option names. The old behavior has been found to be confusing and not future proof. For example, the --compress option could be given to mysqldump as --compr, but not as --comp, because the latter was ambiguous.

Deprecate the INFORMATION_SCHEMA.PROFILING Table (WL#6802) — This work by Georgi Kodinov completes the deprecation of PROFILING. In 5.6 we deprecated SHOW PROFILE in favor of the Performance Schema instrumentation. See also Bug#67525, reported by Peter Laursen.

The MySQL 5.7.1 Milestone Release

Due to all of the positive feedback that I received on the MySQL 5.7.3 & 5.7.4 blog posts, I have decided to complete the picture by doing the same for 5.7.1 & 5.7.2. Note that you will find historical releases — including milestone releases — at downloads.mysql.com.

So here is the belated post for the MySQL 5.7.1 Milestone Release, released on April 22, 2013. You can find the full list of changes and bug fixes in the 5.7.1 Release Notes. Enjoy!

InnoDB Online Alter Table

Online Rename Index This work by Marko Mäkelä (WL#6752) and Dmitry Lenev (WL#6555) makes it possible to rename an index as an online operation. Example: mysql> ALTER TABLE t RENAME INDEX i1 TO i2;

Enlarge VARCHAR column size online (WL#6554) This work by Marko Mäkelä makes it possible to enlarge VARCHAR column sizes as an online operation. This is true as long as the number of bytes needed to store the length of the VARCHAR value remains the same, i.e. the size stays within the range of 0 to 255 bytes, or within the range of 256 bytes and higher. Example: mysql> ALTER TABLE t1 ALGORITHM=INPLACE, CHANGE COLUMN c1 c1 VARCHAR(255);

InnoDB Temporary Table Performance

One of the goals of 5.7 is to optimize InnoDB temp tables for better performance. As a first step toward this goal, we made temp table creation and removal a more light-weight operation by avoiding the unnecessary step of persisting temp table metadata to disk.

Separate tablespace for temp tables (WL#6560). This work by Krunal Bauskar moves temp tables to a separate tablespace so that the recovery process for temp tables becomes a single stateless step by simply re-creating it at start-up. The new separate temp tablespace is named ibtmp1 and resides in the data directory.

Optimize DDL for temp tables (WL#6469) This work by Krunal Bauskar removes unnecessary persistence for temp tables. Temp tables are only visible within the connection/session in which they were created, and they are bound by the lifetime of the server (the temp tablespace is re-created at start-up). Thus, for statements such as CREATE TEMPORARY TABLE we will not make the table definitions persistent. This will eliminate unnecessary disk IO during the table create/drop phases.

InnoDB Spatial Index

One of the goals for 5.7 is to implement spatial indexing in InnoDB. To achieve this InnoDB needs to support GEOMETRY datatypes, so that InnoDB understands and stores them properly.

InnoDB: GEOMETRY datatypes support (WL#6455) This work by Zheng Lai adds a new InnoDB internal datatype called DATA_GEOMETRY, and maps all MySQL GEOMETRY datatypes to this new internal datatype. InnoDB continues to use BLOBs for the underlying storage, but the columns are marked as the proper datatype. Since InnoDB uses BLOBs for the underlying datatype, the prefix index on the GEOMETRY data column can still be used. Also, for the same reason, nothing special needs to be done for recovery related processes.

Partitioning

Add limited SQL HANDLER support for partitioning  (WL#6497) This work by Mattias Jonsson adds step by step index reads for the SQL HANDLER statements. The purpose is to improve the test coverage of index reads in the partition engine.

Triggers

BEFORE triggers are not processed for NOT NULL columns (WL#6030) This work by Dmitry Shulga ensures that we check column constraints at the end of the SQL statement. This is in compliance with the SQL standard. In 5.6 and earlier, MySQL checks the column constraints too soon. For example, if a column was NOT NULL, one could not INSERT a NULL in the original statement (or UPDATE it to NULL), even though an associated trigger later set the column to a NON-NULL value (thus satisfying the NOT NULL column requirement for the final row that is created). See also Bug#6295 reported by Peter Gulutzan and Dmitry Shulga’s blog post  about his implementation.
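
A minimal sketch of the behavior change, using a hypothetical table and trigger:

    CREATE TABLE t (a INT NOT NULL);
    CREATE TRIGGER t_bi BEFORE INSERT ON t
      FOR EACH ROW SET NEW.a = IFNULL(NEW.a, 0);

    -- In 5.7 this succeeds, because the NOT NULL constraint is checked only after the
    -- BEFORE trigger has replaced the NULL; in 5.6 and earlier the statement was
    -- rejected before the trigger could run.
    INSERT INTO t VALUES (NULL);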

Error Reporting

Stacked Diagnostic areas (WL#6406) This work by Jon Olav Hauglid implements support for stacked diagnostic areas, according to the SQL standard. This includes updating the current implementation of HANDLER statements, SIGNAL/RESIGNAL, and stored program execution. The GET DIAGNOSTICS statement is also extended to support GET [STACKED] DIAGNOSTICS.

Error Logging

Split info logging and binary logging code (WL#6613) This work by Jon Olav Hauglid splits the binary log code from the general/slow/error log code in order to improve modularity and maintainability. This work also improves the code documentation.

Replication

Non blocking show slave status (WL#6402) This work by Nuno Carvalho adds a non-blocking option to SHOW SLAVE STATUS. The new option makes SHOW SLAVE STATUS non-blocking when run in parallel with STOP SLAVE (the STOP SLAVE command may take a long time to complete when it is waiting for an SQL thread to finish applying large events).

Add idempotent mode to mysqlbinlog (WL#6403) This work by Rohit Kalhans provides an idempotent mode of operation for the MySQL server. In this mode the server will ignore errors while applying row-based events from a binlog file. This mode is useful when a DBA wants to replay binlogs using mysqlbinlog against a MySQL server which may not contain all of the data, so that duplicate-key and no-key-found errors are suppressed.

Add rewrite-db option to mysqlbinlog for RBR (WL#6404) This work by Manish Kumar adds support for a rewrite-db option to mysqlbinlog, so that a RBR event can be applied to a different database/schema, i.e. “from_db” is replaced with “to_db” when reading a row-based event from the binlog.

MySQL Client

Log interactive commands to syslog (WL#6788) This work by Nirbhay Choubey introduces a new client option, --syslog, which enables/disables the logging of attributes like sudo_user (or user), mysql_user, connection_id, db_server, DB, and query entered in an interactive session. This has been requested by some users due to auditing compliance requirements, i.e. requirements to log all interactive commands entered at the mysql command-line prompt to syslog.

Parser refactoring

Use Bison ‘locations’ for accessing statement text elements (WL#5967) This work by Gleb Shchepa replaces a home grown mechanism with standard Bison locations.

Test suite improvements

One of the goals of 5.7 is to improve our automated tests.

Make the MTR binlog suite GTID_MODE agnostic (WL#6663) This work by Nuno Carvalho makes the MTR binlog suite GTID_MODE agnostic, so standard replication test suites can be run with --gtid-mode=ON.

Switch default engine to InnoDB in MTR (WL#6731) This work by Sayantan Dutta changes MySQL Test Run (MTR) such that it uses InnoDB, unless MyISAM is explicitly specified within the test.

Migrate parts of MTR test suite to run with InnoDB storage engine (WL#6734) This work by Viswanatham Gudipati migrates the partitioning (parts) test suite to run with InnoDB as the default storage engine.

Migrate sys_vars MTR test suite to run with the InnoDB storage engine (WL#6735) This work by Akhila Maddukuri migrates the sys_vars test suite to run with InnoDB as the default storage engine.

Deprecation & Removal

Remove support for .sym files (WL#6561) This work by Dmitry Lenev removes the old custom Windows symlink implementation from MySQL 5.7. Starting from Windows Vista/Server 2008, a native symlink mechanism is supported by Windows (see the mklink command). MySQL will now use the native Windows mechanism going forward.

Remove INSERT DELAYED (WL#6073) This work by Martin Hansson removes the support for INSERT DELAYED in 5.7 and onwards. INSERT DELAYED has a lot of limitations and issues attached to it, and with InnoDB as default engine it becomes even less relevant. We decided to keep the INSERT/REPLACE DELAYED syntax, but convert it to INSERT/REPLACE with a warning before execution.

Remove innodb_mirrored_log_groups (WL#6808) This work by Satya Bodapati removes the innodb_mirrored_log_groups option in 5.7. The allowed range of the parameter innodb_mirrored_log_groups was 1..10, but there has always been (at least since 3.23.53) a check within InnoDB startup that refuses any value except 1. This is not meaningful and thus we decided to remove it.

A Multi-Table Trick to Speed up Single-Table UPDATE/DELETE Statements

In MySQL, query optimization of single-table UPDATE/DELETE statements is more limited than for SELECT statements. I guess the main reason for this is to limit the optimizer overhead for very simple statements. However, this also means that optimization opportunities are sometimes missed for more complex UPDATE/DELETE statements.

Example

Using the DBT-3 database, the following SQL statement will increase prices by 10% on parts from suppliers in the specified country:
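
The statement could look roughly like this (a sketch assuming standard DBT-3 column names; the country is arbitrary):

    UPDATE part
    SET p_retailprice = p_retailprice * 1.10
    WHERE p_partkey IN
          (SELECT ps_partkey
             FROM partsupp
             JOIN supplier ON ps_suppkey = s_suppkey
             JOIN nation   ON s_nationkey = n_nationkey
            WHERE n_name = 'NORWAY');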

Visual EXPLAIN in MySQL Workbench shows that the optimizer will choose the following execution plan for this UPDATE statement:

[Visual EXPLAIN diagram: update-subquery]

That is, for every row in the part table, MySQL will check if this part is supplied by a supplier of the requested nationality.

Consider the following similar SELECT statement:
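
(Again a sketch with the same hypothetical DBT-3 names:)

    SELECT p_partkey, p_retailprice
      FROM part
     WHERE p_partkey IN
           (SELECT ps_partkey
              FROM partsupp
              JOIN supplier ON ps_suppkey = s_suppkey
              JOIN nation   ON s_nationkey = n_nationkey
             WHERE n_name = 'NORWAY');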

In MySQL 5.6, the query optimizer will apply semi-join transformation to this query. Hence, the execution plan is quite different from the similar UPDATE statement:

[Visual EXPLAIN diagram: select-semijoin]

As you can see, there is no sub-query in this plan. The query has been transformed into a three-way join. The great advantage of this semi-join transformation is that the optimizer is now free to re-arrange the order of the tables to be joined. Instead of having to go through all 179,000 parts, it will now start with the estimated 414 suppliers from the given country and find all parts supplied by them. This is obviously more efficient, and it would be good if MySQL would use the same approach for the UPDATE statement.

The Multi-Table Trick

Unlike single-table UPDATE statements, the MySQL Optimizer will use all available optimizations for multi-table UPDATE statements. This means that by rewriting the query as follows, the semi-join optimizations will apply:
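
A sketch of the rewrite, continuing with the hypothetical DBT-3 statement from above; the key point is the extra derived “dummy” table:

    UPDATE part, (SELECT 1) AS dummy
    SET p_retailprice = p_retailprice * 1.10
    WHERE p_partkey IN
          (SELECT ps_partkey
             FROM partsupp
             JOIN supplier ON ps_suppkey = s_suppkey
             JOIN nation   ON s_nationkey = n_nationkey
            WHERE n_name = 'NORWAY');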

Notice the extra dummy table in the first line. Here is what happens when I execute the single-table and multi-table variants on a DBT-3 database (scale factor 1):

As you can see, execution time is reduced from 2.63 seconds to 0.40 seconds by using this trick. (I had executed both statements several times before, so the reported execution times are for a steady state with all accessed data in memory.)

Multi-Table DELETE

The same trick can be used for DELETE statements. Instead of the single-table variant,
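
(a sketch, again with hypothetical DBT-3 names)

    DELETE FROM part
    WHERE p_partkey IN
          (SELECT ps_partkey
             FROM partsupp
             JOIN supplier ON ps_suppkey = s_suppkey
             JOIN nation   ON s_nationkey = n_nationkey
            WHERE n_name = 'NORWAY');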

you can use the equivalent multi-table variant:
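
(again a sketch, with the same dummy derived table)

    DELETE part
      FROM part, (SELECT 1) AS dummy
     WHERE p_partkey IN
           (SELECT ps_partkey
              FROM partsupp
              JOIN supplier ON ps_suppkey = s_suppkey
              JOIN nation   ON s_nationkey = n_nationkey
             WHERE n_name = 'NORWAY');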

This rewrite gives a similar performance improvement as reported for the above UPDATE statement.

MySQL-5.7 improves DML oriented workloads

In MySQL 5.7, we have improved the scalability of DML oriented workloads in InnoDB. This is the result of a number of changes, which I will outline below.

(1) Fix index->lock contention

Each index, both the clustered index and the secondary indexes, is protected by its own index->lock RW lock, which guards the structure of that index tree.

Before 5.7, every modification to a non-leaf page (that is, every modification of the tree structure) required taking this lock in exclusive (X) mode, so all concurrent access to the whole index was blocked for the duration. This was the major cause of index->lock contention in concurrent DML workloads.

In MySQL 5.7 concurrent access is now permitted to the non-leaf pages (internal nodes of the B+Tree) as long as they are not related to the concurrent tree structure modifications (WL#6326). This change reduces the major point of contention.

(2) Page cleaner thread optimizations

In MySQL 5.6, we introduced a dedicated page cleaner thread to handle background operations, including flushing dirty pages from the buffer pool to storage and maintaining a supply of free pages. By moving this work to its own thread, user threads were freed from doing this additional work. This reduced CPU cost and solved some cases of CPU-bound problems. However, in some DML-oriented workloads there were still too many tasks for a single page cleaner thread to keep up with. This could reduce performance, because user threads were then required to flush pages and keep a sufficient number of pages free themselves.

In MySQL 5.7, there have been two improvements in this area:

  1. The buffer pool list scans (e.g. flush_list, LRU) for flushing have been optimized and reduced in cost (WL#7047). This also improves the user threads’ flush/evict page operation (used to obtain a free page), which becomes necessary when the page cleaner thread has fallen too far behind. This change lowers the performance risk when the page cleaner cannot perform enough work due to sub-optimal configuration settings.
  2. Multiple page cleaner threads are now supported, allowing these operations to occur in parallel (WL#6642).

(3) log_sys->mutex optimization

MySQL 5.7 reduces the impact of log_sys->mutex, which is held to control access to the log buffer and log writing. The impact of this change is most visible when innodb_flush_log_at_trx_commit=2, because log writes that do not require a sync are no longer blocked waiting for a sync.

(4) Avoiding the ‘read-on-write’ during transaction log writing

The InnoDB transaction log is written in blocks of 512 bytes, which is often smaller than the block size of the underlying device or file system. If the relevant portion of the transaction log is not resident in the OS cache, the OS may first have to read the rest of the underlying device block, merge the InnoDB transaction log write into it, and then write out the whole block. We refer to this as a read-on-write: an extra read performed only to preserve existing contents of the transaction log that are no longer needed.

In MySQL 5.7 we address this problem by adding a new option of innodb_log_write_ahead_size. This allows the user to effectively pad write operations to complete the full block of the underlying device or file system, negating the need for a read-on-write modification. This change results in better stability of log throughput as there will no longer be a situation where some writes are effectively cached and others will not be cached.

We continue to investigate other ways of addressing this problem. For example, on an SSD, deallocation like FALLOC_FL_PUNCH_HOLE might be better if it is supported.

(5) Future improvements

We are continuing to focus on improving DML performance for 5.7. Some of our next areas of research include:

  • Implementing improvements to the adaptive flushing algorithm (suggestion by Dimitri Kravtchuk)
  • Setting a thread priority for the page_cleaner (in Linux for now)
  • Addressing an issue where a burst of flushing can occur when the oldest modification reaches max_modified_age_sync (lowering the risk of reaching max_modified_age_sync, and keeping throughput steady while flushing around that point)
  • Introducing page fill factor to control frequency of merge/split of the index pages

Important Change in Behavior: MySQL 5.7 will be more sensitive to flushing-related options

As a result of the above improvements (including the future work), MySQL 5.7 will respect configuration settings much more closely, and adjusting settings to reflect the IO capabilities of the underlying hardware devices becomes more important for optimizing throughput. For example, settings that are too conservative may prevent the page cleaner thread from completing enough work.

innodb_io_capacity_max ≤ [actual max write pages/s]

As the result of the adjustments, 5.7 will always try to respect innodb_io_capacity_max for flush_list flushing. If the amount of outstanding work is too large, the page cleaner might spend too much time performing flush_list flushing and not complete some of the other tasks required of it. The actual maximum “write pages/s” can be confirmed by watching PAGES_WRITTEN_RATE value of INFORMATION_SCHEMA.INNODB_BUFFER_POOL_STATS, for example.

innodb_buffer_pool_instances × innodb_lru_scan_depth ≥ [actual max read pages/s]

The setting innodb_lru_scan_depth can now be considered the target number of free pages for each buffer pool instance during the page cleaner’s flushing operation. A single round of page cleaner tasks is also intended to be completed within one second. So, the achievable “read pages/s” is bounded by innodb_buffer_pool_instances × innodb_lru_scan_depth. Setting innodb_lru_scan_depth to a very high value is not recommended, because the free-page-keeping batch might then take too long. (The actual maximum “read pages/s” can be confirmed by watching the PAGES_READ_RATE value of INFORMATION_SCHEMA.INNODB_BUFFER_POOL_STATS, for example.)
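
For example, both rates can be observed with a query along these lines:

    SELECT POOL_ID, PAGES_WRITTEN_RATE, PAGES_READ_RATE
      FROM INFORMATION_SCHEMA.INNODB_BUFFER_POOL_STATS;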

Re-factoring some internals of prepared statements in 5.7

When the MySQL server receives a SELECT query, the query goes through several consecutive phases:

  • parsing: SQL words are recognized, the query is split into different parts following the SQL grammar rules: a list of selected expressions, a list of tables to read, a WHERE condition, …
  • resolution: the output of the parsing stage contains names of columns and names of tables. Resolution is about making sense of these names. For example, in “WHERE foo=3”, “foo” is a column name without a table name; by applying SQL name resolution rules, we discover the table that contains “foo” (it can be complicated if subqueries or outer joins are involved).
  • optimization: finding the best way to read tables: the best order of tables, and for each table, the best way to access it (index lookup, index scan, …). The output is the so-called “plan”.
  • execution: we read tables as dictated by the plan above, and send output rows to the client.

This design hasn't changed in many years. Originally, MySQL didn't have prepared statements, so it took a query, passed it through the phases above, and then threw it away. This still happens nowadays when using non-prepared statements.
But with a prepared statement come more requirements. MySQL has to be able to execute the query a second time (and a third time, and so on, at the user's will). No matter which API is used (the C API or the SQL API), this usually looks like the following dialogue:
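(a sketch using the SQL API; the table name t is just an example)

PREPARE stmt FROM 'SELECT * FROM t';
EXECUTE stmt;              -- first execution
EXECUTE stmt;              -- second execution, and so on
DEALLOCATE PREPARE stmt;   -- when the client is done with the statement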

SQL mandates that syntax and semantic errors be reported by PREPARE, without postponing them to EXECUTE. So, PREPARE needs to do at least parsing and resolution, to spot any such error.
Then there is an implementation question: when it receives "EXECUTE stmt", how can MySQL remember what “stmt” is?
The simplest, most obvious solution would be, during PREPARE, to simply store "stmt" => "SELECT * FROM t" in some session-local key/value dictionary. When "EXECUTE stmt" comes, MySQL would find the query string in the dictionary, and go through all stages (parsing, resolution, optimization, execution). And when the next "EXECUTE stmt" comes, it would do the same again.
Of course, that would not be efficient. At least parsing and resolution theoretically don’t need to be re-done every time: the query’s text doesn’t change, so only one parsing is necessary; the structure of tables (names and types of columns) doesn’t change, so only one resolution is necessary 1. So, what PREPARE actually stores in the key/value dictionary is "stmt" => tree, where “tree” is the output of the resolution phase; it’s an in-memory tree of C++ objects, each object representing an expression, a column of a table, a table, a left join, a boolean condition, …
With this design, only optimization and execution are repeated by each "EXECUTE stmt" 2. Repeating optimization is sometimes a smart thing to do; imagine that between the first and second EXECUTE, the table’s data distribution has significantly changed (perhaps due to other clients), then the plan of the first EXECUTE is maybe not the best one anymore, and searching for a new plan makes sense. In other cases, repeating optimization is a waste, because the old plan would be good enough. As of today, MySQL always repeats optimization; making this more flexible would be an interesting enhancement for the future, but today’s situation is: we repeat optimization.

Optimization does everything it can to speed up the query. Imagine the query references 10 tables in its FROM clause. To find the best order to access those 10 tables, we need to consider 10! possible orders, roughly 3,600,000. It takes time to consider that many plans. The optimizer has techniques to cut this down; let me show one of them. If one table is referenced like this in the WHERE clause:
WHERE t.pk=39 AND ...
then, because “t.pk” is a primary key, I know that at most one row of this table will participate in the result, and it makes sense to read it with a primary key lookup. And this read has to be done only once in this execution. So let’s put “t” first in the plan, unconditionally. This technique, known as “constant tables”, divides the number of possible orders by a factor of 10 in our example (only 9 tables left to order). Now, imagine that the WHERE clause looks like this:
WHERE t.pk=39 AND t.c1>t2.c2
As soon as I read the row of “t” with a primary key lookup (looking up 39), I get access to the value of t.c1; say it is 100. My condition can thus be simplified to:
WHERE 100>t2.c2
Notice how the AND has disappeared.
The inequality predicate “100>t2.c2” is interesting: assuming that t2.c2 is indexed, it means that a range scan is a possibility for reading t2.
This little example was meant to demonstrate that MySQL, in the optimization phase, does data-based transformations to the query: transformations which depend on the content of tables, and which apply to one single execution; indeed, when the next execution comes, it may be that the row of “t” with t.pk=39 now has t.c1=1200, so our condition simplifications don’t hold anymore. Another way to say this is: data-based transformations must be non-permanent.
So the AND operator which was killed above during the optimization of the first EXECUTE must be resurrected for the optimization of the second EXECUTE.
To achieve this, at the end of resolution we rename tree to permanent_tree, then we make a copy of it, which we name tree. The optimization phase has access to tree, and no access to permanent_tree. The optimization operates on tree, does data-based transformations on it. When the second EXECUTE starts, permanent_tree is fetched, copied into a new tree, on which the optimization operates. For example, the AND operator always exists in permanent_tree.

What about permanent transformations? For example, consider this query:
SELECT * FROM t1 JOIN (t2 JOIN t3 ON t2.a=t3.a) ON t2.b+t3.b>t1.b
WHERE t1.b*2>15;

The result of this query is defined by the SQL standard as: get the result of the innermost join (t2,t3), filter it with t2.a=t3.a, then join the result with t1, filter with t2.b+t3.b>t1.b, then filter with t1.b*2>15. With this definition, the order in which tables are read and the order in which conditions are applied are constrained. For example, reading t2 then t1 then t3 is not possible. But if we notice that the query is equivalent to:
SELECT * FROM t1, t2, t3 WHERE t2.a=t3.a AND t2.b+t3.b>t1.b
AND t1.b*2>15;

then we have 3! = 6 possible orders for tables. More plans to examine, but more potential for finding a good plan – it is very possible that the best plan, the one yielding the fastest execution, is not among the ones suggested by the SQL standard’s definition.
The equivalence between both queries is semantics-based, not data-based. Thus, the transformation from the first to the second query can be permanent: it can be done once and for all, not at every EXECUTE.
Permanent transformations include:

  • JOIN-to-WHERE, like in the example above
  • outer-join-to-inner-join, when there is something in the WHERE clause which allows us to deduce that NULL-complemented rows will actually not participate in the result
  • semi-join, more or less merging an “IN (subquery)” predicate into the parent query
  • IN-to-EXISTS, rewriting “x IN (subquery)” to “EXISTS (modified subquery)”.
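As an illustration of the last item (a hypothetical rewrite that ignores NULL-handling details, not the exact internal form), a predicate such as:

WHERE t1.a IN (SELECT b FROM t2)

can be turned into something equivalent to:

WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a)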

After giving all this background (phew…), I’m now ready to explain one re-factoring which I did in MySQL 5.7.4. The situation of permanent transformations in MySQL 5.6 is the following:

  1. IN-to-EXISTS : done in the resolution phase.
  2. semi-join, JOIN-to-WHERE, outer-join-to-inner-join: done in the optimization phase.

Doing permanent transformations during optimization (step (2) above) is surprising, as optimization is re-done at every EXECUTE. Fortunately, we can internally tell whether an optimization is the first one or not; if it is not, we skip step (2).
So in the end, efficiency is guaranteed – permanent transformations are not re-done. Though this design admittedly looks strange.

Now, putting the pieces together, on the way to more strangeness: you remember that after the resolution phase, we produce permanent_tree and save it for all future EXECUTEs. It thus contains the permanent transformations done in resolution; good. But it does not contain those done in optimization (semi-join, …), as optimization runs after permanent_tree has been produced. Still, we do want semi-join and friends to be reused by all future EXECUTEs, so they must be put in permanent_tree! So we get an even stranger design:

  1. resolution
    1. do some permanent transformations (IN-to-EXISTS) on tree
    2. rename tree to permanent_tree
    3. copy permanent_tree to a new tree (for optimization to do nonpermanent transformations on it)
  2. optimization
    1. if first optimization:
      1. do some more permanent transformations (semijoin, etc) on tree
      2. throw permanent_tree away
      3. copy tree to a new permanent_tree
    2. do nonpermanent transformations on tree
  3. execution
    1. read tables and send rows
    2. throw tree away

This has a few nasty effects:

  • permanent transformations are scattered: the boundary between phases is blurred, which is a code readability problem and an impediment to future improvements in the Optimizer in general
  • efficiency loss: copying a tree in the resolution phase takes some time and memory; in the first optimization we throw permanent_tree away and do the copying again… Bad.
  • real bugs. Yes, because what the resolution phase believes about permanent_tree is no longer true: that object has been deleted and replaced by another one during the first optimization… so the next EXECUTE gets confused…

In MySQL 5.7.4, I have moved all permanent transformations to where they belong, so now we have the more straightforward design:

  1. resolution
    1. do all permanent transformations on tree
    2. rename tree to permanent_tree
  2. optimization
    1. copy permanent_tree to a new tree (for optimization to do nonpermanent transformations on it)
    2. do nonpermanent transformations on tree
  3. execution
    1. read tables and send rows
    2. throw tree away

If we had done this re-factoring earlier, we would have saved ourselves some headaches. But better late than never. It’s at least comforting that nowadays we have time to do re-factoring, even if it means spending several man-months on a task like that. It really did take that much time: the idea may look simple, but the devil was in the details as usual, and on top of coding time there was review time by two reviewers, and QA time to verify that I didn’t break anything… But in my team we are convinced that this is a long-term investment which will pay off. Moreover, when we do such re-factoring work, it gives us the occasion to remove the little hacks which accumulated over time to work around the root problem (which the re-factoring finally addresses). And that leads to even more code simplification.

There are a few more details in the Worklog page if you are interested (note the multiple tabs there).

That’s all for today. If you reached this line, congratulations :-)

Footnotes:
1. I hear you say: “what if an ALTER TABLE is done between PREPARE and EXECUTE?! This may change names and types of columns!”. Yes, you are right, but there exists, somewhere in the prepared statement subsystem, a detector for this; it runs when EXECUTE starts, and if it spots a table structure change since the statement was prepared, it throws away the prepared statement (as “out-of-date”), silently re-prepares it, and then proceeds with EXECUTE; the user does not notice, except if she looks at Com_stmt_reprepare in SHOW STATUS. This detector is external to the Optimizer and runs before it; so if the Optimizer is involved in EXECUTE, it can safely assume that nothing has changed since PREPARE.
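For example, that counter can be checked with:
SHOW SESSION STATUS LIKE 'Com_stmt_reprepare';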
2. Ok, a part of resolution is repeated too. It would be good to avoid that, in the future.

MySQL EXPLAIN Explained

In years past, MySQL was a bit of a black box when it came to understanding what was happening and why. In MySQL 5.6 and 5.7, we’ve added many new features that provide much needed transparency and insight into the inner workings of MySQL. The single biggest feature was the new Performance Schema, but some other examples are:

  1. The ability to see what query generated a row-based binary log event.
  2. The ability to see a tremendous amount of data points for InnoDB.
  3. The ability to see an extended query execution plan using EXPLAIN FORMAT=JSON (which can also now be used for running queries).
  4. The ability to see a query execution trace via the new optimizer trace functionality.

I’m going to focus on #3 and #4 here. This will hopefully whet your appetite for a webinar coming up where I will dive deeper into the EXPLAIN related features.

Toward that end, I’ll show you an example of the JSON-formatted EXPLAIN, an optimizer trace, and finally Workbench’s Visual Explain. For all of these, I’ll be using the Sakila test database.

JSON EXPLAIN
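For example, a JSON-formatted plan for a simple query against the Sakila database can be requested like this (a hypothetical example query; the output, omitted here, is a nested JSON document describing the chosen plan):

EXPLAIN FORMAT=JSON
SELECT first_name, last_name
FROM sakila.actor
WHERE last_name = 'DAVIS';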

Optimizer Trace
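A trace of the optimizer's decisions for the same hypothetical query can be captured along these lines:

SET optimizer_trace='enabled=on';
SELECT first_name, last_name
FROM sakila.actor
WHERE last_name = 'DAVIS';
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace='enabled=off';

The TRACE column holds a JSON document that follows the statement through its join preparation, optimization, and execution steps.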

Workbench Visual Explain
These new insights also allow clients to provide additional visual representations of this data. The new Visual Explain functionality in MySQL Workbench 6.1 is a great example. Let’s look at that query again:

Example Visual Explain

Conclusion
Again, if all of this looks interesting, but a bit overwhelming, then please join Matt and me for our webinar coming up this month.

Rankings with InnoDB Full-Text Search

Summary
Since MySQL 5.6 went GA—where among many other new features, we introduced full-text indexes for the InnoDB storage engine—questions have occasionally come up about InnoDB full-text search relevancy rankings when doing BOOLEAN MODE searches. Typically these questions revolve around a core issue: why do the results differ from those of MyISAM? In short, the InnoDB document search and relevancy rankings were modeled after the open source Sphinx full-text search engine, and not after the MyISAM storage engine (which sometimes simply produces incorrect results). The algorithms used for the InnoDB implementation are based on the well-known BM25 and TF-IDF ranking algorithms.



Variables
Before we get to some examples, we should first talk about the variables involved that can alter the results (differences in these settings are also often the cause of slightly different results between InnoDB and MyISAM). I won’t go through each one again here, as they (innodb_ft_*) are described in the manual page. The ones that directly affect search results (innodb_ft_min_token_size, innodb_ft_max_token_size, innodb_ft_server_stopword_table, innodb_ft_user_stopword_table, and innodb_ft_enable_stopword) are also discussed here. These variables affect the core concepts involved:

  • TOKENS — The index actually contains tokens, which you can simply think of as individual words. The common token separators or word boundaries (at least in Latin-based languages) are whitespace and punctuation characters: dashes, underscores, periods, etc. If a word is shorter than the minimum token size (innodb_ft_min_token_size), then it will not be indexed and thus will not be used in searches. The same is true if the word is longer than the maximum token size (innodb_ft_max_token_size).
  • STOPWORDS — A stopwords list contains words that will explicitly be skipped when creating the index records. So any word in this list will not be used for searches.
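The current values of all of these variables can be checked on a running server with, for example:

SHOW GLOBAL VARIABLES LIKE 'innodb_ft%';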



How the Calculation Is Done
The relevancy ranking system we use is a variation of “term frequency / inverse document frequency” (TF-IDF). In short, the more often the word appears in an individual document (“document” in this context is simply a full-text index record), and the less frequently the word appears in all of the documents, the higher the individual document is ranked.

The IDF (inverse document frequency) value of a search word is calculated as:
${IDF} = log10( ${total_records} / ${matching_records} )

Unlike the IDF value, the TF (term frequency) value is calculated for each matching record, and it is simply the number of times that the search word appears in that matching document. If the document contains the search word/token multiple times, then its IDF value is multiplied by the TF value.

The final relevancy ranking for a document is then calculated this way:
${rank} = ${TF} * ${IDF} * ${IDF}



Creating Some Sample Data
Let’s first create some sample data, so that we can walk through a simple example:
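The exact sample rows are assumed here; a hypothetical table along the following lines (8 rows in total, 3 of which contain the word “database”, with row 6 containing it 6 times) is enough to follow the walkthrough below:

CREATE TABLE quotes (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  quote TEXT,
  FULLTEXT KEY ft_quote (quote)
) ENGINE=InnoDB;

INSERT INTO quotes (quote) VALUES
  ('MySQL is a popular open source server'),                   -- id 1
  ('InnoDB is the default storage engine'),                     -- id 2
  ('A database stores structured data'),                        -- id 3
  ('Full-text search finds words quickly'),                     -- id 4
  ('Indexes can speed up queries'),                             -- id 5
  ('database database database database database database'),    -- id 6
  ('Regular backups protect your database'),                    -- id 7
  ('Replication copies data between servers');                  -- id 8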



A Simple Example
So far, all of this may sound more complicated than it really is, so let’s look at a simple example (using MySQL 5.6 Enterprise):
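A search for the word “database” might then look like this (a hypothetical query, using the table sketched above):

SELECT id, quote,
       MATCH(quote) AGAINST('database' IN BOOLEAN MODE) AS score
FROM quotes
WHERE MATCH(quote) AGAINST('database' IN BOOLEAN MODE)
ORDER BY score DESC;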

We have 8 records in total, with 3 of them matching the search term “database”, and the first record (id 6) contains the search term 6 times. Let’s now focus on the first record (id 6), and see how its relevancy ranking was calculated at 1.0886961221694946. I’ll use easy to understand pseudo code to explain it, but you can see the source code and the actual calculations that InnoDB uses here.

First we calculate the IDF value for the search term “database”:
${IDF} = log10( 8 / 3 ) = 0.42596873216370745

Then we factor in the TF value (6) for the first record (id 6) in determining the final relevancy ranking value for it:
${rank} = (6 * 0.42596873216370745 * 0.42596873216370745) = 1.08869616468694

You may notice a slight difference between the two results: 1.08869616468694 versus 1.0886961221694946. That is the result of differences related to floating point precision and rounding decisions, and how they are done internally by InnoDB versus how they’re done in the mysql command-line client or in your chosen calculator (I happen to be using Calculator.app in OS X Mavericks in the calculations above).
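For example, the same arithmetic can be repeated directly in the mysql command-line client, which will typically produce yet another slightly different set of trailing digits:

SELECT 6 * POW(LOG10(8/3), 2) AS ranking;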



Some Simple Multi-Word Search Examples
If you search for multiple words, then the final relevancy ranking value is simply a sum of the previously described calculation, done for each search word. Again, the calculation is:
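That is, for each matching document, summing over the search words:

${rank} = ( ${TF_word1} * ${IDF_word1} * ${IDF_word1} )
        + ( ${TF_word2} * ${IDF_word2} * ${IDF_word2} )
        + …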



Further Details
For further information, and a deeper dive into the functionality, please see Jimmy’s Dr. Dobb’s article.



Conclusion
Hopefully this has helped a bit in understanding the document relevancy rankings used for InnoDB full-text searches. If you feel that any specific results are incorrect, please let us know! When checking for correctness, the results should be validated against the BM25 and TF-IDF ranking algorithms, or simply compared against the open source Sphinx full-text search engine. If you have a specific odd case, feel free to mention the details in a comment below, or in a new bug report here.