InnoDB Crash Recovery Improvements in MySQL 5.7


InnoDB is a transactional storage engine. Two parts of the acronym ACID (atomicity and durability) are guaranteed by write-ahead logging (WAL) implemented by the InnoDB redo log.

A statement within a user transaction can consist of multiple operations, such as inserting a record into an index B-tree. Each low-level operation is encapsulated in a mini-transaction that groups page-level locking and redo logging. For example, if an insert would cause a page to be split, a mini-transaction will lock and modify multiple B-tree pages, splitting the needed pages, and finally inserting the record.

On mini-transaction commit, the local mini-transaction log will be appended to the global redo log buffer, the page locks will be released and the modified pages will be inserted into the flush list. Only after the log buffer has been written out to the redo log file, InnoDB can write the modified pages from the buffer pool back to the tablespace files.

If the system crashes before the end marker of the mini-transaction log is written out to the redo log, the entire mini-transaction (and any subsequent mini-transactions) will be discarded. An operation is made durable by a combination of mini-transaction commit and a write to the redo log file. (Unless the undo log header update for the user transaction commit was written to the log before the crash, the user transaction will be rolled back.)

Problems with Single-Table Tablespaces

MySQL 4.1 introduced the setting innodb_file_per_table, which instructs tables to be created in separate files, instead of creating them inside the InnoDB system tablespace, which is usually called ibdata1.

The InnoDB redo log is referring to tablespaces by a number, space_id, while the file system knows the tablespaces as tablename.ibd. If InnoDB startup notices that the system was not shut down cleanly (there were some log records written since the latest log checkpoint), it will have to construct a mapping from space_id to file names, so that the redo log records can be replayed to compensate for the missing writes of modified pages from the buffer pool to the files.

To construct this mapping, InnoDB used to traverse the data directory, reading the first pages of all *.ibd files. This can cause a lot of unnecessary downtime when only a few files were modified since the latest log checkpoint. There could be thousands of files.

When MySQL 5.6 implemented the DATA DIRECTORY clause for InnoDB tables, it introduced *.isl files as placeholders, pointing to the real location of *.ibd files. This added further complexity to the tablespace file discovery.

The MySQL 5.7 Solution: Write File Names to the Redo Log

MySQL 5.7 introduces a new redo log record type MLOG_FILE_NAME for identifying those non-predefined files that were changed since the latest log checkpoint. To avoid growing the volume of the redo log, only one record will be emitted for each tablespace that was modified since the latest checkpoint.

This change narrowly missed the MySQL 5.7.4 milestone release. It is included in the MySQL Labs Release based on 5.7.4. Note that this will change the redo log format, making upgrades and downgrades impossible unless the system was shut down cleanly before upgrading or downgrading.

The objective of this change is to eliminate the use of the file system as a ‘data dictionary’ during redo log processing (before applying redo log):

  • Do not read the first page of all $datadir/*/*.ibd files
  • Do not check the contents of $datadir/*/*.isl files

These changes will improve reliability as follows:

  • We can ignore extra *.ibd files that are not attached to the InnoDB instance. For example, if the system crashes before the completion of ALTER TABLE…IMPORT TABLESPACE, there could be files with duplicate space_id that could currently cause trouble. Thanks to the MLOG_FILE_NAME redo log records introduced in this feature, redo log apply can ignore such files unless there is a possible name clash due to RENAME TABLE.
  • We will not silently discard redo log records if some *.ibd file is missing without the redo log containing a MLOG_FILE_DELETE record. For example, if a file rename went bad and recovery failed because of a missing tablespace file, the DBA can manually rename the file and restart crash recovery. In innodb_force_recovery mode, missing *.ibd files will continue to be ignored.
  • Failure scenarios related to inconsistent *.isl files will be eliminated during redo log apply. Redo log records will contain references to *.ibd file names; the *.isl files will only be used after redo log apply when opening tables.

This work is only a part of ongoing reliability improvements for InnoDB. Watch this space for updates. ☺

About Marko Mäkelä

Marko Mäkelä joined Innobase Oy in 2003 after completing his doctoral thesis in theoretical computer science at the Helsinki University of Technology. Since then, companies have changed owners, but Marko is still working on the internals of the InnoDB storage engine.

4 thoughts on “InnoDB Crash Recovery Improvements in MySQL 5.7

  1. Alice, when InnoDB creates a non-temporary table, a large number of redo log entries will be written. The data dictionary is stored in tables as well. So, there will be redo log records to cover the following:
    1. Write undo log record for rolling back the INSERT to some dictionary table
    2. Insert into the clustered index of the dictionary table.
    3. Insert into each secondary index (if any).
    4. Update the undo log record to mark the transaction committed.
    Each of these lines would be a separate mini-transaction (mtr_t::commit in InnoDB source code).
    In addition to these, there would be the file creation, if it is a file-per-table tablespace. With this change, there will be no redo log record written for creating a file. Instead, there will be a MLOG_FILE_NAME record for initializing the first pages of the file.

    Note that the DDL operations will still not be entirely crash-safe in MySQL 5.7, even if we ignore the fact that the *.frm files can get out of sync with the internal data dictionary in InnoDB. See for example:
    Bug#74663 DROP TABLE corrupts table | inode line 1887 fseg_inode_get

    We aim to have crash-safe DDL in the Global Data Dictionary:

Leave a Reply

Your email address will not be published. Required fields are marked *