The InnoDB Change Buffer

One of the challenges in storage engine design is random I/O during a write operation. In InnoDB, a table will have one clustered index and zero or more secondary indexes.  Each of these indexes is a B-tree.  When a record is inserted into a table, the record is first inserted into clustered index and then into each of the secondary indexes.  So, the resulting I/O operation will be randomly distributed across the disk.  The I/O pattern is similarly random for update and delete operations. To mitigate this problem, the InnoDB storage engine uses a special data structure called the change buffer (previously known as the insert buffer, which is while you will see ibuf and IBUF used in various internal names).

The change buffer is another B-tree, with the ability to hold the record of any secondary index.  It is also referred to as a universal tree in the source code.  There is only one change buffer within InnoDB and it is persisted in the system tablespace.  The root page of this change buffer tree is fixed at FSP_IBUF_TREE_ROOT_PAGE_NO (which is equal to 4) in the system tablespace (which has space id of 0). When the server is started, the change buffer tree is loaded by making use of this fixed page number.   You can refer to the ibuf_init_at_db_start() function for further details.

The total size of the change buffer is configurable and is designed to ensure that the complete change buffer tree can reside in main memory.   The size of the change buffer is configured using the innodb_change_buffer_max_size system variable.

Overview of Change Buffering

Change buffering is applicable only to non-unique secondary indexes (NUSI).  InnoDB buffers 3 types of operations on NUSI: insert, delete marking, and delete.  These operations are enumerated by ibuf_op_t within InnoDB:

One important point to remember is that the change buffering is leaf page oriented. A particular operation to NUSI is buffered in the change buffer only if the relevant non-root leaf page of the NUSI is not already available in the buffer pool.  This means that the buffered change is predefined to happen in a particular leaf page of a NUSI within the InnoDB system.  This makes it necessary to track the free space available in the NUSI leaf pages.  This tracking is necessary because merging these buffered operations to the NUSI leaf page must not result in a B-tree page split or B-tree page merge.

Special Change Buffer Fields

When a NUSI record is buffered in the change buffer, 4 special change buffer fields are added to the beginnning of the NUSI record.  Each of these 4 fields and their contents are explained below.  The primary key of the change buffer tree is then {space_id, page_no, count}, where the count helps to maintain the order in which the change is buffered for that particular page.  The change buffer row format has evolved over a period of time and the following table provides information for MySQL 5.5+:

Field NumberMacro NameField LengthField Description
1IBUF_REC_FIELD_SPACE4 bytesThe identifier of the space in which NUSI exists.
2IBUF_REC_FIELD_MARKER1 bytesIndicates whether the change buffer row format is old or new. If present (and zero), row format is newer (MySQL 4.1+)
3IBUF_REC_FIELD_PAGE4 bytesThe leaf page number of the NUSI to which the buffered row belongs.
4IBUF_REC_FIELD_METADATA2 bytesCounter field, used to sort records within a (space_id, page_no) in the order they were added.
1 byteOperation type (ibuf_op_t).
1 byteRow format flag. If 0, the user index record is in REDUNDANT row format. If IBUF_REC_COMPACT, then user index record is in COMPACT row format.
Variable length. DATA_NEW_ORDER_NULL_TYPE_BUF_SIZE (which is 6) bytes per field.Type information affecting the alphabetical ordering of the fields and the storage size of an SQL NULL value.
5IBUF_REC_FIELD_USERthe first user field.

The row format of the change buffer records themselves are always REDUNDANT.

Change Buffer Bitmap Page

The free space information for each page is tracked in predefined pages called the change buffer bitmap page, which is also known as the ibuf bitmap page.  These pages always follow the extent descriptor pages.  The following table gives the predefined page numbers of the change buffer bitmap pages:

Page Size in Kilobytes (KB)Extent Size in PagesExtent Descriptor Page NumbersChange Buffer Bitmap Pages (ibuf bitmap pages)Number of pages described by one ibuf bitmap page
42560, 4096, 8192, 12288, ...1, 4097, 8193, 12289, ...4096
81280, 8192,16384, 24576, ...1, 8193, 16385, 24577, ...8192
16640, 16384, 32768, 49152, ...1, 16385, 32769, 49153, ...16384
32640, 32768, 65536, ...1, 32769, 65537, ...32768
64640, 65536, 131072, ...1, 65537, 131073, ...65536

The page number 1 is also referred to as the FSP_IBUF_BITMAP_OFFSET. These change buffer bitmap pages help to answer the following questions quickly:

  • Does the given page have any buffered changes in the ibuf (insert/change buffer) tree?  This question will be asked when a page is read into the buffer pool.  The buffered changes will be merged to the actual page before putting it into the buffer pool.
  • Does the given page have enough free space so that a change can be buffered? This question will be asked when we want to modify a leaf page of NUSI and it is not already available in the buffer pool.

In the next section we will see at the information stored in the ibuf bitmap page which helps to answer above questions.

Information Stored in Change Buffer Bitmap Page

The change buffer bitmap page uses 4 bits (IBUF_BITS_PER_PAGE) to describe each page. It contains an array of such 4 bits describing each of the pages.  This whole array is called the “ibuf bitmap” (insert/change buffer bitmap).  This array begins after the page header at an offset equal to IBUF_BITMAP (which is equal to 94). Given a page number, the ibuf bitmap page that contains the 4 bit information on the given page can be calculated as: ulint bitmap_page_no = FSP_IBUF_BITMAP_OFFSET + ((page_no / page_size) * page_size); .

You can refer to the ibuf_bitmap_page_no_calc() function for more details on the complete calculation.  Likewise, given a page number, the offset within the change buffer bitmap page contains the 4 bits that can be used easily for the calculations.  I leave this as an exercise to the reader (refer to function ibuf_bitmap_page_get_bits_low() for further info). The following table provides details about these 4 bits:

IBUF_BITMAP_FREE0The first two bits are used to represent the free space available in the leaf page of the NUSI.
IBUF_BITMAP_BUFFERED2The third bit if set means that the leaf page has buffered entries in the change buffer.
IBUF_BITMAP_IBUF3The fourth bit if set means that this page is part of the change buffer.

This means that only 2 bits are available to store the free space information of the page.  There are only 4 possible values: 0, 1, 2, 3.  Using these 2 bits, we try to encode the free space information for a page.  The rule is as follows — there must be at least UNIV_PAGE_SIZE / IBUF_PAGE_SIZE_PER_FREE_SPACE bytes of free space for the change buffer to be used:

Tracking Free Space of Pages

Before an insert operation (IBUF_OP_INSERT) is buffered, the free space available in the target NUSI leaf page is approximately calculated using the information available in the change buffer bitmap page. This conversion is done in the ibuf_index_page_calc_free_from_bits() function and the formula used is:

The following table provides the conversions done from the encoded value found within the change buffer bitmap page to a meaningful value in bytes:

Page Size in Kilo Bytes (KB)Free Space Information from IBUF bitmap page
The free space available in NUSI leaf page (approximately assumed)
400 bytes
1128 bytes
2256 bytes
3512 bytes
1600 bytes
1512 bytes
21024 bytes
32048 bytes

Using this information, we can determine if the record to be buffered will fit into the page or not.  If there is enough space then the insert will be buffered.  Using this approach, we ensure that merging these records to the target NUSI will not result in a page split.

Updating Free Space Information

After buffering an insert or delete operation, the free space information in the change buffer bitmap page must be updated accordingly (a delete mark operation will not change the free space information).  To update the free space information we need to convert the free space in bytes back to the IBUF encoded value.  This is done in the ibuf_index_page_calc_free_bits() function using the following formula:

In the above formula, max_ins_size is the maximum insert size (maximum free space) available in the page after page re-organization.

Record Count in NUSI Leaf Pages

In the case of a purge operation  (IBUF_OP_DELETE),  we need to ensure that the number of records in the target NUSI leaf page doesn’t go to zero because in InnoDB, the leaf pages of B-tree are not allowed to become empty.  They must have at least 1 record. Since the number of records in the target NUSI page is unknown (because it is not loaded into the buffer pool yet), the buffered insert operations are taken into account.  If 1 insert operation is buffered then it is assumed that the target NUSI page has 1 record, and if 2 insert operations are buffered then it is assumed that the target NUSI page has 2 records and so on.  In this way the number of records in the target NUSI page is calculated.  Based on this calculated record count, a purge operation is either buffered or not.  This means that if there are no buffered insert or delete-mark operations, then no purge operations can be buffered.

Merging the Buffered Changes to Target NUSI Page

The changes to NUSI leaf pages are buffered in the change buffer if the NUSI leaf page is not already in the buffer pool. These buffered operations are merged back into the actual NUSI leaf page under various circumstances:

  1. When these NUSI leaf pages are subsequently read into the buffer pool, the buffered operations will be merged.  A NUSI leaf page could be read into the buffer pool during an index lookup, an index scan or because of read ahead.
  2. When the master thread of InnoDB periodically does change buffer merges by calling ibuf_merge_in_background().
  3. When there are too many operations buffered for a particular NUSI leaf page.
  4. When the change buffer tree reaches its maximum allowed size.

The change buffer merge operation is initiated by calling the ibuf_merge_in_background() or ibuf_contract() functions. The change buffer merges are done either in the foreground or in the background. The foreground change buffer merges are done as part of a DML operation and hence will affect the performance experienced by the end user.  The background change buffer merges are instead done periodically when there is less activity in the server.

We ensure that the change buffer merge does not result in a B-tree page split or page merge operation. It also shouldn’t result in an empty leaf page.  Before the target NUSI leaf page are placed into the buffer pool, the buffered changes are applied to them. Once the buffered changes are merged for a page, its associated 4 bits of information in the change buffer bitmap page are also updated.


This article provided an overview of the change buffer subsystem of InnoDB.  It explained the additional fields that are added to a secondary index record before storing it within the change buffer.  It provided information about how the change buffer keeps track of free space information of the NUSI leaf pages by making use of the change buffer bitmap page.  It also explained the need to calculate the number of records in the NUSI leaf pages so that it doesn’t become empty.

Thanks to Marko Makela for reviewing this article and helping to make it more accurate. If you have any questions, please feel free to post a comment below.

That’s all for now. THANK YOU for using MySQL!


Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.