Externally Stored Fields in InnoDB

This article discusses the storage (inline and external) of field data in the InnoDB storage engine. All fields of variable length types like VARCHAR, VARBINARY, BLOB and TEXT can be stored inline within the clustered index record, or stored externally in separate BLOB pages outside of the index record (but within the same tablespace). All of these fields can be classified as large objects. These large objects are either binary large objects or character large objects. The binary large objects do not have an associated character set, whereas the character large objects do.

Within the InnoDB storage engine there is no difference between the way character large objects and binary large objects are handled. Throughout this article we will use the term “BLOB field” to refer to any of the aforementioned field types that can be chosen for external storage.

This article will provide information about the following:

  • Explain when a BLOB field will be stored inline and when it will be stored externally, with respect to the clustered index record.
  • The structure of the BLOB reference.
  • The BLOB prefix that is stored in the clustered index record, when the BLOB is stored externally.
  • Utility gdb functions to examine the BLOB reference and the record offsets array.

The BLOB fields are associated with the clustered index records (the primary key) of a table. Only the clustered index can store a BLOB field externally. A secondary index cannot have externally stored fields. For the purposes of this article, we won’t deal with any secondary indexes.

The Schema

The following example table will be used to present the information:

Note: ௱ – Tamil number one hundred (Unicode 0BF1), ௲ – Tamil number one thousand (Unicode 0BF2)

A single clustered index record can have 1 or more externally stored BLOBs. So for the given table definition of t1, there are 4 possible ways that the BLOB fields of f2 and f3 can be stored:

  1. f2 and f3 are both stored inline within the clustered index page
  2. f2 is stored inline, while f3 is stored externally
  3. f3 is stored inline, while f2 is stored externally
  4. both f2 and f3 are stored externally

In the following sections, let us see which of the BLOB columns are externally stored and which of them are stored inline, for each sample row we created above. Note that the row format of table t1 is not explicitly specified. In MySQL 5.6, it will default to the COMPACT row format.  Please keep this in mind as we discuss the example.

Overview of BLOB Storage

The BLOB data can be stored inline in the clustered index record, or it can be stored externally in separate BLOB pages. These external BLOB pages are allocated from the same tablespace in which the clustered index resides. The BLOB data will always be stored inline whenever possible though. If and only if this is not possible because of the record size, then the BLOB field will be stored externally.  This is true for all of the current row formats: REDUNDANT, COMPACT, DYNAMIC, and COMPRESSED. Let’s now take a look at the storage details for the BLOB columns in our example table.

In MySQL 5.6, the default row format of a table is COMPACT, so that’s what our t1 table is using. The default page size is 16K, so that’s also what we’re using. The maximum record size that can be stored in a 16K page using the COMPACT row format is 8126 bytes. The function page_get_free_space_of_empty() will tell us the total free space available in a page. The value returned by this function, which is then divided by 2, will give us the maximum allowed record size. The division by 2 is required because an index page must contain a minimum of 2 records. Let’s look at an example (the argument value of “1″ tells the function that the row format for the page is COMPACT):

The following table shows the storage details of the BLOB columns for each row in table t1. Keeping in mind the maximum allowed record size, it is clear as to why the 60000 bytes of BLOB data is always stored externally. It simply will not fit within a single clustered index record.

Primary Key
Length of f2
Storage of f2
Length of f3
Storage of f3
13000 bytesInline3000 bytesInline
260000 bytesExternal60000 bytesExternal
360000 bytesExternal4500 bytesInline
44500 bytesInline60000 bytesExternal

Inline Storage of BLOBs

As mentioned previously, no BLOB fields will be externally stored if the size of the record is less than the maximum record size allowed in a page. In our example table, for the row with a primary key value of 1, no BLOB fields need to be stored externally because the full record size is less than 8126 bytes. The following tables give the sizing details for each of the rows in our example table:

Primary Key
Clustered Index Record Size (in bytes)
Fields Moved Out
Record Size After Moving Blob Outside (in bytes)
Maximum Allowed Record Size (in bytes)
16027-60278126
2120027f2, f316038126
364527f253158126
464527f353158126

As we can see, the BLOB fields are stored externally until the record size falls below the limit. In the table above, column 2 gives the initial clustered index record size. If this size is greater than the allowed maximum size of the record (shown in column 5), then the function dtuple_convert_big_rec() is invoked to choose the fields destined for external storage. Column 3 lists the fields that have been chosen by this function for external storage. Column 4 shows the clustered index record size after moving the chosen fields to external storage. Again, this value must be less than the maximum record size, shown in column 5 (which is 8126 bytes in our example).

All of the size details provided above are obtained through the debugger from the callers of this function.

For more clarity, let me explain the clustered index record length for the first row with a primary key value of 1  (shown as 6027 bytes). The length of the user fields f1, f2, and f3 are 4 bytes, 3000 bytes, and 3000 bytes respectively. The length of the system fields DB_ROLL_PTR and DB_TRX_ID are 7 bytes and 6 bytes respectively. The record header stores the length of the f2 and f3 fields, taking 4 bytes (2 bytes for each field). The record header also contains a null bit array, which for this record takes up 1 byte. Lastly, the record header contains REC_N_NEW_EXTRA_BYTES—which is 5 bytes for the COMPACT row format—of additional information. The complete storage details for the final record are presented in the following table:

Field
Length (in bytes)
Total6027
Length of f32
Length of f22
Null bit array1
REC_N_NEW_EXTRA_BYTES5
f14
DB_TRX_ID6
DB_ROLL_PTR7
f23000
f33000

You can refer to the documentation in storage/innobase/rem/rem0rec.cc for more details about the COMPACT and REDUNDANT row formats. The REDUNDANT row format is also explained in the blog article InnoDB REDUNDANT Row Format.

Choosing Fields for External Storage

As discussed above, the function dtuple_convert_big_rec() is invoked to decide which parts of the oversized clustered index record will be chosen for external storage. This function makes use of the following rules to decide this:

  • No fixed length fields can be chosen for external storage.
  • No variable length fields whose size is less than or equal to 40 (2 * BTR_EXTERN_FIELD_REF_SIZE) bytes will be chosen for external storage.
  • No variable length fields whose size is less than the BLOB prefix size will be chosen for external storage.  This means that in the case of REDUNDANT and COMPACT row formats, if the field data length is less than or equal to 768 bytes (DICT_ANTELOPE_MAX_INDEX_COL_LEN), then it will not be chosen for external storage. This rule is not applicable for DYNAMIC and COMPRESSED row formats, because their BLOB prefix is 0 bytes.

In the function dtuple_convert_big_rec(), we examine one BLOB field at a time for potential external storage, and then move it to external storage if it passes the criteria noted above, until the clustered index record size falls within the maximum allowed. Larger fields will be selected for external storage before smaller fields, to ensure that maximum space savings happens in the clustered index page. This ensures that more records can be stored in each clustered index page.

BLOB Reference

When a BLOB field is stored externally, a BLOB reference is stored in the clustered index record. The BLOB reference will be stored after the BLOB prefix, if any. This BLOB reference is 20 bytes, and it contains the following information:

  1. The space identifier (4 bytes)
  2. The page number where the first page of the BLOB is stored (4 bytes)
  3. The offset of the BLOB header within that page (4 bytes)
  4. The total size of the BLOB data (8 bytes)

Even though 8 bytes are available to store the total size of the BLOB data, only the last 4 bytes are actually used. This means that within InnoDB, the maximum size of a single BLOB field is currently 4GB.

Structure of InnoDB BLOB Reference

In the length field, two bits are used to store the ownership and inheritance information, which are not discussed in this article. We will cover that in a subsequent blog article. The most significant bit of the length field is used to store ownership information and the second most significant bit is used to store the inheritance information.

Here is a gdb function to print the contents of a BLOB reference. This function takes a pointer to the external BLOB reference as an argument. The calls to ntohl() are required because all data on disk is stored in network byte order by InnoDB.

BLOB Prefix

When a BLOB field is stored externally, we may also store a prefix of the value in the clustered index record, depending on the row format used. For the REDUNDANT and COMPACT row formats, a BLOB prefix of 768 bytes is stored in the clustered index record. For the DYNAMIC and COMPRESSED row formats, a BLOB prefix is never stored in the clustered index record. The BLOB prefix would be followed by the BLOB reference.

The BLOB prefix, when available, helps to calculate the secondary index key without needing to fetch the externally stored BLOB data (which involves at least one extra page load/fetch). This is possible because the maximum length of a secondary index key is 767 bytes. If we attempt to create a secondary index with a bigger length it will be automatically truncated with a warning. For example, consider the following statement:

This will generate the following warning to the user: “Specified key was too long; max key length is 767 bytes”. The resulting index records of s1 will only be on a prefix of f2, covering only the first 767 bytes.

External Storage of BLOBs

A BLOB field is considered to be stored externally if it is stored outside of the clustered index B-tree, and just a reference to the BLOB (along with any BLOB prefix) is stored in the clustered index record. This means that the size and structure of the clustered index B-tree will not be affected by the externally stored BLOB data.

For the row in our sample table with a primary key value of 2, the BLOB fields f2 and f3 will be stored externally from the clustered index record. This is because their size is too big to fit within the clustered index record. Instead, a reference to the externally stored BLOB data will be stored in the clustered index record. A pictorial view of this external storage is shown below:

Externally Stored Fields in InnoDB

Note: The diagram above shows each clustered index record having exactly one externally stored BLOB. This is just a simplication. Each clustered index record can have any number of externally stored BLOBs.

Let’s verify that the BLOB fields for the row with a primary key value of 2 are indeed externally stored with the help of the debugger. Run the server through a debugger (we used gdb) and issue the query SELECT * FROM t1 WHERE f1 = 2. Put a breakpoint in the function row_search_mvcc(), which searches for rows using a cursor. Within this function, once the offsets are calculated for the selected record, let’s print the offsets to verify which of the columns are stored externally. I’ll use the gdb function ib_print_offsets (provided in the appendix) to examine the offsets array. The (*) in the field output shows that it is in fact stored externally.

The above output shows that the record has 5 fields (3 fields are specified by the user, and 2 are internal system fields, namely DB_TRX_ID and DB_ROLL_PTR). The field number 0 begins at offset 0 and ends at offset 4. The first field begins at offset 4 and ends at offset 10. The second field begins at offset 10 and ends at offset 17. The third field, whose data is stored externally, begins at offset 17 and ends at offset 805. The fourth field, whose data is stored externally, begins at offset 805 and ends at 1593. This information is presented in a tabular format below.

Field Number
Description
External
Begin Offset
End Offset
Length
0The field f1 (Primary Key)No044
1DB_TRX_IDNo4106
2DB_ROLL_PTRNo10177
3The field f2Yes17805788 (768 bytes blob prefix + 20 bytes of blob reference)
4The field f3Yes8051593788 (768 bytes blob prefix + 20 bytes of blob reference)

The length of the fields f2 and f3 are each 788 bytes. This includes the 768 bytes of the BLOB prefix, and the 20 bytes for the external BLOB reference. The gdb function that I used to examine the offsets array is provided in the appendix.

The BLOB reference can be examined by using the gdb function ib_print_blob_ref (which we provided earlier). It takes the BLOB reference as an argument. The BLOB reference of field f2 is available at rec+805-20, where 805 is the end offset of f2 and 20 is the size of the BLOB reference. The BLOB reference of field f3 is available at rec+1593-20, where 1593 is the end offset of f2 and 20 is the size of the BLOB reference. The output is provided below:

The length of the externally stored part of the BLOB is 59232 bytes. The BLOB prefix is 768 bytes. So the total length of the field is 59232 + 768 = 60000 bytes, which is the size of both the f2 and f3 fields.

Conclusion

In this article we saw when the BLOB data will be stored inline, and when it will be stored externally. We also saw the rules applied in choosing particular field data for external storage. In a subsequent blog article, we will cover further details regarding BLOB handling within InnoDB.


Appendix: A gdb Function to Examine the Offsets Array

Within InnoDB, to manipulate and access the various fields of a record, an offsets array is calculated by the function rec_get_offsets(). This function returns an offsets array of type ulint*. This array contains the following pieces of information:

  1. The number of fields in the index.
  2. Whether the record has any externally stored fields.
  3. The offset at which each of the fields begin and end.
  4. Whether the row format is COMPACT or REDUNDANT. Note that DYNAMIC and COMPRESSED row formats are a special type of the COMPACT row format. In other words, a DYNAMIC row format is a COMPACT row format plus additional rules. And the COMPRESSED row format is a DYNAMIC row format with additional rules.
  5. In a debug build, it also contains the rec pointer and the index pointer.

Here is the gdb function referenced earlier that can be used to examine an offsets array in a debug build of MySQL:

Authors

This article is jointly authored by Annamalai Gurusami (கு அண்ணாமலை), Aditya and Thirunarayanan Balathandayuthapani of the MySQL Server Sustaining Team, Bengaluru.

I would also like to thank Marko Makela and Matt Lord for their reviews and support. They helped to improve this article to a great extent.

 

5 thoughts on “Externally Stored Fields in InnoDB

  1. Excellent overview. Thanks!
    Max record size of 8126 bytes looks confusing. If so, two records wouldn’t fit into a page. Do I miss something, I thought it should be < 8k-200 ?

    include/btr0btr.h:

    /** Maximum record size which can be stored on a page, without using the
    special big record storage structure */
    #define BTR_PAGE_MAX_REC_SIZE (UNIV_PAGE_SIZE / 2 – 200)

  2. Hi Aleksandr, Thanks for your comment. The macro BTR_PAGE_MAX_REC_SIZE is inaccurate. This macro is not used in the code (except in one debug assert). I’ll follow it up and ensure that this macro is removed. The correct way to calculate the maximum row size is as given in the article above (by making use of the page_get_free_space_of_empty() function).

  3. I think it would be wonderful to give more control over field store.

    There are number of cases where I have the field which size might range say from 4K to 64K and Innodb will store it inline in many cases together with say 200 bytes of other field.

    This causes performance problem when BLOB is rarely accessed as it consumes most of the space in the database pages and as such much more pages has to be accessed than non blob portion of the data is scanned. Of course one can set it out in separate table and create a view but it is ugly.

Leave a Reply

Your email address will not be published. Required fields are marked *


− 2 = four

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">