MySQL 8.0: When to use utf8mb3 over utf8mb4?

Long time MySQL users will recognize that there are two varieties of utf8 support in MySQL; utf8mb3 and utf8mb4.  Let me dig a little bit deeper in explaining the history between the two:

  • MySQL 4.1 (2004) was the first version to support character sets and collations.  The default character set was latin1, but utf8[mb3] was available as an option.  An optimization was chosen to limit utf8 to 3 bytes, enough to handle almost all modern languages. 
  • MySQL 5.5 (2010) added support for up to 4 byte utf8 using the new utf8mb4 character set.
  • MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC.  This allows for indexes on VARCHAR(255) with utf8mb4; something that made migrations more difficult prior.
  • MySQL 8.0 (In development) vastly improves the performance of utf8mb4, as well as adding several new collations.  It is now the default character set for MySQL.

Is utf8mb3 still faster?

With the original purpose of utf8mb3 being a performance optimization, the next question is, does this still yield true today?  The short answer is no; the new utf8mb4-based collations are much faster than any of the old utf8mb3-based ones:

utf8mb4 shown in red.  Results in transactions per second; higher is better.

We expect cases where utf8mb3 is faster to be quite rare, and any such case will be considered a bug 🙂

Making the case for utf8mb4

If the performance gains in MySQL 8.0 aren’t enough to entice you, perhaps these additional points will:

  • Even for English speaking markets, the prevalence of emojis as character input is driving adoption of utf8mb4 over utf8mb3 and latin1.
  • We have improved our collations to account for a number of language specific sorting rules.  The collations for utf8mb3 are correct for the common cases, but the devil is in the details.
  • The new collations also support accent and case sensitivity.
  • Even in Asia, we are seeing adoption of utf8mb4 over CJK character sets, largely because it supports a super-set of possible characters.  We also now support a collation specific to Japanese.

Future Steps

As we no longer see a strong use-case for utf8mb3, we intend to mark it as deprecated in MySQL 8.0.  Because upgrading from earlier character-sets requires tables to be rebuilt, we expect that it may be some it time before we are able to move from deprecation to removal.  However, in making this first step we are communicating that it is a legacy feature that should no longer be used in new applications.

15 thoughts on “MySQL 8.0: When to use utf8mb3 over utf8mb4?

  1. I can’t remember anyone saying CHARACTER SET utf8mb3. Will the deprecation be just “utf8mb3” or also “utf8”? utf8 will be prevalent (legacy) for a long time.

  2. “MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC. This allows for indexes on VARCHAR(255) with utf8mb4;”

    MySQL 5.6 also allows indexes on VARCHAR(255) with utf8mb4. It’s not by default but you can enable options.

    1. Indeed. This statement refers to the sentence that precedes it:
      > and also changed InnoDB’s default row format to DYNAMIC

      To be pedantic, it requires a change in row format, large prefix and file format. This is a lot to ask of a user – which is why it is important to point out the change in MySQL 5.7.

    1. Hi Olaf,

      This is a slide from Tomas’ keynote from Percona Live a few weeks ago 🙂

      The improvements are largely the result of the new collations, so I would expect the results from MySQL 8.0 utf8mb3 < -> MySQL 8.0 utf8mb4 to be similar.

  3. My previous data convert into special characters, unable to convert or change into correct characters

    mysql> SHOW VARIABLES LIKE ‘character\_set\_%’;
    +————————–+——–+
    | Variable_name | Value |
    +————————–+——–+
    | character_set_client | latin1 |
    | character_set_connection | latin1 |
    | character_set_database | latin1 |
    | character_set_filesystem | binary |
    | character_set_results | latin1 |
    | character_set_server | utf8 |
    | character_set_system | utf8 |
    +————————–+——–+
    7 rows in set (0.29 sec)

    Correct script
    Bienvenu à l’École Mondiale de la Bible

    After converted script
    Bienvenu lcole Mondiale de la Bible

    mysql> INSERT INTO users (id, name, lastname)
    -> VALUES(20, “Bienvenu “, “INTOL cole”);
    Query OK, 1 row affected (0.27 se
    mysql> select * from users;
    +—-+———–+————+
    | id | name | lastname |
    +—-+———–+————+
    | 20 | Bienvenu | INTOL cole |
    +—-+———–+————+
    1 row in set (0.27 sec)

    1. Hi. There’s a lot of things that may go wrong when dealing with character sets, and your description does not have enough detail (how the table was created, the settings in your environment, which server and client versions you used etc.) to pinpoint the problem. But from you what you write, I can come up with the following theory:

      You inserted data into the table in a 5.7 database using the 8.0.2 client. If that is the case, you have hit this bug: https://bugs.mysql.com/bug.php?id=86421

      There are two possible workarounds: 1) Do SET NAMES utf8mb4; before inserting the data, or 2) Use a 5.7 client when communicating with a 5.7 database.

  4. Pingback: URL

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.