MySQL 5.7: only_full_group_by Improved, Recognizing Functional Dependencies, Enabled by Default!

Here’s an account of recent work which has kept me busy and excited for a few months. For those unfamiliar with the only_full_group_by sql mode, let me provide some context. I’ll use the world database, which can be downloaded from this MySQL Documentation page. You can find details on how to install the database on the same page. We have a table of countries and a table of languages spoken in each country. It’s a 1-to-N relationship: a country can have many languages so CountryLanguage.CountryCode is a foreign key referencing Country.Code:

Here are two sample rows from the Country table:

And corresponding rows in CountryLanguage:

Consider this grouped  SELECT:

This SELECT groups by “Language”, so the first thing it has to do is “segregate” rows by language:

Then it must collapse (“aggregate”) all rows of the same group to form one row; this row should have, according to the columns listed in the SELECT, a value for “Language” and a value for “CountryCode”. For the “French” group, the value for “Language” is obviously “French”; but what is the value for “CountryCode”: “BEL” or “FRA” or something else? That’s non-deterministic, and that means the query is likely nonsensical.

The SQL-92 Standard solved this non-determinism by requiring that all selected columns be part of the GROUP BY clause. So it would then reject the example query above.

Historically, MySQL took a different route: it allowed the query above, and picked one value of “CountryCode” among the group. So you may obtain
(French, BEL), (Spanish, ABW)
or
(French, FRA), (Spanish, AND)
depending on MySQL’s implementation details, the access paths chosen to read the table, and subject to change in future versions; so in practice it was simply unpredictable.

In other words, users were allowed to execute non-standard SQL queries and it was up to them to think and decide whether the query was non-sensical or not. That made it far too easy for users to make mistakes that resulted in incorrect data presented to the user/application. Imagine that the user meant this reasonable query:

tin order to know how many countries speak each language. Alas he/she forgot the COUNT word, writing just “(cl.CountryCode)”, and the query accidentally became the bad one which we saw earlier where we produce incorrect and unpredictable results. MySQL didn’t raise any error or warning here, which is not being helpful at all in this scenario.

Users wary of this “dangerous permissive behaviour” would prefer MySQL to complain here, that’s why the only_full_group_by flag was added to the sql_mode server variable years ago. When this sql_mode was set, then MySQL rejected the bad query by implementing the behaviour prescribed by the SQL-92 standard. Here’s a test to demonstrate this behavior in MySQL 5.6:

See: MySQL 5.6 requires that all SELECTed columns be in the GROUP BY clause; “CountryCode” is not, so an error message signals it.

Using this new safe GROUP BY mode was rightfully advertised in the manual, and by users such as http://mechanics.flite.com/blog/2013/02/12/why-i-use-only-full-group-by/
or http://codeascraft.etsy.com/2013/03/19/the-perils-of-sql_mode/.

Still, this safe mode wasn’t enabled by default. I wasn’t around at that time (as it was more than twelve years ago), but I can speculate that there were the following obstacles:

  1. Certain non-standard queries can be perfectly reasonable; for example this one, which wants a list of country names and how many languages are spoken in each country:

    In a group of rows of the same co.Code value, all rows have the same co.Name because co.Code is a primary key of the Country table. So co.Name is deterministic, but it’s not in the GROUP BY clause so SQL-92 and 5.6’s only_full_group_by mode rejected it.
  2. So enabling the mode would make some users’ life unduly difficult (the kind of user who “knows what they are doing”); they would have to put all selected columns in the GROUP BY clause, thus making this clause more  complex and thus increasing the chances that MySQL switches from an index scan to a table scan plus a filesort and a temporary table, thus making the query far slower.
  3. The Standard itself evolved to accomodate these reasonable queries! Starting from SQL-99, the concept of “functional dependency” was introduced; a column B is functionally dependent on columns {A1, A2} in a result set if all rows which share the same value of A1 and the same value of A2 necessarily share the same value of B. Then SQL-99 accordingly modifies the constraints on GROUP BY: by default it keeps the same rules as in SQL-92, but it allows a DBMS to optionally use a relaxed rule, specifically that all columns in the SELECT list merely need to be functionally dependent on the GROUP BY columns. That is known as “SQL optional feature T301” and it guarantees that inside a group of rows, all selected columns have identical values, thus making the query’s result deterministic. That is exactly the case of the last query we showed where “co.Name” is determined by “co.Code”. Of course the old SQL-92 rule satisfies the relaxed rule: if all selected columns are part of the GROUP BY clause then they are functionally dependent on the GROUP BY columns. So we can say that the relaxed rule allows all SQL-92 legal queries, plus some new ones, while preserving determinism.

That was well explained by my former colleague Roland Bouman in
http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html .

For us then, it was a problem. The safe mode, only_full_group_by, was too restrictive and thus we could not make it the default. But that also meant that our innocent users were left prey to the potentially dangerous default mode. Once in a while a user would come to us with a strange result and we would have to explain, once again, that it’s a non-deterministic query and she/he should try only_full_group_by.
Another headache was that our QA team, who runs many thousands of automatically generated queries every day, was also experiencing this non-determinism; query result differences between, say, MySQL 5.5 and 5.6, would be flagged as regressions, though it was just a non-deterministic query with a GROUP BY. To avoid these false positives, QA would be running with only_full_group_by, which alas limited them to queries which were only a subset of what our users were allowed to do with the default permissive setting. So we were losing some testing depth, which was obviously not desirable.

Good news — those days are now over!

In MySQL 5.7.5, I have made only_full_group_by detect functional dependencies, which can be inferred not only from keys’ definitions as in the example above, but also from column equalities in the WHERE clause or [LEFT] JOIN ON clauses and from selected expressions in a view’s or derived table’s body. It was not an easy job, and it took a while to get it right, especially with outer joins and with views. Along the way, I even found a couple of small bugs in the relevant section of the SQL-2011 Standard, and thanks to the help from a colleague who has access to the Standard’s redaction committee, those bugs should be corrected in the next version of the SQL Standard.

The different types of functional dependencies are best explained with examples; so together with my colleague Paul DuBois we put them in a new section within the 5.7 manual. I really encourage you to read it as it shows well when columns are determined by other columns and when they are not, in spite of sometimes deceptive appearances! We used the world database there too, and our queries are understandable — so please check it out!

And, for a higher-level view, this is the master page which describes the new behaviour of only_full_group_by: http://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html .

Because this mode is now smarter than before, there is very little reason not to use it. So I have made it a default in 5.7.5 and later. This means that some old queries might return an error when upgrading. Should this happen, please check them; quite possibly they are non-deterministic and should be corrected (remember the COUNT() example earlier in this post). If they are deterministic, and MySQL simply missed it, or if you really want non-determinism, a new function called ANY_VALUE has been added to force MySQL to accept the intentionally/knowingly non-deterministic query. Of course, you can also take the more brutal approach and turn only_full_group_by off in your session or globally.

Several prominent DMBSs that I know continue using the SQL-92 rule, while another one detects functional dependencies due to keys only. I think MySQL is now the DBMS with the most complete implementation, inferring functional dependencies from keys, equalities, and in views. We don’t have everything that’s described in SQL-2011, but we’re not far.

Implementing this feature was also an occasion to refactor the relevant code. It used to be scattered in several functions, accumulating pieces of information as resolution of the query progressed (in Item::fix_fields() calls), and drawing a conclusion in a final phase of resolution. Consolidating all this into one single entity of code, which looks only at the fully resolved query, made things clearer and automatically fixed an obscure bug with views. The main code file is sql/aggregate_check.h and it has an extensive Doxygen multi-page comment to explain the logic.

Here’s another eliminated annoyance: in only_full_group_by mode, using an alias of a SELECT list element in the HAVING clause was forbidden. That was known as http://bugs.mysql.com/bug.php?id=51058, which I have also now fixed as part of this work. You can now do this in only_full_group_by mode:

That’s far more readable than copying the full COUNT() expression into the HAVING clause and it’s also a bit faster to execute.

There was one last fix done as part of this Worklog. That is having to do with some rare non-grouped queries of the form SELECT DISTINCT ... ORDER BY also being non-deterministic, but this post is already long so I’ll discuss that in a subsequent post. Stay tuned!

That’s all for now. I really hope that this new feature makes users’ lives easier and improves their experience with MySQL! We also look forward to your feedback on this new feature! If you have any questions or encounter any bugs, please do let us know by opening a support ticket or filing a bug. As always, THANK YOU for using MySQL!

16 thoughts on “MySQL 5.7: only_full_group_by Improved, Recognizing Functional Dependencies, Enabled by Default!

  1. Hi Guilhelm!

    wonderfully explained. Very excited about the range of the fd detection improvements – a lot of new stuff and all very useful. Thanks again for taking these challenges on and describing them here. 🙂

    warm regards,

    Roland

  2. Guilhelm, I am amazed by the work you guys are doing. I agree with you that this new implementation is probably the most complete and correct. Will test it as much as I can and will share it.
    Btw I have A question because I am always suspicious about exceptions, what exactly ANY_VALUES does, what kind of rules relaxing it implements and how to identify if this can be “dangerous” before executing the query?
    Thanks for all the work done …

  3. @Roland, Devki: thanks for the appreciation!
    @Marco: sorry I didn’t reply earlier [some misconfiguration of notifications]. About ANY_VALUE: anything it wraps is not checked by only_full_group_by logic. So you would use it in the SELECT list (or perhaps in the ORDER BY list) to say: “yes I know this expression might be randomly chosen among the group, I just want *any value* of this expression among the group”. It’s also described at
    http://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_any-value

  4. Thank you so much for this work, and for making only_full_group_by the default. Now if only you’d make it impossible to disable, I would be even more happy. In general I think you should try to stop people from writing bad queries, it’s better to educate the user than to guess at his intentions!

    1. @Kalle: you’re right; we can’t take away the ability to turn off only_full_group_by, yet, as such setting is necessary for those who have legacy applications and currently don’t have the time/possibility to rewrite offending queries. But over time we may deprecate the setting and eventually remove it. I imagine, if we don’t get lots of complaints about the new 5.7 behaviour (only_full_group_by is now on by default), we could absolutely consider doing that in future versions. Using any_value() would still be a solution for problematic queries.

  5. Great work Guilhem! Happy to see that you have started to understand some of the darker part of the optimizer!

  6. Let’s consider 2 queries:

    — Are there any countries with more than 1 language
    SELECT exists(SELECT * FROM CountryLanguage cl GROUP BY cl.countryCode HAVING count(*) > 1);

    — How many countries have more than 1 language
    SELECT count(*) FROM (SELECT * FROM CountryLanguage cl GROUP BY cl.countryCode HAVING count(*) > 1) t;

    1) Both are deterministic, irrespective of the implementation used
    2) Both will not work with only_full_group_by enabled
    3) We have lots of clients still using MySQL 5.6 so we will not be using any_value(…) anywhere
    4) In order to use these types of queries with only_full_group_by, we would most likely change the ‘SELECT * FROM’ to ‘SELECT 1 FROM’ or to ‘SELECT cl.countryCode FROM’ or to ‘SELECT NULL FROM’
    5) We would need to consider any performance implications of doing this (one might assume that MySQL would optimise all of these in the same way)
    6) Checking huge amounts of code for this type of thing will be challenging as developers might not have even used an explicit GROUP BY, for example ‘SELECT max(cl.countryCode), cl.language FROM CountryLanguage cl;’ which would also be affected by having only_full_group_by enabled. I cannot think of a good reason why anyone would do that, but we wouldn’t want any errors happening for clients.
    7) There also seems to have been some unintended side effects in 5.7.15, such as the following query no longer working with only_full_group_by enabled
    SELECT DISTINCT c.code, c.name, cl.language
    FROM Country c
    INNER JOIN CountryLanguage cl ON cl.countryCode = c.code
    ORDER BY cl.countryCode;

    Some questions:
    1) What are the performance implications of having only_full_group_by enabled compared to having it disabled in 5.7?
    2) Is it reasonable that we should change all of the queries such as the top ones?
    3) Was 7) above intended? (I have not seen it documented anywhere)

    1. Hi. Here’s a reply to numbered questions:
      1) disabling only_full_group_by means the server will not have to check conformance of queries; so it will save time, but likely only very small time.
      2) In my view, yes. The Standard is fairly clear, and with MySQL’s permissive rules of the past, we regularly had bug reports from users who were having a non-deterministic query (but never noticed it was) and reported that its results were varying. We hope most of our users will work with only_full_group_by, as it’s a safe default.
      3) Yes. It’s a requirement in the Standard and it makes sense in general. Because there is a column in ORDER BY which isn’t in the SELECT list of columns, so it has to be silently added to the SELECT list (as a “hidden column”) to be available at sorting time, but if in the SELECT list, should it be subject to the DISTINCT? If we choose to reply “yes”: but then if there are two rows with equal c.code, c.name, cl.language and different cl.countryCode, those will both appear in the result, and be identical (as the differing column is hidden), violating DISTINCT; if we choose to reply “no”: but then there is non-determinism of which cl.countryCode is retained when one of the two rows is eliminated. In your case, the ON clause avoids those problems, but the Standard doesn’t require that we detect that (it does require it for GROUP BY functional dependencies, and we satisfy the requirement).
      This was mentioned in the changelog of 5.7.5 “Queries of the form SELECT DISTINCT col1 … ORDER BY col2 qualify as forbidden by SQL2003 (hidden ORDER BY columns combined with DISTINCT), but were not rejected with the ONLY_FULL_GROUP_BY SQL mode enabled.” http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-5.html

      1. Hi,

        Thanks for the replies.

        In that case it would be our strong preference that we are never forced to have only_full_group_by enabled.
        We would always want our clients to have this disabled (along with each of the other sql-mode options) in order that they have the best performance possible, and we would not want an error to be raised on a client system simply because a developer has written a query in a slightly different way than only_full_group_by is allowing even though it would always give the same results every time irrespective of implementation.
        Furthermore we would prefer that this option is disabled by default, so that we don’t accidentally end up with it enabled on client systems.
        We have been using MySQL for a very long time, and so it is unrealistic for us to make sure that every single one of our queries would work with only_full_group_by enabled.

        That said we will most likely be encouraging developers to have only_full_group_by enabled in 5.7 and on when they are using their development environments, in order that it helps them avoid writing queries without clearly defined results. This is something that obviously was not possible for us to do in 5.6 as only_full_group_by was much too restrictive (even though it allowed some things that the 5.7 version does not).

  7. I was hit with this for this query, any guidance on how to rewrite the query so it doesn’t trigger this error or have a link where I can get some info on how to rewrite this?

    SELECT CONCAT(LEFT(option_name, LENGTH(option_name) – LOCATE(‘_’, REVERSE(option_name))), ”) as Name,
    COUNT(1) as Counts
    FROM wp_options WHERE option_name LIKE ‘\_%\_%’
    GROUP BY LEFT(option_name, LENGTH(option_name) – LOCATE(‘_’, REVERSE(option_name)))
    ORDER BY Counts DESC;

    1. How about
      SELECT CONCAT(LEFT(option_name, LENGTH(option_name) – LOCATE(‘_’, REVERSE(option_name))), ”) as Name,
      COUNT(1) as Counts
      FROM wp_options WHERE option_name LIKE ‘\_%\_%’
      GROUP BY Name
      ORDER BY Counts DESC;

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.