Row numbering, ranking: how to use LESS user variables in MySQL queries

User variables, even though not part of standard SQL, are great objects in MySQL. They allow to keep some “state” for the life of a session: a session can execute a statement to compute a value, store that value in a user variable, and use it in all next  statements. This avoids recalculating the same value in each of the next statements, thus gaining  efficiency.

For example, let’s simulate a lottery: pick up a random user, store him in the table of winners, and send him a congratulations email:

Using @winner has allowed us to run the sub-query-with-RAND once instead of twice; that’s faster, and more correct: running it twice would have generated potentially two different winners for the INSERT and the email sending, causing a logic error in our design.

This is all nice and neat. The above example clearly separates phases:

  • phase 1 sets the @winner variable
  • phase 2 reads @winner in next two statements.

But it’s also possible to set and read a user variable in a single statement. The use case is: to have a “moving value” which changes for every row, based on the value which it had for the previous row (this assumes a notion of “previous row” defined by some ordering).

Here’s a classical example which generates row numbers (1, 2, …) according to some order. I have a table of famous people:

I want to order and number them by birth date:

It works.

I add: surprisingly. Let me explain.

Standard SQL says that we should join tables (here we have only one, people), put the result of the join in an imaginary table, then evaluate selected expressions for each row, forming yet another imaginary table, then sort that table and return it to the user.

Yes, in theory ordering is the very final stage of execution, after evaluation of the selected expressions for all matching rows (which is logical, as ORDER BY is allowed to reference selected expressions by alias). So, if we follow this, we read people, evaluate selected expressions for each row, so calculate a row number for each row, and finally we order the result by birth date. Being calculated before ordering, row numbers would have no reason to be according to the order.

Fortunately, here, MySQL sees that there is only one table and that the ordering expression is a column of this table, so it makes an optimization: it first orders the rows of people, then reads this ordered result and, for each row, calculates selected expressions (including @rownum). So @rownum increases in order, as desired. But, as noted, it’s relying on an ORDER BY optimization done by MySQL, which is not guaranteed to remain constant in the future (who knows? we revisit heuristic and cost-based optimization decisions from time to time). Moreover, if later an application developer modifies our query, this modification may lead MySQL to change its optimization strategy and do ordering after evaluating the selected expressions, totally breaking @rownum’s logic. To demonstrate it, let’s say this unaware developer just wants to add to our query a column displaying the full name of the country of citizenship – pretty innocent, uh? He first makes a table with full names:

and now he edits our row-numbering query, joining people with country:

num is plain wrong! Because the Optimizer has chosen a plan which does ordering last, thus after evaluation of user variables. A non-limiting list of factors which can disturb num are: join order, use of join buffering, use of temporary table, GROUP BY…

Going further… to avoid the round-trip which sends the initializing SET statement, some clever users (no irony on “clever”) bundle two in one with this trick:

It works. It relies on the fact that MySQL evaluates the content of the derived table named initialization. But in the future, MySQL may become smarter and say that “this derived table isn’t used by the statement so I can just skip its evaluation”. And it would thus skip the initialization.

The point I’m making is: these queries make assumptions which may hold today, but are not future-proof. If they hold today we have no conscious intention to break them, but one day we unavoidably will, as old code gets refactored, as new SQL features are implemented, as we come closer to the SQL standard. What has never been explicitly guaranteed may change. The accidental change may come from our side, or from your side (see the example with the citizenship’s full name and the join).

Our manual cautiously warns about this:

” As a general rule, other than in SET statements, you should never assign a value to a user variable and read the value within the same statement. For example, to increment a variable, this is okay: SET @a = @a + 1. For other statements, such as SELECT, you might get the results you expect, but this is not guaranteed.”

I hear someone saying: “you’re criticizing, but there’s no other way to get numbered rows in MySQL!“. Here are good news: in MySQL 8.0, there finally is another way: window functions. Here’s a query using the ROW_NUMBER window function:

It gives the right result. If you are new to window functions’ syntax, you may want to check the manual or Dag’s blog.

Unlike for the query with a user variable, adding a join to the query with a window function doesn’t disturb num at all:

A similar use case is getting a row number inside the group; let’s say I want to group by sex; for that, clever users have written the following query, which detects if the sex is the same as the previous row; if it is, we’re in the same group as the previous row so increase the counter, otherwise we are entering a new group so reset it to 1:

It works. But it relies on the fact that MySQL sets @rownum before setting @prev_sex, i.e. evaluates selected expressions from left to right. Which is true for this query, but not for all queries. As the manual says: “the order of evaluation for expressions involving user variables is undefined“. The same result is now achievable with a PARTITION sub-clause for the window:

What are the benefits of the new approach with window functions? First, it’s a much shorter query. Second, it’s standard-compliant so:
– the result is predictable and guaranteed in future versions,
– the query is easily portable to other database software,
– the query is easily understood by database developers not yet familiar with MySQL.

Possibly some especially tricky and smart queries still need user variables read-and-set techniques, or possibly some perform better than their equivalent with window functions (please, let me know about them… we’re eager to know what to optimize).

But as a rule of thumb, this old black-magic practice should not be necessary anymore, and is advantageously replaced with window functions ROW_NUMBER, RANK, LEAD, LAG, FIRST_VALUE

That’s it for today. Thank you for using MySQL!

4 thoughts on “Row numbering, ranking: how to use LESS user variables in MySQL queries

    1. Hi Daniël. Note that the post is about generation of row numbers and ranks for a SELECT result, not about generation of persistent numbers to be stored physically. For bug 1214, reading the discussions there, it seems to have ended with something like “MySQL please support a UUID-like column type”. A much requested feature, indeed. Until it’s done, BINARY(16) and UUID_TO_BIN(UUID()) provide a workaround. For 45362, if the wanted ROWID has to be unique across tables and servers then UUID() is a solution; if it only has to be unique inside the table, an AUTO_INCREMENT column does the trick, I guess.

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.