Supporting all kinds of outer references in derived tables (lateral, or not)

(Image credit: Pixabay).

In my earlier post, I showed how MySQL, since version 8.0.14, has support for LATERAL derived tables. With LATERAL, a JOIN can have a second table – a subquery-based derived table – be defined based on values from columns of the first table, and thus be re-calculated for each row of the first table. Typically:

We say that in the second table (derived), t1.col is a “lateral outer reference” to the first table t1. The word “lateral” hints at the fact that the referenced table is placed “next to”, “on the side of” the derived table (i.e. both are part of the same FROM clause). The arrow goes “laterally”.

While implementing this LATERAL feature, I added another related one at the same time: support for non-lateral outer references in derived tables.

A formal definition of the feature may be too dry matter here; the design doc (here) would be ever dryer. I’ll rather show a practical example, so you can see what it’s useful for.

Let’s pick some sample hierarchical data, which I already used in some earlier posts:

And let’s ask the question: how many direct and indirect reports does each person have?

Here’s my solution:

Description: for each employee (each row of ’emp’ – line 13):

  1. evaluate a scalar subquery (lines 2-12) which:
  2. builds a CTE (lines 3-10), by recursively finding all direct and indirect reports of the employee
  3. counts the rows of the CTE (line 11), substracting one to not count the employee
  4. returns the count.

Result:

And that brings me to the feature announced in the introduction. Please look at the CTE’s definition: it starts recursion from “SELECT emp.id” which is a reference to the current employee we want to count for; this “emp.id” comes from a row of “emp”, which comes from outside of the CTE.
If we draw an arrow from “reference” to “referenced column”, this arrow starts from the CTE, traverses its boundaries, traverses the surrounding scalar subquery’s boundaries too, and reaches to the top query. That is why it’s not a lateral reference.

Before MySQL 8.0.14 this wasn’t possible (MySQL would complain that it doesn’t know what “emp.id” is, in the CTE’s definition).

Now MySQL detects this reference; it concludes that the scalar subquery and its contained CTE must be re-calculated for every row of “emp.id”.

Looking at EXPLAIN for our query:

we see that MySQL has recognized that the scalar subquery is “dependent” (depends on outer data), and the same is true for the derived table. It has also seen that the content of the UNION in the CTE is “uncacheable” – cannot be “cached”, must be re-calculated every time.

To recap, starting from MySQL 8.0.14:

  • by default, when parsing a derived table’s definition, MySQL accepts non-lateral outer references, as in the example query above.
  • if you add the LATERAL keyword, MySQL also accepts lateral outer references; in other words, it additionally searches in the FROM clause containing the derived table.

Note: there are other solutions to the reports-count problem. One solution is to build a large result in one pass using one recursive CTE, to list all connections between all employees and each of their indirect managers, then use this large result to aggregate
for each manager; I had coded it at the end of this post. It works, but it’s hard to read.  What  we do above, instead, is to generate smaller sets from the hierarchy, piece by piece. So it’s “walk a part of the hierarchy / aggregate / repeat” instead of “walk the whole  hierarchy / aggregate”.

That’s all for today. I hope I have demonstrated that the new feature gives you more expressive power when writing queries, and of course one less “oh apparently the DBMS doesn’t like it”  😀

Thank you for using MySQL!

Leave a Reply