MySQL 8.0.1: [Recursive] Common Table Expressions in MySQL (CTEs), Part Four – depth-first or breadth-first traversal, transitive closure, cycle avoidance

CTEs and Recursive CTEs appeared in MySQL 8.0; first in a Labs release and now in the official release 8.0.1. Several blogs have been published: herehere and here ; my colleague Øystein also wrote about how using a CTE made one DBT3 query run twice faster.

Today I will continue the series, focusing on these points of recursive CTEs:

  1. obtaining depth-first or breadth-first ordering of the result
  2. doing simple cycle elimination with UNION DISTINCT, to compute a transitive closure
  3. achieving more complex cycle elimination.

Depth-first or breadth-first

I already covered this previously (in paragraph retrieving a full tree) but let’s recap the problem and solution. We assume we have a table of people with a parent-child
relationship:

This is a part of the genealogy of some ancient French kings, around the 8th century AD. In a picture:

Now we want to know all descendants of Thurimbert:

Breadth-first ordering means we want to see all children (direct descendants) first,  grouped together, and only then should their own children appear. This is what the query returned, but you should not rely on it (per the usual rule in SQL: “no ORDER BY specified? no predictable order!”). As direct descendants all appear at the first iteration, we just need to count iterations in a new “level” column, and sort by it:

Depth-first ordering means we want to see children grouped immediately under their parent. For that we build a “path” column and sort by it:

I had already mentioned the necessity of CAST here : the type of the “path” column is determined from the initial SELECT, and it has to be long enough to fit the path of the deepest leaf in the tree: CAST ensures that.

In the SQL standard, there is an optional SEARCH BY clause to request breadth-first or depth-first ordering; as we saw above, it is easy to emulate in MySQL with a “level” or “path” column.

Computing transitive closures with simple cycle avoidance

In year 2300, the Earth is overcrowded and people are encouraged to emigrate to nearby planets by boarding on these space rockets:

Note that the rulers of the Earth have intentionally not set up any way to return from those planets back to the Earth.

Let’s list all destinations which can be reached from the Earth: we first find the planets which can be directly reached, then we find what can be directly reached from these first planets, and so on:

Now it is year 2400 and the population of the Earth has decreased so much that the rulers decide to bring some emigrates back, so they set up a new rocket from Saturn to Earth:

Let’s repeat our query to list all destinations which can be reached from the Earth:

The query seemed to never finish, I had to interrupt it. Why is it so? We have a cycle in our data: starting from the Earth, we add Mars, then Jupiter, then Saturn, then the Earth (because of the new rocket), so we’re back to the beginning – the Earth. Then we add Mars, and so on, forever.

Setting max_execution_time will surely ensure that such runaway query is automatically killed, but… how do we modify the query to work properly, i.e. not run away?

A simple solution is to not add a planet to the result if it’s already there. This elimination of duplicates is done by using UNION DISTINCT instead of UNION ALL:

More complex cycle avoidance

So we have a nicely working query with UNION DISTINCT. Let’s extend it just a bit, to additionally show the time taken by each trip (pretty innocent idea, right?):

What? again it never ends? Indeed, consider what the first generated rows must be:

  • Initial SELECT: Mars, 2
  • Iteration 1: Jupiter, 5 (2+3)
  • Iteration 2: Saturn, 9 (5+4)
  • Iteration 3: Earth, 18 (9+9)
  • Iteration 4: Mars, 20 (18+2)

The time of the Earth->Mars->Jupiter->Saturn->Earth->Mars trip is 20, while the time of the Earth->Mars trip is 2; the two trips now form two different rows as the their differing time is part of the rows: (Mars, 2) and (Mars, 20). UNION DISTINCT will not help with cycle elimination, anymore, as it eliminates only rows for which all columns match… So (Mars, 20) is put into the result, and it leads to (Jupiter, 23) and so on.

The SQL standard has an optional CYCLE clause for that, which we can easily emulate: the solution is to build a “path” column (as in the depth/breadth-first paragraph above), and break when we see we’re adding something which is already in the path. We needn’t use DISTINCT anymore so we use ALL which avoids the overhead of (pointless) duplicate elimination:

Note that if the “row keys” (the planet’s name, here) are allowed to contain a comma, it may be safer to encode the keys when adding them to the path, for example with TO_BASE64 or HEX, so that FIND_IN_SET works reliably.

If, in your particular use case, cycles are considered an anomaly which you want to hunt for, a variant of our query will flag them. We just need to allow the cycle to enter our result and advertise itself (with a new “is_cycle” column), but we still must prevent the query from running forever, by not iterating on rows where is_cycle=1:

A “1” in the “is_cycle” column points us to the cycle.

That’s all for today. I hope the above tricks will help you write powerful queries to navigate your hierarchical or graph data. As always: Thank you for using MySQL!

One thought on “MySQL 8.0.1: [Recursive] Common Table Expressions in MySQL (CTEs), Part Four – depth-first or breadth-first traversal, transitive closure, cycle avoidance

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.