MySQL 8.0.1: [Recursive] Common Table Expressions in MySQL (CTEs), Part Four – depth-first or breadth-first traversal, transitive closure, cycle avoidance

CTEs and Recursive CTEs appeared in MySQL 8.0; first in a Labs release and now in the official release 8.0.1. Several blogs have been published: herehere and here ; my colleague Øystein also wrote about how using a CTE made one DBT3 query run twice faster.

Today I will continue the series, focusing on these points of recursive CTEs:

  1. obtaining depth-first or breadth-first ordering of the result
  2. doing simple cycle elimination with UNION DISTINCT, to compute a transitive closure
  3. achieving more complex cycle elimination.

Depth-first or breadth-first

I already covered this previously (in paragraph retrieving a full tree) but let’s recap the problem and solution. We assume we have a table of people with a parent-child
relationship:

This is a part of the genealogy of some ancient French kings, around the 8th century AD. In a picture:

Now we want to know all descendants of Thurimbert:

Breadth-first ordering means we want to see all children (direct descendants) first,  grouped together, and only then should their own children appear. This is what the query returned, but you should not rely on it (per the usual rule in SQL: “no ORDER BY specified? no predictable order!”). As direct descendants all appear at the first iteration, we just need to count iterations in a new “level” column, and sort by it:

Depth-first ordering means we want to see children grouped immediately under their parent. For that we build a “path” column and sort by it:

I had already mentioned the necessity of CAST here : the type of the “path” column is determined from the initial SELECT, and it has to be long enough to fit the path of the deepest leaf in the tree: CAST ensures that.

In the SQL standard, there is an optional SEARCH BY clause to request breadth-first or depth-first ordering; as we saw above, it is easy to emulate in MySQL with a “level” or “path” column.

Computing transitive closures with simple cycle avoidance

In year 2300, the Earth is overcrowded and people are encouraged to emigrate to nearby planets by boarding on these space rockets:

Note that the rulers of the Earth have intentionally not set up any way to return from those planets back to the Earth.

Let’s list all destinations which can be reached from the Earth: we first find the planets which can be directly reached, then we find what can be directly reached from these first planets, and so on:

Now it is year 2400 and the population of the Earth has decreased so much that the rulers decide to bring some emigrates back, so they set up a new rocket from Saturn to Earth:

Let’s repeat our query to list all destinations which can be reached from the Earth:

The query seemed to never finish, I had to interrupt it manually (note that starting from MySQL 8.0.3, the query would instead abort automatically, thanks to @@cte_max_recursion_depth). Why is it so? We have a cycle in our data: starting from the Earth, we add Mars, then Jupiter, then Saturn, then the Earth (because of the new rocket), so we’re back to the beginning – the Earth. Then we add Mars, and so on, forever.

Setting max_execution_time will surely ensure that such runaway query is automatically killed, but… how do we modify the query to work properly, i.e. not run away?

A simple solution is to not add a planet to the result if it’s already there. This elimination of duplicates is done by using UNION DISTINCT instead of UNION ALL:

More complex cycle avoidance

So we have a nicely working query with UNION DISTINCT. Let’s extend it just a bit, to additionally show the time taken by each trip (pretty innocent idea, right?):

What? again it never ends? Indeed, consider what the first generated rows must be:

  • Initial SELECT: Mars, 2
  • Iteration 1: Jupiter, 5 (2+3)
  • Iteration 2: Saturn, 9 (5+4)
  • Iteration 3: Earth, 18 (9+9)
  • Iteration 4: Mars, 20 (18+2)

The time of the Earth->Mars->Jupiter->Saturn->Earth->Mars trip is 20, while the time of the Earth->Mars trip is 2; the two trips now form two different rows as the their differing time is part of the rows: (Mars, 2) and (Mars, 20). UNION DISTINCT will not help with cycle elimination, anymore, as it eliminates only rows for which all columns match… So (Mars, 20) is put into the result, and it leads to (Jupiter, 23) and so on.

The SQL standard has an optional CYCLE clause for that, which we can easily emulate: the solution is to build a “path” column (as in the depth/breadth-first paragraph above), and break when we see we’re adding something which is already in the path. We needn’t use DISTINCT anymore so we use ALL which avoids the overhead of (pointless) duplicate elimination:

Note that if the “row keys” (the planet’s name, here) are allowed to contain a comma, it may be safer to encode the keys when adding them to the path, for example with TO_BASE64 or HEX, so that FIND_IN_SET works reliably.

If, in your particular use case, cycles are considered an anomaly which you want to hunt for, a variant of our query will flag them. We just need to allow the cycle to enter our result and advertise itself (with a new “is_cycle” column), but we still must prevent the query from running forever, by not iterating on rows where is_cycle=1:

A “1” in the “is_cycle” column points us to the cycle.

That’s all for today. I hope the above tricks will help you write powerful queries to navigate your hierarchical or graph data. As always: Thank you for using MySQL!

11 thoughts on “MySQL 8.0.1: [Recursive] Common Table Expressions in MySQL (CTEs), Part Four – depth-first or breadth-first traversal, transitive closure, cycle avoidance

  1. Great post! Thank you so much.

    Could you help me with the following sample. I do have a solution for PostgresSQL but not for mySQL (see https://academy.vertabelo.com/blog/get-to-know-the-power-of-sql-recursive-queries/)

    I’d like to get all possible paths between nodes which are connected via directional links and an associated length-value; e.g. node 1 is connected with node 2 and the link has a length of 100.

    Sample of graph:
    1 –(100)–> 2
    1 —(70)–> 3
    2 —(20)–> 3
    2 —(30)–> 5
    3 —(20)–> 2
    3 —(40)–> 4
    4 —(20)–> 5
    4 —(80)–> 6
    5 —(20)–> 6

    I’d like to get all paths from node 1 to node 6 ordered by total path length!
    e.g.
    path 1 is via ids {1,3,2,5,6} with a length of 140
    path 2 is via ids {1,2,5,6} with a length of 150
    etc.

    1. Hello. Yes it’s probably possible to do this in MySQL now. Could you please provide the CREATE TABLE and INSERT statements to create the base tables (named ‘node’ and ‘link’, I think)?

      1. Yes of course!

        CREATE TABLE node (
        id int(11) NOT NULL,
        name varchar(45) COLLATE utf8_unicode_ci NOT NULL,
        PRIMARY KEY (id,name)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

        INSERT INTO node (id, name) VALUES (1, ‘node 1’);
        INSERT INTO node (id, name) VALUES (2, ‘node 2’);
        INSERT INTO node (id, name) VALUES (3, ‘node 3’);
        INSERT INTO node (id, name) VALUES (4, ‘node 4’);
        INSERT INTO node (id, name) VALUES (5, ‘node 5’);
        INSERT INTO node (id, name) VALUES (6, ‘node 6’);

        CREATE TABLE link (
        node_from_id int(11) NOT NULL,
        node_to_id int(11) NOT NULL,
        length int(11) DEFAULT NULL,
        PRIMARY KEY (node_from_id,node_to_id),
        KEY node_to_id_idx (node_to_id),
        CONSTRAINT node_from_id FOREIGN KEY (node_from_id) REFERENCES node (id),
        CONSTRAINT node_to_id FOREIGN KEY (node_to_id) REFERENCES node (id)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

        INSERT INTO link (node_from_id, node_to_id, length) VALUES (1, 2, 100);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (1, 3, 70);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (2, 3, 20);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (2, 5, 30);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (3, 2, 20);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (3, 4, 40);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (4, 5, 20);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (4, 6, 80);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (5, 6, 20);

        1. Hi Florian. So here’s a query which gives the expected result:

          WITH RECURSIVE

          node_links_select AS (
          SELECT
          n1.id AS node_id,
          n1.name AS node_name,
          n2.id AS destination_node_id,
          n2.name AS destination_node_name,
          l.length AS link_length
          FROM
          node n1
          JOIN link l ON (n1.id = l.node_from_id)
          JOIN node n2 ON (l.node_to_id = n2.id)
          ),

          search_path (path_ids, length, is_visited) AS
          (
          SELECT
          JSON_ARRAY(node_id, destination_node_id),
          link_length,
          node_id = destination_node_id
          FROM node_links_select

          UNION ALL

          SELECT
          JSON_ARRAY_APPEND(path_ids, ‘$’, d.destination_node_id),
          f.length + d.link_length,
          JSON_CONTAINS(f.path_ids, CAST(d.destination_node_id AS JSON) )
          FROM node_links_select d,
          search_path f
          WHERE
          f.path_ids->’$[last]’ = d.node_id
          AND NOT f.is_visited
          )

          SELECT * FROM search_path
          WHERE path_ids->’$[0]’ = 1 AND path_ids->’$[last]’ = 6
          ORDER BY length;

          I took your PostgreSQL query and did the following changes:
          – MySQL doesn’t have an array type, but it has a JSON type, in particular it supports JSON arrays, so I’m using a JSON array
          – PG’s || concatenation becomes json_array_append
          – PG’s =ANY becomes json_contains (the value to search for must be a valid JSON value, so I cast the integer to JSON)
          – PG’s array indexing with the length:
          f.path_ids[array_length(path_ids, 1)] = d.node_id becomes
          f.path_ids->’$[last]’ = d.node_id
          – the recursive CTE (search_path) references the non-recursive one (node_links_select) so I swapped their definitions.

          Result:
          +——————–+——–+————+
          | path_ids | length | is_visited |
          +——————–+——–+————+
          | [1, 3, 2, 5, 6] | 140 | 0 |
          | [1, 2, 5, 6] | 150 | 0 |
          | [1, 3, 4, 5, 6] | 150 | 0 |
          | [1, 3, 4, 6] | 190 | 0 |
          | [1, 2, 3, 4, 5, 6] | 200 | 0 |
          | [1, 2, 3, 4, 6] | 240 | 0 |
          +——————–+——–+————+
          6 rows in set

          PS: if you’re interested, the reference for json functions is at https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html
          https://dev.mysql.com/doc/refman/8.0/en/json-creation-functions.html

          1. Hi Guilhem. Thanks a lot. I appreciate your help! Your implementation and explanation told me how to do it right.

            PS: For some reason I had to replace the single quotes used with the path-operators by double quotes, e.g. line 33: f.path_ids->‘$[last]‘ = d.node_id => f.path_ids->”$[last]” = d.node_id

            1. You’re right; all quote symbols which WordPress’ formatting has inserted in my query should be changed to ordinary ASCII single-quotes, or ordinary ASCII double-quotes (except that the latter won’t work if @@sql_mode contains the ANSI_QUOTES flag).

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter * Time limit is exhausted. Please reload CAPTCHA.