MySQL 8.0.1: [Recursive] Common Table Expressions in MySQL (CTEs), Part Four – depth-first or breadth-first traversal, transitive closure, cycle avoidance

CTEs and Recursive CTEs appeared in MySQL 8.0; first in a Labs release and now in the official release 8.0.1. Several blogs have been published: herehere and here ; my colleague Øystein also wrote about how using a CTE made one DBT3 query run twice faster.

Today I will continue the series, focusing on these points of recursive CTEs:

  1. obtaining depth-first or breadth-first ordering of the result
  2. doing simple cycle elimination with UNION DISTINCT, to compute a transitive closure
  3. achieving more complex cycle elimination.

Depth-first or breadth-first

I already covered this previously (in paragraph retrieving a full tree) but let’s recap the problem and solution. We assume we have a table of people with a parent-child

This is a part of the genealogy of some ancient French kings, around the 8th century AD. In a picture:

Now we want to know all descendants of Thurimbert:

Breadth-first ordering means we want to see all children (direct descendants) first,  grouped together, and only then should their own children appear. This is what the query returned, but you should not rely on it (per the usual rule in SQL: “no ORDER BY specified? no predictable order!”). As direct descendants all appear at the first iteration, we just need to count iterations in a new “level” column, and sort by it:

Depth-first ordering means we want to see children grouped immediately under their parent. For that we build a “path” column and sort by it:

I had already mentioned the necessity of CAST here : the type of the “path” column is determined from the initial SELECT, and it has to be long enough to fit the path of the deepest leaf in the tree: CAST ensures that.

In the SQL standard, there is an optional SEARCH BY clause to request breadth-first or depth-first ordering; as we saw above, it is easy to emulate in MySQL with a “level” or “path” column.

Computing transitive closures with simple cycle avoidance

In year 2300, the Earth is overcrowded and people are encouraged to emigrate to nearby planets by boarding on these space rockets:

Note that the rulers of the Earth have intentionally not set up any way to return from those planets back to the Earth.

Let’s list all destinations which can be reached from the Earth: we first find the planets which can be directly reached, then we find what can be directly reached from these first planets, and so on:

Now it is year 2400 and the population of the Earth has decreased so much that the rulers decide to bring some emigrates back, so they set up a new rocket from Saturn to Earth:

Let’s repeat our query to list all destinations which can be reached from the Earth:

The query seemed to never finish, I had to interrupt it manually (note that starting from MySQL 8.0.3, the query would instead abort automatically, thanks to @@cte_max_recursion_depth). Why is it so? We have a cycle in our data: starting from the Earth, we add Mars, then Jupiter, then Saturn, then the Earth (because of the new rocket), so we’re back to the beginning – the Earth. Then we add Mars, and so on, forever.

Setting max_execution_time will surely ensure that such runaway query is automatically killed, but… how do we modify the query to work properly, i.e. not run away?

A simple solution is to not add a planet to the result if it’s already there. This elimination of duplicates is done by using UNION DISTINCT instead of UNION ALL:

More complex cycle avoidance

So we have a nicely working query with UNION DISTINCT. Let’s extend it just a bit, to additionally show the time taken by each trip (pretty innocent idea, right?):

What? again it never ends? Indeed, consider what the first generated rows must be:

  • Initial SELECT: Mars, 2
  • Iteration 1: Jupiter, 5 (2+3)
  • Iteration 2: Saturn, 9 (5+4)
  • Iteration 3: Earth, 18 (9+9)
  • Iteration 4: Mars, 20 (18+2)

The time of the Earth->Mars->Jupiter->Saturn->Earth->Mars trip is 20, while the time of the Earth->Mars trip is 2; the two trips now form two different rows as the their differing time is part of the rows: (Mars, 2) and (Mars, 20). UNION DISTINCT will not help with cycle elimination, anymore, as it eliminates only rows for which all columns match… So (Mars, 20) is put into the result, and it leads to (Jupiter, 23) and so on.

The SQL standard has an optional CYCLE clause for that, which we can easily emulate: the solution is to build a “path” column (as in the depth/breadth-first paragraph above), and break when we see we’re adding something which is already in the path. We needn’t use DISTINCT anymore so we use ALL which avoids the overhead of (pointless) duplicate elimination:

Note that if the “row keys” (the planet’s name, here) are allowed to contain a comma, it may be safer to encode the keys when adding them to the path, for example with TO_BASE64 or HEX, so that FIND_IN_SET works reliably.

If, in your particular use case, cycles are considered an anomaly which you want to hunt for, a variant of our query will flag them. We just need to allow the cycle to enter our result and advertise itself (with a new “is_cycle” column), but we still must prevent the query from running forever, by not iterating on rows where is_cycle=1:

A “1” in the “is_cycle” column points us to the cycle.

That’s all for today. I hope the above tricks will help you write powerful queries to navigate your hierarchical or graph data. As always: Thank you for using MySQL!

19 thoughts on “MySQL 8.0.1: [Recursive] Common Table Expressions in MySQL (CTEs), Part Four – depth-first or breadth-first traversal, transitive closure, cycle avoidance

  1. Great post! Thank you so much.

    Could you help me with the following sample. I do have a solution for PostgresSQL but not for mySQL (see

    I’d like to get all possible paths between nodes which are connected via directional links and an associated length-value; e.g. node 1 is connected with node 2 and the link has a length of 100.

    Sample of graph:
    1 –(100)–> 2
    1 —(70)–> 3
    2 —(20)–> 3
    2 —(30)–> 5
    3 —(20)–> 2
    3 —(40)–> 4
    4 —(20)–> 5
    4 —(80)–> 6
    5 —(20)–> 6

    I’d like to get all paths from node 1 to node 6 ordered by total path length!
    path 1 is via ids {1,3,2,5,6} with a length of 140
    path 2 is via ids {1,2,5,6} with a length of 150

    1. Hello. Yes it’s probably possible to do this in MySQL now. Could you please provide the CREATE TABLE and INSERT statements to create the base tables (named ‘node’ and ‘link’, I think)?

      1. Yes of course!

        CREATE TABLE node (
        id int(11) NOT NULL,
        name varchar(45) COLLATE utf8_unicode_ci NOT NULL,
        PRIMARY KEY (id,name)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

        INSERT INTO node (id, name) VALUES (1, ‘node 1’);
        INSERT INTO node (id, name) VALUES (2, ‘node 2’);
        INSERT INTO node (id, name) VALUES (3, ‘node 3’);
        INSERT INTO node (id, name) VALUES (4, ‘node 4’);
        INSERT INTO node (id, name) VALUES (5, ‘node 5’);
        INSERT INTO node (id, name) VALUES (6, ‘node 6’);

        CREATE TABLE link (
        node_from_id int(11) NOT NULL,
        node_to_id int(11) NOT NULL,
        length int(11) DEFAULT NULL,
        PRIMARY KEY (node_from_id,node_to_id),
        KEY node_to_id_idx (node_to_id),
        CONSTRAINT node_from_id FOREIGN KEY (node_from_id) REFERENCES node (id),
        CONSTRAINT node_to_id FOREIGN KEY (node_to_id) REFERENCES node (id)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

        INSERT INTO link (node_from_id, node_to_id, length) VALUES (1, 2, 100);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (1, 3, 70);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (2, 3, 20);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (2, 5, 30);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (3, 2, 20);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (3, 4, 40);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (4, 5, 20);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (4, 6, 80);
        INSERT INTO link (node_from_id, node_to_id, length) VALUES (5, 6, 20);

        1. Hi Florian. So here’s a query which gives the expected result:


          node_links_select AS (
 AS node_id,
 AS node_name,
 AS destination_node_id,
 AS destination_node_name,
          l.length AS link_length
          node n1
          JOIN link l ON ( = l.node_from_id)
          JOIN node n2 ON (l.node_to_id =

          search_path (path_ids, length, is_visited) AS
          JSON_ARRAY(node_id, destination_node_id),
          node_id = destination_node_id
          FROM node_links_select

          UNION ALL

          JSON_ARRAY_APPEND(path_ids, ‘$’, d.destination_node_id),
          f.length + d.link_length,
          JSON_CONTAINS(f.path_ids, CAST(d.destination_node_id AS JSON) )
          FROM node_links_select d,
          search_path f
          f.path_ids->’$[last]’ = d.node_id
          AND NOT f.is_visited

          SELECT * FROM search_path
          WHERE path_ids->’$[0]’ = 1 AND path_ids->’$[last]’ = 6
          ORDER BY length;

          I took your PostgreSQL query and did the following changes:
          – MySQL doesn’t have an array type, but it has a JSON type, in particular it supports JSON arrays, so I’m using a JSON array
          – PG’s || concatenation becomes json_array_append
          – PG’s =ANY becomes json_contains (the value to search for must be a valid JSON value, so I cast the integer to JSON)
          – PG’s array indexing with the length:
          f.path_ids[array_length(path_ids, 1)] = d.node_id becomes
          f.path_ids->’$[last]’ = d.node_id
          – the recursive CTE (search_path) references the non-recursive one (node_links_select) so I swapped their definitions.

          | path_ids | length | is_visited |
          | [1, 3, 2, 5, 6] | 140 | 0 |
          | [1, 2, 5, 6] | 150 | 0 |
          | [1, 3, 4, 5, 6] | 150 | 0 |
          | [1, 3, 4, 6] | 190 | 0 |
          | [1, 2, 3, 4, 5, 6] | 200 | 0 |
          | [1, 2, 3, 4, 6] | 240 | 0 |
          6 rows in set

          PS: if you’re interested, the reference for json functions is at

          1. Hi Guilhem. Thanks a lot. I appreciate your help! Your implementation and explanation told me how to do it right.

            PS: For some reason I had to replace the single quotes used with the path-operators by double quotes, e.g. line 33: f.path_ids->‘$[last]‘ = d.node_id => f.path_ids->”$[last]” = d.node_id

            1. You’re right; all quote symbols which WordPress’ formatting has inserted in my query should be changed to ordinary ASCII single-quotes, or ordinary ASCII double-quotes (except that the latter won’t work if @@sql_mode contains the ANSI_QUOTES flag).

  2. Hi Guilhem. Suppose I want to sort differently to parent/child id’s. Say I use a numerical column to define how the rows are sorted. How would i recursively implement this.

    Regards Phill

      1. Hi Guilhem,

        The purpose is to list projects, tasks and sub-tasks etc.

        I’m using a recursive cte query to generate a path (a “WBS” of sorts) using a “listorder” column to control the ranking (eg “1.2.3”). However, as the path is a string, “20” displays before “3”, “30” displays before “4” and so on.

        To get around this problem, I use a custom function to add leading zeros to each path “segment”. ( eg “001.002.003”). So, 1.2.3 lists before 1.2.10.

        I hoped there was a smarter way.

        Example data: (Table name = “projects”)
        | ID | PID | listorder | name |
        | 1000 | NULL | 1 | root |
        | 1001 | 1000 | 3 | child c |
        | 1002 | 1000 | 1 | child a |
        | 1003 | 1000 | 2 | child b |
        | 1004 | 1002 | 1 | child a.1 |
        | 1005 | 1002 | 2 | child a.2 |
        | 1006 | 1005 | 2 | child c.2.2 |
        | 1007 | 1005 | 1 | child c.2.1 |

        Current Query:

        WITH RECURSIVE proj as (

        SELECT ID, PID, Name, ListOrder,
        CAST(projects.ListOrder AS char(512)) AS path,
        0 AS depth
        FROM projects
        WHERE projects.ProjectPID IS NULL

        UNION ALL
        SELECT ID, PID, Name, ListOrder,
        CONCAT(proj.path, “.”, projects.ListOrder),
        proj.depth + 1
        FROM projects JOIN proj ON proj.ProjectID = projects.ProjectPID
        SELECT *
        FROM proj JOIN projects ON proj.ProjectID = projects.ProjectID

        Order by
        MPATH(COALESCE(path,0),(select Length(max(projects.ListOrder)) FROM projects))

        Note: MPATH is my custom function that takes 2 parameters. The first is the path string and the second is an integer that sets the number of characters per segment.

        Hopefully this clarifies.

        Regards, Phill

        1. Your query is ordering solely on the “path”; “child a” and “child a.1” have the same parent (root) and the same listorder (1) so they have the same “path” (1.1 and 1.1), so they will come out sorted next to each other. So, sub-tasks of “child a” will be separated from “child a” in the output. Is it really what’s wanted? What’s the desired output’s order, for the sample data you showed?

          1. The order is

            |1000 | NULL | 1 | root |
            | 1002 | 1000 | 1 | child a |
            | 1004 | 1002 | 1 | child a.1 |
            | 1005 | 1002 | 2 | child a.2 |
            | 1003 | 1000 | 2 | child b |
            | 1001 | 1000 | 3 | child c |
            | 1007 | 1005 | 1 | child c.2.1 |
            | 1006 | 1005 | 2 | child c.2.2 |

            I need to control the order in which a child nodes are listed

            I get the correct ordering until list order is in single digits. beyond that, I’m not getting the correct ordering. as 10 for example appears before 2 as the number is converted to a string.

            eg the row with a path “1.2.10” appears before the row with a path of “1.2.2”. What I want is the row to appear after the row that has a path of “1.2.9”.

            1. Assuming you want all children listed under their parent, and between these children, you want them ordered by “listorder”, then I would build a “path” which is made of the IDs of all parents from root to current node (excluding current node), then I’d do
              order by path, listorder;
              This way, all children of the same parent would be together under their parent, as they would have the same path, and inside their group they would be sorted by listorder.
              Such path made of PIDs may not have the padding issue, if all PIDs are of same length.
              I still think that using only a concatenation of listorders cannot work, because “child a” and “child a.1” have the same listorder so they would sort next to each other.

Leave a Reply