[MDEV-13482] CONNECT BY computation algorithm Created: 2017-08-09  Updated: 2019-10-17

Status: Open
Project: MariaDB Server
Component/s: None
Fix Version/s: None

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
PartOf
is part of MDEV-13428 Oracle-compatible recursive queries w... Open

 Description   

Background - how Oracle does it

//Vicentiu, based on Oracle's documentation:

Oracle selects the child rows of each root row. Each child row must satisfy the 
condition of the CONNECT BY condition with respect to one of the root rows.
 
Oracle selects successive generations of child rows. Oracle first selects the 
children of the  rows returned in step 2, and then the children of those children, 
and so on. Oracle always selects children by evaluating the CONNECT BY 
condition with respect to a current parent row.

This describes a breadth-first traversal.

  • Also, the docs specify that output rows are returned in the order that one would get if they were doing a depth-first traversal (search for (Oracle returns the rows in the order shown in Figure 9-1).
  • An experiment has shown that depth-first traversal is used: MDEV-13473.

Looking at CONNECT BY from Recursive CTE point of view

  • CONNECT BY recursion is always linear (that is, only last-generation items contribute to the next generation).
  • CONNECT BY recursion detects cycles (and either aborts or stops producing further offspring there)
    • cycle detection is based on remembering all ancestors for each elements and not producing elements that have themselves as ancestors.

Computing CONNECT BY as a Recursive CTE

  • Recursive CTE code does a breadth-first computation.
  • As far as our understanding goes, one can use breadth-first computation to compute CONNECT BY queries (but is it a good idea?)

Detecting loops

The documentation says:

A loop occurs if one row is both the parent (or grandparent or direct ancestor)
and a child (or a grandchild or a direct descendent) of another row.

When CONNECT BY does not have NOCYCLE specifier, we only need to detect looping and abort the query with an error.

When NOCYCLE specifier is used, we need to stop before row R (that is, not include R in the query output or in further computations) if R together with its ancestors forms a loop. R's parent row will have CYCLE=1, which is a bit counter-intuitive, as the cycle has not formed yet (See https://lists.launchpad.net/maria-developers/msg10840.html for details).

TODO: How does one produce values for the CYCLE column? We will need to perform either "lookahead" (to see if current row has children) or "back-patching" (the value of CYCLE of some row R1 will be computed when we try to compute children(R1)). Considering that connect_by_expr may use CYCLE itself, lookahead might be the only option?

The algorithm

TODO

Other details

  • A select with CONNECT BY construct is not a derived table, so if it is a subquery, it may have outer references and so will be re-executed multiple
    times.


 Comments   
Comment by Vicențiu Ciorbaru [ 2017-08-23 ]

There is a bit of confusion with the CONNECT_BY_ISCYCLE (I'll use CYCLE for short) column. The above mentioned explanation that we set the CYCLE value to be 1 for the parent of the row R which would cause a cycle if it were added to the list. This seems to hold for most examples, however here is a counter example:

create table t2 (row_id char, id int, parent int);
insert into t2 values ('a', 1, NULL);
insert into t2 values ('b', 2, 1);
insert into t2 values ('c', 3, 2);
insert into t2 values ('d', 4, 3);
insert into t2 values ('e', 5, 4);
insert into t2 values ('f', 6, 5);
insert into t2 values ('g', 2, 6);

This table has a simple loop within it. Running the following connect by query seems to corroborate the theory described in the entry.

SELECT row_id, id, parent, connect_by_iscycle, SYS_CONNECT_BY_PATH(row_id || id, '/') "Path"
from t2
start with id = 1
connect by nocycle prior id = parent;
 
| ROW_ID | ID | PARENT | CONNECT_BY_ISCYCLE |               Path |
|--------|----|--------|--------------------|--------------------|
|      a |  1 | (null) |                  0 |                /a1 |
|      b |  2 |      1 |                  0 |             /a1/b2 |
|      c |  3 |      2 |                  0 |          /a1/b2/c3 |
|      d |  4 |      3 |                  0 |       /a1/b2/c3/d4 |
|      e |  5 |      4 |                  0 |    /a1/b2/c3/d4/e5 |
|      f |  6 |      5 |                  1 | /a1/b2/c3/d4/e5/f6 |

As we can see, there is no 'g2' entry in the list, since if g2 were present, we would have a loop through b2, given this connect by condition.

Now lets try a different example:

create table t1 (id int, fst_child int, snd_child int);
 
insert into t1 values (1, 2, NULL);
insert into t1 values (2, 3, 4);
insert into t1 values (3, 5, NULL);
insert into t1 values (4, 5, NULL);
insert into t1 values (5, 6, NULL);
insert into t1 values (6, 2, 4); 

Here, the relationship is expressed in a different direction. Instead of children having references to parents, nodes have references to their children.

Here the relationship diagram looks like this:

--    1
--    |
--    2   <--|
--   / \     |    
--  3   4  <-|   
--   \ /     |    
--    5      | 
--    |      | 
--    6 -----|

There are 2 features in this graph:
1) We have a diamond shape.
2) We have 2 possible cycles, from 6 to 2, and from 6 to 4.

Let's run a connect by with this graph, with no cycles, so row with id 6 has no children and is defined like this instead:

delete from t1 where id = 6;
insert into t1 values (6, NULL, NULL);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
 
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |       Path |
|----|-----------|-----------|--------------------|------------|
|  1 |         2 |    (null) |                  0 |         /1 |
|  2 |         3 |         4 |                  0 |       /1/2 |
|  3 |         5 |    (null) |                  0 |     /1/2/3 |
|  5 |         6 |    (null) |                  0 |   /1/2/3/5 |
|  6 |    (null) |    (null) |                  0 | /1/2/3/5/6 |
|  4 |         5 |    (null) |                  0 |     /1/2/4 |
|  5 |         6 |    (null) |                  0 |   /1/2/4/5 |
|  6 |    (null) |    (null) |                  0 | /1/2/4/5/6 |

In this case we have no loops and the diamond inheritance generates two paths, one going through node 4 and one going through node 3.

Now lets add a loop through node 6 to node 2.

delete from t1 where id = 6;
insert into t1 values (6, 2, NULL);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
 
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |     Path |
|----|-----------|-----------|--------------------|----------|
|  1 |         2 |    (null) |                  0 |       /1 |
|  2 |         3 |         4 |                  0 |     /1/2 |
|  3 |         5 |    (null) |                  0 |   /1/2/3 |
|  5 |         6 |    (null) |                  1 | /1/2/3/5 |
|  4 |         5 |    (null) |                  0 |   /1/2/4 |
|  5 |         6 |    (null) |                  1 | /1/2/4/5 |

As expected, we are missing row with id 6, as it would introduce a cycle if it were added to the result set. It's parent row number 5 has CYCLE set to 1.
So far so good. Now come the strange examples.

Lets change the loop from going to node with id 2 to go to node with id 4.

delete from t1 where id = 6;
insert into t1 values (6, 4, NULL);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |       Path |
|----|-----------|-----------|--------------------|------------|
|  1 |         2 |    (null) |                  0 |         /1 |
|  2 |         3 |         4 |                  0 |       /1/2 |
|  3 |         5 |    (null) |                  0 |     /1/2/3 |
|  5 |         6 |    (null) |                  0 |   /1/2/3/5 |
|  6 |         4 |    (null) |                  1 | /1/2/3/5/6 |
|  4 |         5 |    (null) |                  0 |     /1/2/4 |
|  5 |         6 |    (null) |                  0 |   /1/2/4/5 |
|  6 |         4 |    (null) |                  1 | /1/2/4/5/6 |

This case has row with id 6 show up TWICE and it in turn has cycle set to 1. This doesn't change wether we go through the diamond by node 3 or by node 4.

An even stranger result happens if 6 has children both 2 and 4.

delete from t1 where id = 6;
insert into t1 values (6, 2, 4);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |       Path |
|----|-----------|-----------|--------------------|------------|
|  1 |         2 |    (null) |                  0 |         /1 |
|  2 |         3 |         4 |                  0 |       /1/2 |
|  3 |         5 |    (null) |                  0 |     /1/2/3 |
|  5 |         6 |    (null) |                  0 |   /1/2/3/5 |
|  6 |         2 |         4 |                  1 | /1/2/3/5/6 |
|  4 |         5 |    (null) |                  0 |     /1/2/4 |
|  5 |         6 |    (null) |                  0 |   /1/2/4/5 |
|  6 |         2 |         4 |                  1 | /1/2/4/5/6 |

In this case we get the same result as if we had a loop through 4 only.

To further add to the confusion, lets set fst_child and snd_child both to equal 2 for node 6.

delete from t1 where id = 6;
insert into t1 values (6, 2, 2);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
 
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |       Path |
|----|-----------|-----------|--------------------|------------|
|  1 |         2 |    (null) |                  0 |         /1 |
|  2 |         3 |         4 |                  0 |       /1/2 |
|  3 |         5 |    (null) |                  0 |     /1/2/3 |
|  5 |         6 |    (null) |                  0 |   /1/2/3/5 |
|  6 |         2 |         2 |                  1 | /1/2/3/5/6 |
|  4 |         5 |    (null) |                  0 |     /1/2/4 |
|  5 |         6 |    (null) |                  0 |   /1/2/4/5 |
|  6 |         2 |         2 |                  1 | /1/2/4/5/6 |

This produces different results than if we only have fst_child set to 2 and snd_child set to NULL.

The same result is present if we have 4 for both fst_child and snd_child.

delete from t1 where id = 6;
insert into t1 values (6, 2, 2);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
 
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |       Path |
|----|-----------|-----------|--------------------|------------|
|  1 |         2 |    (null) |                  0 |         /1 |
|  2 |         3 |         4 |                  0 |       /1/2 |
|  3 |         5 |    (null) |                  0 |     /1/2/3 |
|  5 |         6 |    (null) |                  0 |   /1/2/3/5 |
|  6 |         4 |         4 |                  1 | /1/2/3/5/6 |
|  4 |         5 |    (null) |                  0 |     /1/2/4 |
|  5 |         6 |    (null) |                  0 |   /1/2/4/5 |
|  6 |         4 |         4 |                  1 | /1/2/4/5/6 |

Comment by Vicențiu Ciorbaru [ 2017-08-23 ]

CC: zhangyuan plinux Is there a way to explain the behaviour from above?

Comment by jametong [ 2017-08-24 ]

1. the g2 in the first example is a reasonable result, it should not be contained in the result.

create table t2 (row_id char, id int, parent int);
insert into t2 values ('a', 1, NULL);
insert into t2 values ('b', 2, 1);
insert into t2 values ('c', 3, 2);
insert into t2 values ('d', 4, 3);
insert into t2 values ('e', 5, 4);
insert into t2 values ('f', 6, 5);
insert into t2 values ('g', 2, 6);
 
SELECT row_id, id, parent, connect_by_iscycle, SYS_CONNECT_BY_PATH(row_id || id, '/') "Path"
from t2
start with id = 1
connect by nocycle prior id = parent;
 
| ROW_ID | ID | PARENT | CONNECT_BY_ISCYCLE |               Path |
|--------|----|--------|--------------------|--------------------|
|      a |  1 | (null) |                  0 |                /a1 |
|      b |  2 |      1 |                  0 |             /a1/b2 |
|      c |  3 |      2 |                  0 |          /a1/b2/c3 |
|      d |  4 |      3 |                  0 |       /a1/b2/c3/d4 |
|      e |  5 |      4 |                  0 |    /a1/b2/c3/d4/e5 |
|       f |  6 |      5 |                  1 | /a1/b2/c3/d4/e5/f6 |

We can just assume g2 can be showed above, then the result should be "| g | 2 | 6 | 1 | /a1/b2/c3/d4/e5/f6/g2 |",we can find that "id=2" appeared twice in the hierarchical tree, it violate the cycle semantic, so the g2 should not be returned in the result.

2. I think "the diamond example with first child of 2 second child of null" is an Oracle bug.

create table t1 (id int, fst_child int, snd_child int);
 
--base data
insert into t1 values (1, 2, NULL);
insert into t1 values (2, 3, 4);
insert into t1 values (3, 5, NULL);
insert into t1 values (4, 5, NULL);
insert into t1 values (5, 6, NULL);
insert into t1 values (6, 2, 4); 
 
--construct a cycle with id=6 to id=2
delete from t1 where id = 6;
insert into t1 values (6, 2, NULL);
 
SELECT id, fst_child, snd_child, connect_by_iscycle, SYS_CONNECT_BY_PATH(id, '/') "Path"
from t1
start with id = 1
connect by nocycle ((id = prior fst_child) or (id = prior snd_child));
 
| ID | FST_CHILD | SND_CHILD | CONNECT_BY_ISCYCLE |     Path |
|----|-----------|-----------|--------------------|----------|
|  1 |         2 |    (null) |                  0 |       /1 |
|  2 |         3 |         4 |                  0 |     /1/2 |
|  3 |         5 |    (null) |                  0 |   /1/2/3 |
|  5 |         6 |    (null) |                  1 | /1/2/3/5 |
|  4 |         5 |    (null) |                  0 |   /1/2/4 |
|  5 |         6 |    (null) |                  1 | /1/2/4/5 |

Since id=5 to id=6 will not get a cycle(duplicate value) in the hierarchical tree in the SYS_CONNECT_BY_PATH function, so I think the results should contain the below two rows for id=6

|  6 |         2 |         4 |                  1 | /1/2/3/5/6 |
|  6 |         4 |         4 |                  1 | /1/2/4/5/6 |

3. I think all the other results are correct, they all enforce the semantic "no cycle/duplicate value in the hierarchical tree in the SYS_CONNECT_BY_PATH function".

Comment by Alexander Bienemann (Inactive) [ 2018-01-23 ]

For implementing CONNECT BY as a CTE, the following code or similar can be used. Here with CONNECT_BY_ISLEAF:

select * from (
– transitive closure without CONNECT_BY_ISLEAF
WITH RECURSIVE trcl AS
(
select PARENT_ID p, CHILD_ID c, 1 depth, CONCAT('', IF(PARENT_ID ISNULL,'',PARENT_ID),'',CHILD_ID,'_') trace
from RELATION
where CHILD_ID in (select some_id from SOME_TABLE)
and PARENT_ID_TYPE = 'sometype'
UNION ALL
select r.PARENT_ID p, t.c c, t.depth+1, CONCAT('', IF(r.PARENT_ID ISNULL,'',r.PARENT_ID),'',t.trace)
from RELATION r, trcl t
where r.CHILD_ID = t.p and INSTR(t.trace,CONCAT('', IF(r.PARENT_ID ISNULL,'',r.PARENT_ID),'')) = 0
)
SELECT depth lev, c, p, trace
from (
– this is the equivalent to an “isleaf” alias
select t.p, t.c, t.depth, t.trace
from trcl t
where not exists (select 1 from RELATION s where s.CHILD_ID = t.p and IF(s.PARENT_ID IS NULL,'',s.PARENT_ID) != IF(s.CHILD_ID ISNULL,'',s.CHILD_ID) and INSTR(t.trace,CONCAT('',IF(s.PARENT_ID ISNULL,'',s.PARENT_ID),'')) = 0)
UNION ALL
select t.p, t.c, t.depth, t.trace
from trcl t
where exists (select 1 from RELATION s where s.CHILD_ID = t.p andIF(s.PARENT_ID IS NULL,'',s.PARENT_ID) != IF(s.CHILD_ID ISNULL,'',s.CHILD_ID) and INSTR(t.trace,CONCAT('',IF(s.PARENT_ID ISNULL,'',s.PARENT_ID),'')) = 0)
);

Generated at Thu Feb 08 08:05:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.