[MDEV-13352] Server crashes in st_join_table::remove_duplicates Created: 2017-07-19  Updated: 2020-08-25  Resolved: 2018-01-22

Status: Closed
Project: MariaDB Server
Component/s: Optimizer - Window functions
Affects Version/s: 10.2
Fix Version/s: 10.2.13

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-13344 Server crashes in in AGGR_OP::put_rec... Closed
relates to MDEV-16990 server crashes in base_list_iterator:... Closed
Sprint: 10.2.10, 10.2.12

 Description   

CREATE TABLE t1 (i INT);
INSERT INTO t1 VALUES (1),(2);
SELECT DISTINCT ROW_NUMBER() OVER(), i FROM t1 WHERE 0;

10.2 bc75c57cfc18be64f167d91c431076f581b0382b

#3  <signal handler called>
#4  0x00007fe7a608f32e in st_join_table::remove_duplicates (this=0x7fe78c0136e0) at /data/src/10.2/sql/sql_select.cc:21785
#5  0x00007fe7a608965d in join_init_read_record (tab=0x7fe78c0136e0) at /data/src/10.2/sql/sql_select.cc:19449
#6  0x00007fe7a609bf7e in AGGR_OP::end_send (this=0x7fe78c013c20) at /data/src/10.2/sql/sql_select.cc:26432
#7  0x00007fe7a6087192 in sub_select_postjoin_aggr (join=0x7fe78c013020, join_tab=0x7fe78c0136e0, end_of_records=true) at /data/src/10.2/sql/sql_select.cc:18263
#8  0x00007fe7a6086cce in do_select (join=0x7fe78c013020, procedure=0x0) at /data/src/10.2/sql/sql_select.cc:18094
#9  0x00007fe7a6061527 in JOIN::exec_inner (this=0x7fe78c013020) at /data/src/10.2/sql/sql_select.cc:3477
#10 0x00007fe7a60609f6 in JOIN::exec (this=0x7fe78c013020) at /data/src/10.2/sql/sql_select.cc:3278
#11 0x00007fe7a6061b9f in mysql_select (thd=0x7fe78c000b00, tables=0x7fe78c012878, wild_num=0, fields=..., conds=0x7fe78c012ea8, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2147748609, result=0x7fe78c013000, unit=0x7fe78c0045e8, select_lex=0x7fe78c004d20) at /data/src/10.2/sql/sql_select.cc:3672
#12 0x00007fe7a60564ea in handle_select (thd=0x7fe78c000b00, lex=0x7fe78c004520, result=0x7fe78c013000, setup_tables_done_option=0) at /data/src/10.2/sql/sql_select.cc:373
#13 0x00007fe7a60223c8 in execute_sqlcom_select (thd=0x7fe78c000b00, all_tables=0x7fe78c012878) at /data/src/10.2/sql/sql_parse.cc:6443
#14 0x00007fe7a6018407 in mysql_execute_command (thd=0x7fe78c000b00) at /data/src/10.2/sql/sql_parse.cc:3458
#15 0x00007fe7a6025d88 in mysql_parse (thd=0x7fe78c000b00, rawbuf=0x7fe78c012378 "SELECT DISTINCT ROW_NUMBER() OVER(), i FROM t1 WHERE 0", length=54, parser_state=0x7fe79dc0f200, is_com_multi=false, is_next_command=false) at /data/src/10.2/sql/sql_parse.cc:7879
#16 0x00007fe7a6013e34 in dispatch_command (command=COM_QUERY, thd=0x7fe78c000b00, packet=0x7fe78c168011 "", packet_length=54, is_com_multi=false, is_next_command=false) at /data/src/10.2/sql/sql_parse.cc:1817
#17 0x00007fe7a6012775 in do_command (thd=0x7fe78c000b00) at /data/src/10.2/sql/sql_parse.cc:1362
#18 0x00007fe7a615e4e7 in do_handle_one_connection (connect=0x7fe7a9a8b760) at /data/src/10.2/sql/sql_connect.cc:1354
#19 0x00007fe7a615e274 in handle_one_connection (arg=0x7fe7a9a8b760) at /data/src/10.2/sql/sql_connect.cc:1260
#20 0x00007fe7a64a5922 in pfs_spawn_thread (arg=0x7fe7a9b2df40) at /data/src/10.2/storage/perfschema/pfs.cc:1862
#21 0x00007fe7a5635064 in start_thread (arg=0x7fe79dc10700) at pthread_create.c:309
#22 0x00007fe7a381a62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111



 Comments   
Comment by Varun Gupta (Inactive) [ 2017-07-20 ]

In the function remove_duplicates st_join_table::remove_duplicates

List<Item> *fields= (this)->fields;

(gdb) p *this->fields
$29 = {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x7fffd4013ed8, last = 0x7fffd4013fe0, elements = 2}, <No data fields>}
(gdb) p *(this-1)->fields
$27 = {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x555556dbd9d8 <vtable for Window_spec+16>, last = 0xa5a5a5a5a5a5a501, elements = 0}, <No data fields>}
(gdb) p this->fields
$28 = (List<Item> *) 0x7fffd4013648

List_iterator<Item> it(*fields);

(gdb) p it
$20 = {<base_list_iterator> = {list = 0x7fffd4012660, el = 0x7fffd4012660, prev = 0x0, current = 0x0}, <No data fields>}

So here I see that we are referring to the incorrect fields by using (this-1)->fields
It looks that this->fields has the fields

Comment by Sergei Petrunia [ 2017-07-21 ]

Varun please review

Comment by Sergei Petrunia [ 2017-07-21 ]

http://lists.askmonty.org/pipermail/commits/2017-July/011312.html

Comment by Elena Stepanova [ 2017-07-21 ]

Still getting a crash with the same stack trace on bb-10.2-mdev13355:

4bdf9318b6bab1bf8d3718c50f87871a4ae5ec9b

#3  <signal handler called>
#4  0x00007fce9ade0390 in st_join_table::remove_duplicates (this=0x7fce80013ac0) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:21792
#7  0x00007fce9add81f4 in sub_select_postjoin_aggr (join=0x7fce80013198, join_tab=0x7fce80013ac0, end_of_records=true) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:18270
#8  0x00007fce9add7d30 in do_select (join=0x7fce80013198, procedure=0x0) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:18101
#9  0x00007fce9adb2589 in JOIN::exec_inner (this=0x7fce80013198) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:3484
#10 0x00007fce9adb1a38 in JOIN::exec (this=0x7fce80013198) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:3279
#11 0x00007fce9adb2c01 in mysql_select (thd=0x7fce80000b00, tables=0x7fce800129f0, wild_num=0, fields=..., conds=0x7fce80013020, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2147748609, result=0x7fce80013178, unit=0x7fce800045e8, select_lex=0x7fce80004d20) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:3679
#12 0x00007fce9ada750c in handle_select (thd=0x7fce80000b00, lex=0x7fce80004520, result=0x7fce80013178, setup_tables_done_option=0) at /data/src/bb-10.2-mdev13355/sql/sql_select.cc:373
#13 0x00007fce9ad733ea in execute_sqlcom_select (thd=0x7fce80000b00, all_tables=0x7fce800129f0) at /data/src/bb-10.2-mdev13355/sql/sql_parse.cc:6443
#14 0x00007fce9ad69429 in mysql_execute_command (thd=0x7fce80000b00) at /data/src/bb-10.2-mdev13355/sql/sql_parse.cc:3458
#15 0x00007fce9ad76daa in mysql_parse (thd=0x7fce80000b00, rawbuf=0x7fce80012378 "SELECT DISTINCT MAX(i), ROW_NUMBER() OVER () FROM t1 WHERE 0", length=60, parser_state=0x7fce92960200, is_com_multi=false, is_next_command=false) at /data/src/bb-10.2-mdev13355/sql/sql_parse.cc:7879
#16 0x00007fce9ad64e56 in dispatch_command (command=COM_QUERY, thd=0x7fce80000b00, packet=0x7fce80163bd1 "", packet_length=60, is_com_multi=false, is_next_command=false) at /data/src/bb-10.2-mdev13355/sql/sql_parse.cc:1817
#17 0x00007fce9ad63797 in do_command (thd=0x7fce80000b00) at /data/src/bb-10.2-mdev13355/sql/sql_parse.cc:1362
#18 0x00007fce9aeaf549 in do_handle_one_connection (connect=0x7fce9e6e4a50) at /data/src/bb-10.2-mdev13355/sql/sql_connect.cc:1354
#19 0x00007fce9aeaf2d6 in handle_one_connection (arg=0x7fce9e6e4a50) at /data/src/bb-10.2-mdev13355/sql/sql_connect.cc:1260
#20 0x00007fce9b1f6984 in pfs_spawn_thread (arg=0x7fce9e6c7620) at /data/src/bb-10.2-mdev13355/storage/perfschema/pfs.cc:1862
#21 0x00007fce9a386064 in start_thread (arg=0x7fce92961700) at pthread_create.c:309
#22 0x00007fce9856b62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

CREATE TABLE t1 (i INT);
INSERT INTO t1 VALUES (1),(2);
SELECT DISTINCT MAX(i), ROW_NUMBER() OVER () FROM t1 WHERE 0;

Here is also a variation of the test case with a less obvious degenerative condition, might be worth checking too; stack trace is the same.

CREATE TABLE t1 (i INT NOT NULL);
INSERT INTO t1 VALUES (1),(2);
SELECT DISTINCT MAX(i), ROW_NUMBER() OVER () FROM t1 WHERE i IS NULL;

Both test cases crash with InnoDB and MyISAM.

Comment by Sergei Petrunia [ 2017-12-26 ]

Link to the patch:https://github.com/MariaDB/server/commit/bd45c4377329bd0873ee64a5e3f8beca7d8e7c49?diff=unified

obvious flaws in the patch:

  • It mentions BUG#13353 while this bug is 13352.
  • The MTR testcase doesn't print "MDEV-nnnn" like other testscases do
  • Running mysql-test-run t/win.test after the patch fails like so

--- /home/psergey/dev-git/10.2-r5/mysql-test/r/win.result       2017-12-26 13:32:38.182427886 +0300
+++ /home/psergey/dev-git/10.2-r5/mysql-test/r/win.reject       2017-12-26 14:12:48.455733684 +0300
@@ -2782,7 +2782,7 @@
 NULL   NULL
 select max(a), row_number() over () from t1 where 1 = 2;
 max(a) row_number() over ()
-NULL   1
+NULL   NULL
 select max(a), sum(max(a)) over () from t1 where 1 = 2;
 max(a) sum(max(a)) over ()
 NULL   NULL

This is easily reproducible

Comment by Sergei Petrunia [ 2017-12-26 ]

(this is not immediately clear from comments so I'll mention it explicitly: my patch has been pushed, but apparently, there were cases that it did not fix.
Next time, please open another issue and make it related).

Vesa's patch builds on top of mine. It fixes Elena's crash but causes regression in another testcase in win.test.

Comment by Sergei Petrunia [ 2017-12-26 ]

Thoughts:

(Elena's crash from July,21)

  • remove_duplicates() assumes that there is a computation step done before the duplicate removal.
  • If the query has "Impossible where", there is no previous step, that's why we get a crash.
  • Q: why do we have duplicate removal step if we know there will be just one row? We could avoid it.

Wrong result after Vesa's patch

  • when the query has zero_result_cause="impossible WHERE", we call return_zero_rows()
  • which may still check HAVING and produce a result row, when appropriate.
  • It doesn't compute window function values though. This is wrong. When "impossible where" produced one "NULL-row" after grouping, then we need Window Functions computation step to process this row and compute window function value.
  • (Perhaps we need some special way to compute window functions that we will invoke in return_zero_rows? I am not sure if this makes sense.)
Comment by Sergei Petrunia [ 2017-12-27 ]

Proposed alternative fix:

diff --git a/sql/sql_select.cc b/sql/sql_select.cc
index 676b26e..e82e130 100644
--- a/sql/sql_select.cc
+++ b/sql/sql_select.cc
@@ -2654,7 +2654,15 @@ bool JOIN::make_aggr_tables_info()
         curr_tab->having= having;
         having->update_used_tables();
       }
-      curr_tab->distinct= true;
+      /*
+        We only need DISTINCT operation if the join is not degenerate.
+        If it is, we must not request DISTINCT processing, because
+        remove_duplicates() assumes there is a preceding computation step (and
+        in the degenerate join, there's none)
+      */
+      if (top_join_tab_count)
+        curr_tab->distinct= true;
+
       having= NULL;
       select_distinct= false;
     }

Comment by Varun Gupta (Inactive) [ 2017-12-28 ]

So the problem with using distinct is that it assumes that we have a previous join_tab but with the case of having an IMPOSSIBLE WHERE. In normal cases we don't use distinct if we have an IMPOSSIBLE where clause. With window functions we do the execution so we have to check for distinct too and then we start referring to a previous tab which is not valid.

 /* Number of tables actually joined at the top level */
uint exec_join_tab_cnt() { return tables_list ? top_join_tab_count : 0; }

curr_tab= join_tab + exec_join_tab_cnt();

So I agree with the patch psergey posted above.

Comment by Sergei Petrunia [ 2017-12-29 ]

Studying the question of what kinds of JOIN_TABs exist when remove_duplicates is used:

Example #1:

select distinct max(a) from t10 group by b;

join_tab[0]
  table = {t10}
  aggr= NULL
  distinct= false
 
join_tab[1]  (call it $join_tab_1)
  table= <ha_heap object, $work_table_1>
  aggr= pointer
  distinct= true

remove_duplicates() will be called from join_init_read_record($join_tab_1) call.
That is, remove_duplicates is a "preparation" operation for a join_tab.
The join_tab's "primary" read loop will read the de-duplicated rows (and pass them to query output in the above example).

Comment by Sergei Petrunia [ 2018-01-03 ]

Example #2

select distinct row_number() over (order by a ) from t1;

join_tab[0]
  table=t1
  distinct= false
  window_funcs_step= NULL
 
join_tab[1]
  table= <temp_table>
  window_funcs_step=<pointer>
  distinct= true

Window function computation step is also a "pre-read" operation.
When a join tab has both distinct=true and window_funcs_step, window_funcs_step is performed first.

Comment by Sergei Petrunia [ 2018-01-03 ]

Example #3:

select max(b), row_number() over (order by a ) from t1 where 1 = 2 ;

join_tab[0]
  // join->first_select = sub_select_postjoin_aggr
  table= <temp. table>
  window_funcs_step = <pointer>

The join_tab does not have a sub_select call. Execution starts directly with sub_select_postjoin_aggr call. Inside that call, a record is written into the temp. table. Then, window function computation step is performed.

select distinct max(b), row_number() over (order by a ) from t1 where 1 = 2 ;

has the same data structure as above but also join_tab[0].distinct=true.

Comment by Sergei Petrunia [ 2018-01-18 ]

http://lists.askmonty.org/pipermail/commits/2018-January/011841.html

Comment by Sergei Petrunia [ 2018-01-18 ]

As discussed on the previous optimizer call: Igor, please review.

Comment by Igor Babaev [ 2018-01-22 ]

Sergey,
It's OK to push your patch.

Comment by Sergei Petrunia [ 2018-01-22 ]

Fix pushed.

Generated at Thu Feb 08 08:04:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.