[MDEV-21554] Crash in JOIN_CACHE_BKAH::skip_index_tuple when mrr=on and join_cache_level=6+ Created: 2020-01-22  Updated: 2020-08-25  Resolved: 2020-03-20

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.3.22, 10.4.12, 10.5.1
Fix Version/s: 10.3.23, 10.4.13, 10.5.2

Type: Bug Priority: Major
Reporter: Valerii Kravchuk Assignee: Igor Babaev
Resolution: Fixed Votes: 3
Labels: crash, mrr, regression

Attachments: File mdev21544-fix-crash.diff    

 Description   

Crash with the following backtrace:

Version: '10.4.11-MariaDB-log'  socket: '/var/run/mysql/3302.sock'  port: 3302  MariaDB Server
200122  9:00:00 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.11-MariaDB-log
key_buffer_size=17179869184
read_buffer_size=131072
max_used_connections=158
max_threads=2050
thread_count=164
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 21288472 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fd8780009a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fdd48181cf0 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5647307a9ebe]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x5647302424cf]
sigaction.c:0(__restore_rt)[0x7fdd5121b630]
/usr/sbin/mysqld(_ZN15JOIN_CACHE_BKAH16skip_index_tupleEPv+0xb0)[0x56473014a2d0]
/usr/sbin/mysqld(+0x7ae35e)[0x56473014a35e]
/usr/sbin/mysqld(_ZN23Mrr_simple_index_reader8get_nextEPPv+0x47)[0x56473014d2b7]
/usr/sbin/mysqld(_ZN10DsMrr_impl10dsmrr_nextEPPv+0x42)[0x56473014e4f2]
/usr/sbin/mysqld(+0xddc910)[0x564730778910]
/usr/sbin/mysqld(+0xddddb3)[0x564730779db3]
/usr/sbin/mysqld(_ZN17JOIN_TAB_SCAN_MRR4nextEv+0x30)[0x564730145860]
/usr/sbin/mysqld(_ZN10JOIN_CACHE21join_matching_recordsEb+0xcd)[0x564730148a2d]
/usr/sbin/mysqld(_ZN10JOIN_CACHE12join_recordsEb+0xdc)[0x5647301484cc]
/usr/sbin/mysqld(_Z16sub_select_cacheP4JOINP13st_join_tableb+0x2e)[0x5647300717de]
/usr/sbin/mysqld(_ZN4JOIN10exec_innerEv+0xa86)[0x564730093636]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x33)[0x5647300939a3]
/usr/sbin/mysqld(_ZN18st_select_lex_unit4execEv+0xb34)[0x5647300d8d14]
/usr/sbin/mysqld(_Z11mysql_unionP3THDP3LEXP13select_resultP18st_select_lex_unitm+0x32)[0x5647300dbb22]
/usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x67)[0x5647300926a7]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x942e)[0x56473003d98e]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x3ad)[0x56473003e23d]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x217a)[0x5647300415fa]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x11c)[0x564730042f3c]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x56473011f24a]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x56473011f32d]
pthread_create.c:0(start_thread)[0x7fdd51213ea5]
/lib64/libc.so.6(clone+0x6d)[0x7fdd4f5b48cd]

happens for INSERT ... SELECT accessing several tables when these options are set:

set join_cache_level =8; -- or 6
set optimizer_switch='mrr=on';
set optimizer_switch='mrr_sort_keys=on';

It was not the case fro the same query and data in older 10.2.x and 10.3.x versions.



 Comments   
Comment by Igor Babaev [ 2020-01-24 ]

Analysis:

1.In the function ha_partition::multi_range_key_create_key() the code

m_mrr_range_current->key_multi_range.ptr= m_mrr_range_current

assigns the pointer m_mrr_range_current of the type PARTITION_KEY_MULTI_RANGE*
m_mrr_range_current->key_multi_range.ptr that is of the type void*

2. In the function partition_multi_range_key_next() the code

*range= partition_key_multi_range->key_multi_range;

assigns to range.ptr the value of partition_key_multi_range->key_multi_range.ptr that is not
the same as m_mrr_range_current->ptr and actually is taken from some other location of m_mrr_range_current.

3. Finally in the function Mrr_simple_index_reader::init() the call in the code

 while (!(res= file->handler::multi_range_read_next(range_info)))

brings us to execution of the code

*range_info= mrr_cur_range.ptr;

that returns some garbage from mrr_cur_range.ptr;

Comment by Sergei Petrunia [ 2020-01-26 ]

igor, I've looked again at the patch I was trying yesterday, and figured it had a basic error. After I fix it, the testcase no longer crashes and returns the expected number of rows.

I'm attaching the patch: mdev21544-fix-crash.diff . The patch also adds some comments.

The patch is missing:

  • the test coverage
  • AFAIU partition_multi_range_key_skip_record() must also convert the rowid. It receives the rowid of an individual partition (which it can find in hld->part_id). While the hld->partition->m_seq_if->skip_record function expects to see the rowid of the ha_partition (it starts with 2-byte partition_id and then continues with the rowid in the partition).
Comment by Igor Babaev [ 2020-02-01 ]

This is a setting that is simpler than the one used in the bug report and yet with this setting the bug still manifests itself though in a different way:

set join_cache_level=6;
set optimizer_switch='mrr=on';
set optimizer_switch='index_condition_pushdown=on';

(The last line is not necessary with default settings in 10.3).
With the above setting the offensive query does not crash, but rather returns an empty result set and this is not correct.

Here's a simple test case that demonstrates the problem:

set join_cache_level=6;
set optimizer_switch='mrr=on';
 
create table t0 (
tp int, a int, b int, c varchar(12), index idx (a,b)
) engine=myisam;
insert into t0 values
(1,3,30,'yyzy'), (1,3,30,'yxxyy'), (1,3,30,'yyxy'), (1,3,30,'xxyy'),
(1,3,30,'yyxz'), (1,3,30,'yyzz'), (1,3,30,'xxyzy'), (1,3,30,'yyyy'),
(1,3,30,'yzzy'), (1,93,30,'zzzy'),
(2,3,40,'yxx'), (2,4,40,'xx'), (2,3,10,'zxz'),
(2,3,40,'yyx'), (2,4,40,'xxx'), (2,3,10,'zyyz'),
(2,3,40,'xzzzz'), (2,4,40,'yyyxx'), (2,3,10,'zyz'),
(2,3,40,'xxx'), (2,4,40,'xx'), (2,3,10,'zzz'),
(2,3,40,'yyxzx'), (2,4,40,'xyx'), (2,3,10,'xzz'),
(2,3,40,'xxxzz'), (2,4,40,'xxz'), (2,3,10,'zzzy'),
(2,3,40,'zzxxx'), (2,4,40,'zxx'), (2,3,10,'yzzz'),
(2,3,40,'xyyxx'), (2,4,40,'xzzzx'), (2,3,10,'zzxxz'),
(2,3,40,'yzxxx'), (2,4,40,'xxzy'), (2,3,10,'zzzyx'),
(2,93,40,'xzx'), (2,94,40,'xz'), (2,93,10,'zyyyz'),
(3,4,30,'yx'), (3,4,30,'yyxxx'), (3,4,30,'zzyy'), (3,4,30,'zxyy'),
(3,4,30,'xxyy'), (3,4,30,'yyzx'),  (3,4,30,'zyyy'), (3,4,30,'yzy'),
(3,4,30,'zzzyy'), (3,94,30,'yyz');
 
create table t1 (
tp int, a int, b int, c varchar(12), index idx (a,b)
) engine=myisam
partition by list (tp)
( partition p1 values in (1),
  partition p2 values in (2),
  partition p3 values in (3));
insert into t1 select * from t0;
 
create table t2 (a int, index idx(a)) engine=myisam;
insert into t2 values (1), (2), (3), (4), (5);
insert into t2 select a+10 from t2;
insert into t2 select a+20 from t2;
 
analyze table t0,t1,t2;
 
select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1;
select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;

The tables t0 and t1 are of the same structure and populated with the same set of rows. The only difference is that t1 is partitioned.
We see that the result set for the second query is empty:

MariaDB [test]> select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
Empty set

and that the execution employs index condition pushdown

MariaDB [test]> explain extended select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
+------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra                                                                              |
+------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
|    1 | SIMPLE      | t2    | range | idx           | idx  | 5       | NULL      |    2 |   100.00 | Using where; Using index                                                           |
|    1 | SIMPLE      | t1    | ref   | idx           | idx  | 5       | test.t2.a |    5 |   100.00 | Using index condition(BKA); Using join buffer (flat, BKA join); Rowid-ordered scan |
+------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+

At the same time for the first query the result is correct:

MariaDB [test]> select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1;
+------+------+------+-------+------+
| tp   | a    | b    | c     | a    |
+------+------+------+-------+------+
|    3 |    4 |   30 | yx    |    4 |
|    3 |    4 |   30 | yyxxx |    4 |
|    3 |    4 |   30 | zzyy  |    4 |
|    3 |    4 |   30 | zxyy  |    4 |
|    3 |    4 |   30 | xxyy  |    4 |
|    3 |    4 |   30 | yyzx  |    4 |
|    3 |    4 |   30 | zyyy  |    4 |
|    3 |    4 |   30 | yzy   |    4 |
|    3 |    4 |   30 | zzzyy |    4 |
+------+------+------+-------+------+
9 rows in set 

and to get this result the executor uses index condition pushdown as well

MariaDB [test]> explain extended select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1;
+------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra                                                                              |
+------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
|    1 | SIMPLE      | t2    | range | idx           | idx  | 5       | NULL      |    2 |   100.00 | Using where; Using index                                                           |
|    1 | SIMPLE      | t0    | ref   | idx           | idx  | 5       | test.t2.a |    13 |   100.00 | Using index condition(BKA); Using join buffer (flat, BKA join); Rowid-ordered scan |
+------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+

The correct result set for the second query we get with the setting

set optimizer_switch='index_condition_pushdown=off';

MariaDB [test]> set optimizer_switch='index_condition_pushdown=off';
 
 MariaDB [test]> select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
+------+------+------+-------+------+
| tp   | a    | b    | c     | a    |
+------+------+------+-------+------+
|    3 |    4 |   30 | yx    |    4 |
|    3 |    4 |   30 | yyxxx |    4 |
|    3 |    4 |   30 | zzyy  |    4 |
|    3 |    4 |   30 | zxyy  |    4 |
|    3 |    4 |   30 | xxyy  |    4 |
|    3 |    4 |   30 | yyzx  |    4 |
|    3 |    4 |   30 | zyyy  |    4 |
|    3 |    4 |   30 | yzy   |    4 |
|    3 |    4 |   30 | zzzyy |    4 |
+------+------+------+-------+------+
9 rows in set 

Comment by Sergei Petrunia [ 2020-02-04 ]

igor, on the question of whether range_info in the sequence of ranges passed from ha_partition to individual partitions could be the same as range_info that SQL layer has passed to ha_partition:

Here are the places where ha_partition makes use of range_info passed to individual partitions. (I'll explain the reasoning below)

in ha_partition::handle_ordered_index_scan:

      error= file->multi_range_read_next(&m_range_info[i]);
      ...
      m_stock_range_seq[i]= (((PARTITION_KEY_MULTI_RANGE *)
                              m_range_info[i])->id);

in ha_partition::handle_ordered_next:

    error= file->multi_range_read_next(&m_range_info[part_id]);
    ...
    if (((PARTITION_KEY_MULTI_RANGE *) m_range_info[part_id])->id !=
        m_mrr_range_current->id)

The idea here is to get the number of the range we've got the record from. This is used in "Ordered MRR scans".

An explanation about how "Ordered MRR scans" are done:

Ordered MRR scan an MRR scan done with HA_MRR_SORTED flag. it has these properties:

  • The ranges come in key order
  • The rows must also be returned in key order.

ha_partition supports Ordered MRR scans, if the underlying partitions do.

It does it as follows:

  • Start the (sorted) MRR scans on all partitions, get the first row for each partition.
  • for each partition: check which range the first row is from, and determine the "earliest" range that we've got records from.

Example: Suppose MRR scan is done on sequence of ranges: { key=1, key=2, key=3, key=4}
 
- partition p1 returns a row with key=2 
- partition p2 returns a row with key=4
- partition p3 returns a row with key=2

Here, the "earliest" range is the one with key=2, and the partitions that have it are p1 and p2.

  • Collect a set of partitions that produce rows from the "earliest" range.
  • Continue to merge rows from them. As soon as partition returns a row from a range that's greater than the "earliest", stop reading rows from it.
  • This continues until we've finished reading all rows from the "earliest" range.
  • Then the process repeats for the next range.
Comment by Sergei Petrunia [ 2020-02-04 ]

Now, one may ask a question, why can't we just merge the ordered streams?

The only explanation I have is that the above code will do less key comparisons.
Consider a case of two ranges: key=1, key=2 and two partitions, p1 and p2.
Suppose that only partition p1 has rows for key=1. ha_partition will continue reading the rows from p1 and will avoid comparing each of them with the first row that we got from partition p2. It will compare the range_id instead, which should be faster.

Comment by Sergei Petrunia [ 2020-02-04 ]

I'll commit a patch shortly with comments for ha_partition class members.

Comment by Igor Babaev [ 2020-02-25 ]

The bug was introduced in the commit 8eeb689e9fc57afe19a8dbff354b5f9f167867a9:
Adding multi_range_read support to partitions.
This patch allowed to use BKA+MRR to access partition tables. Before the patch such access to partition tables was not supported at all.

The bug is in the code of the function partition_multi_range_key_skip_index_tuple().
A similar bug can be seen in the function partition_multi_range_key_skip_record().

Here's how we can reproduce the bug in partition_multi_range_key_skip_record().
Create table t0 similar to that used to demonstrate the first bug

create table t0 (tp int, a int, b int not null, c varchar(12), index idx (a,b));

and populate it with the same data as in the previous test case.
Now create and populate table t1 with the following stements:

create table t1 (
  tp int, a int, b int not null, c varchar(12), index idx (a,b)
)
partition by list (tp)
( partition p1 values in (1),
  partition p2 values in (2),
  partition p3 values in (3));
insert into t1 select * from t0;

Use the following table t2:

create table t2 (a int, index idx(a));
insert into t2 values (1), (2), (3), (4), (5);
insert into t2 select a+10 from t2;
insert into t2 select a+20 from t2;
insert into t2 values (3), (4), (5);

After executing the statement

analyze table t0,t1,t2;

and changing the setting to use BKA+MRR

set join_cache_level=6;
set optimizer_switch='mrr=on';

run the queries

select * from t2 left join t0 on t2.a=t0.a where t2.a in (3,4) and t0.b is null;
select * from t2 left join t1 on t2.a=t1.a where t2.a in (3,4) and t1.b is null;

For the first query you have an empty result set and this is expected:

MariaDB [test]> select * from t2 left join t0 on t2.a=t0.a where t2.a in (3,4) and t0.b is null;
Empty set (0.004 sec)e

For the second query you have the result set

MariaDB [test]> select * from t2 left join t1 on t2.a=t1.a where t2.a in (3,4) and t1.b is null;
+------+------+------+------+------+
| a    | tp   | a    | b    | c    |
+------+------+------+------+------+
|    3 | NULL | NULL | NULL | NULL |
+------+------+------+------+------+
1 row in set (0.020 sec)

and this is incorrect.

Comment by Igor Babaev [ 2020-03-20 ]

A fix for this bug was pushed into 10.3.

Generated at Thu Feb 08 09:08:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.