Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21554

Crash in JOIN_CACHE_BKAH::skip_index_tuple when mrr=on and join_cache_level=6+

Details

    Description

      Crash with the following backtrace:

      Version: '10.4.11-MariaDB-log'  socket: '/var/run/mysql/3302.sock'  port: 3302  MariaDB Server
      200122  9:00:00 [ERROR] mysqld got signal 11 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.4.11-MariaDB-log
      key_buffer_size=17179869184
      read_buffer_size=131072
      max_used_connections=158
      max_threads=2050
      thread_count=164
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 21288472 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x7fd8780009a8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7fdd48181cf0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5647307a9ebe]
      /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x5647302424cf]
      sigaction.c:0(__restore_rt)[0x7fdd5121b630]
      /usr/sbin/mysqld(_ZN15JOIN_CACHE_BKAH16skip_index_tupleEPv+0xb0)[0x56473014a2d0]
      /usr/sbin/mysqld(+0x7ae35e)[0x56473014a35e]
      /usr/sbin/mysqld(_ZN23Mrr_simple_index_reader8get_nextEPPv+0x47)[0x56473014d2b7]
      /usr/sbin/mysqld(_ZN10DsMrr_impl10dsmrr_nextEPPv+0x42)[0x56473014e4f2]
      /usr/sbin/mysqld(+0xddc910)[0x564730778910]
      /usr/sbin/mysqld(+0xddddb3)[0x564730779db3]
      /usr/sbin/mysqld(_ZN17JOIN_TAB_SCAN_MRR4nextEv+0x30)[0x564730145860]
      /usr/sbin/mysqld(_ZN10JOIN_CACHE21join_matching_recordsEb+0xcd)[0x564730148a2d]
      /usr/sbin/mysqld(_ZN10JOIN_CACHE12join_recordsEb+0xdc)[0x5647301484cc]
      /usr/sbin/mysqld(_Z16sub_select_cacheP4JOINP13st_join_tableb+0x2e)[0x5647300717de]
      /usr/sbin/mysqld(_ZN4JOIN10exec_innerEv+0xa86)[0x564730093636]
      /usr/sbin/mysqld(_ZN4JOIN4execEv+0x33)[0x5647300939a3]
      /usr/sbin/mysqld(_ZN18st_select_lex_unit4execEv+0xb34)[0x5647300d8d14]
      /usr/sbin/mysqld(_Z11mysql_unionP3THDP3LEXP13select_resultP18st_select_lex_unitm+0x32)[0x5647300dbb22]
      /usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x67)[0x5647300926a7]
      /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x942e)[0x56473003d98e]
      /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x3ad)[0x56473003e23d]
      /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x217a)[0x5647300415fa]
      /usr/sbin/mysqld(_Z10do_commandP3THD+0x11c)[0x564730042f3c]
      /usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x56473011f24a]
      /usr/sbin/mysqld(handle_one_connection+0x3d)[0x56473011f32d]
      pthread_create.c:0(start_thread)[0x7fdd51213ea5]
      /lib64/libc.so.6(clone+0x6d)[0x7fdd4f5b48cd]
      

      happens for INSERT ... SELECT accessing several tables when these options are set:

      set join_cache_level =8; -- or 6
      set optimizer_switch='mrr=on';
      set optimizer_switch='mrr_sort_keys=on';
      

      It was not the case fro the same query and data in older 10.2.x and 10.3.x versions.

      Attachments

        Activity

          Analysis:

          1.In the function ha_partition::multi_range_key_create_key() the code

          m_mrr_range_current->key_multi_range.ptr= m_mrr_range_current
          

          assigns the pointer m_mrr_range_current of the type PARTITION_KEY_MULTI_RANGE*
          m_mrr_range_current->key_multi_range.ptr that is of the type void*

          2. In the function partition_multi_range_key_next() the code

          *range= partition_key_multi_range->key_multi_range;
          

          assigns to range.ptr the value of partition_key_multi_range->key_multi_range.ptr that is not
          the same as m_mrr_range_current->ptr and actually is taken from some other location of m_mrr_range_current.

          3. Finally in the function Mrr_simple_index_reader::init() the call in the code

           while (!(res= file->handler::multi_range_read_next(range_info)))
          

          brings us to execution of the code

          *range_info= mrr_cur_range.ptr;
          

          that returns some garbage from mrr_cur_range.ptr;

          igor Igor Babaev (Inactive) added a comment - Analysis: 1.In the function ha_partition::multi_range_key_create_key() the code m_mrr_range_current->key_multi_range.ptr= m_mrr_range_current assigns the pointer m_mrr_range_current of the type PARTITION_KEY_MULTI_RANGE* m_mrr_range_current->key_multi_range.ptr that is of the type void* 2. In the function partition_multi_range_key_next() the code *range= partition_key_multi_range->key_multi_range; assigns to range.ptr the value of partition_key_multi_range->key_multi_range.ptr that is not the same as m_mrr_range_current->ptr and actually is taken from some other location of m_mrr_range_current. 3. Finally in the function Mrr_simple_index_reader::init() the call in the code while (!(res= file->handler::multi_range_read_next(range_info))) brings us to execution of the code *range_info= mrr_cur_range.ptr; that returns some garbage from mrr_cur_range.ptr;

          igor, I've looked again at the patch I was trying yesterday, and figured it had a basic error. After I fix it, the testcase no longer crashes and returns the expected number of rows.

          I'm attaching the patch: mdev21544-fix-crash.diff . The patch also adds some comments.

          The patch is missing:

          • the test coverage
          • AFAIU partition_multi_range_key_skip_record() must also convert the rowid. It receives the rowid of an individual partition (which it can find in hld->part_id). While the hld->partition->m_seq_if->skip_record function expects to see the rowid of the ha_partition (it starts with 2-byte partition_id and then continues with the rowid in the partition).
          psergei Sergei Petrunia added a comment - igor , I've looked again at the patch I was trying yesterday, and figured it had a basic error. After I fix it, the testcase no longer crashes and returns the expected number of rows. I'm attaching the patch: mdev21544-fix-crash.diff . The patch also adds some comments. The patch is missing: the test coverage AFAIU partition_multi_range_key_skip_record() must also convert the rowid. It receives the rowid of an individual partition (which it can find in hld->part_id ). While the hld->partition->m_seq_if->skip_record function expects to see the rowid of the ha_partition (it starts with 2-byte partition_id and then continues with the rowid in the partition).
          igor Igor Babaev (Inactive) added a comment - - edited

          This is a setting that is simpler than the one used in the bug report and yet with this setting the bug still manifests itself though in a different way:

          set join_cache_level=6;
          set optimizer_switch='mrr=on';
          set optimizer_switch='index_condition_pushdown=on';
          

          (The last line is not necessary with default settings in 10.3).
          With the above setting the offensive query does not crash, but rather returns an empty result set and this is not correct.

          Here's a simple test case that demonstrates the problem:

          set join_cache_level=6;
          set optimizer_switch='mrr=on';
           
          create table t0 (
          tp int, a int, b int, c varchar(12), index idx (a,b)
          ) engine=myisam;
          insert into t0 values
          (1,3,30,'yyzy'), (1,3,30,'yxxyy'), (1,3,30,'yyxy'), (1,3,30,'xxyy'),
          (1,3,30,'yyxz'), (1,3,30,'yyzz'), (1,3,30,'xxyzy'), (1,3,30,'yyyy'),
          (1,3,30,'yzzy'), (1,93,30,'zzzy'),
          (2,3,40,'yxx'), (2,4,40,'xx'), (2,3,10,'zxz'),
          (2,3,40,'yyx'), (2,4,40,'xxx'), (2,3,10,'zyyz'),
          (2,3,40,'xzzzz'), (2,4,40,'yyyxx'), (2,3,10,'zyz'),
          (2,3,40,'xxx'), (2,4,40,'xx'), (2,3,10,'zzz'),
          (2,3,40,'yyxzx'), (2,4,40,'xyx'), (2,3,10,'xzz'),
          (2,3,40,'xxxzz'), (2,4,40,'xxz'), (2,3,10,'zzzy'),
          (2,3,40,'zzxxx'), (2,4,40,'zxx'), (2,3,10,'yzzz'),
          (2,3,40,'xyyxx'), (2,4,40,'xzzzx'), (2,3,10,'zzxxz'),
          (2,3,40,'yzxxx'), (2,4,40,'xxzy'), (2,3,10,'zzzyx'),
          (2,93,40,'xzx'), (2,94,40,'xz'), (2,93,10,'zyyyz'),
          (3,4,30,'yx'), (3,4,30,'yyxxx'), (3,4,30,'zzyy'), (3,4,30,'zxyy'),
          (3,4,30,'xxyy'), (3,4,30,'yyzx'),  (3,4,30,'zyyy'), (3,4,30,'yzy'),
          (3,4,30,'zzzyy'), (3,94,30,'yyz');
           
          create table t1 (
          tp int, a int, b int, c varchar(12), index idx (a,b)
          ) engine=myisam
          partition by list (tp)
          ( partition p1 values in (1),
            partition p2 values in (2),
            partition p3 values in (3));
          insert into t1 select * from t0;
           
          create table t2 (a int, index idx(a)) engine=myisam;
          insert into t2 values (1), (2), (3), (4), (5);
          insert into t2 select a+10 from t2;
          insert into t2 select a+20 from t2;
           
          analyze table t0,t1,t2;
           
          select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1;
          select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
          

          The tables t0 and t1 are of the same structure and populated with the same set of rows. The only difference is that t1 is partitioned.
          We see that the result set for the second query is empty:

          MariaDB [test]> select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
          Empty set
          

          and that the execution employs index condition pushdown

          MariaDB [test]> explain extended select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
          +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
          | id   | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra                                                                              |
          +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
          |    1 | SIMPLE      | t2    | range | idx           | idx  | 5       | NULL      |    2 |   100.00 | Using where; Using index                                                           |
          |    1 | SIMPLE      | t1    | ref   | idx           | idx  | 5       | test.t2.a |    5 |   100.00 | Using index condition(BKA); Using join buffer (flat, BKA join); Rowid-ordered scan |
          +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
          

          At the same time for the first query the result is correct:

          MariaDB [test]> select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1;
          +------+------+------+-------+------+
          | tp   | a    | b    | c     | a    |
          +------+------+------+-------+------+
          |    3 |    4 |   30 | yx    |    4 |
          |    3 |    4 |   30 | yyxxx |    4 |
          |    3 |    4 |   30 | zzyy  |    4 |
          |    3 |    4 |   30 | zxyy  |    4 |
          |    3 |    4 |   30 | xxyy  |    4 |
          |    3 |    4 |   30 | yyzx  |    4 |
          |    3 |    4 |   30 | zyyy  |    4 |
          |    3 |    4 |   30 | yzy   |    4 |
          |    3 |    4 |   30 | zzzyy |    4 |
          +------+------+------+-------+------+
          9 rows in set 
          

          and to get this result the executor uses index condition pushdown as well

          MariaDB [test]> explain extended select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1;
          +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
          | id   | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra                                                                              |
          +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
          |    1 | SIMPLE      | t2    | range | idx           | idx  | 5       | NULL      |    2 |   100.00 | Using where; Using index                                                           |
          |    1 | SIMPLE      | t0    | ref   | idx           | idx  | 5       | test.t2.a |    13 |   100.00 | Using index condition(BKA); Using join buffer (flat, BKA join); Rowid-ordered scan |
          +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+
          

          The correct result set for the second query we get with the setting

          set optimizer_switch='index_condition_pushdown=off';
          

          MariaDB [test]> set optimizer_switch='index_condition_pushdown=off';
           
           MariaDB [test]> select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1;
          +------+------+------+-------+------+
          | tp   | a    | b    | c     | a    |
          +------+------+------+-------+------+
          |    3 |    4 |   30 | yx    |    4 |
          |    3 |    4 |   30 | yyxxx |    4 |
          |    3 |    4 |   30 | zzyy  |    4 |
          |    3 |    4 |   30 | zxyy  |    4 |
          |    3 |    4 |   30 | xxyy  |    4 |
          |    3 |    4 |   30 | yyzx  |    4 |
          |    3 |    4 |   30 | zyyy  |    4 |
          |    3 |    4 |   30 | yzy   |    4 |
          |    3 |    4 |   30 | zzzyy |    4 |
          +------+------+------+-------+------+
          9 rows in set 
          

          igor Igor Babaev (Inactive) added a comment - - edited This is a setting that is simpler than the one used in the bug report and yet with this setting the bug still manifests itself though in a different way: set join_cache_level=6; set optimizer_switch= 'mrr=on' ; set optimizer_switch= 'index_condition_pushdown=on' ; (The last line is not necessary with default settings in 10.3). With the above setting the offensive query does not crash, but rather returns an empty result set and this is not correct. Here's a simple test case that demonstrates the problem: set join_cache_level=6; set optimizer_switch= 'mrr=on' ;   create table t0 ( tp int , a int , b int , c varchar (12), index idx (a,b) ) engine=myisam; insert into t0 values (1,3,30, 'yyzy' ), (1,3,30, 'yxxyy' ), (1,3,30, 'yyxy' ), (1,3,30, 'xxyy' ), (1,3,30, 'yyxz' ), (1,3,30, 'yyzz' ), (1,3,30, 'xxyzy' ), (1,3,30, 'yyyy' ), (1,3,30, 'yzzy' ), (1,93,30, 'zzzy' ), (2,3,40, 'yxx' ), (2,4,40, 'xx' ), (2,3,10, 'zxz' ), (2,3,40, 'yyx' ), (2,4,40, 'xxx' ), (2,3,10, 'zyyz' ), (2,3,40, 'xzzzz' ), (2,4,40, 'yyyxx' ), (2,3,10, 'zyz' ), (2,3,40, 'xxx' ), (2,4,40, 'xx' ), (2,3,10, 'zzz' ), (2,3,40, 'yyxzx' ), (2,4,40, 'xyx' ), (2,3,10, 'xzz' ), (2,3,40, 'xxxzz' ), (2,4,40, 'xxz' ), (2,3,10, 'zzzy' ), (2,3,40, 'zzxxx' ), (2,4,40, 'zxx' ), (2,3,10, 'yzzz' ), (2,3,40, 'xyyxx' ), (2,4,40, 'xzzzx' ), (2,3,10, 'zzxxz' ), (2,3,40, 'yzxxx' ), (2,4,40, 'xxzy' ), (2,3,10, 'zzzyx' ), (2,93,40, 'xzx' ), (2,94,40, 'xz' ), (2,93,10, 'zyyyz' ), (3,4,30, 'yx' ), (3,4,30, 'yyxxx' ), (3,4,30, 'zzyy' ), (3,4,30, 'zxyy' ), (3,4,30, 'xxyy' ), (3,4,30, 'yyzx' ), (3,4,30, 'zyyy' ), (3,4,30, 'yzy' ), (3,4,30, 'zzzyy' ), (3,94,30, 'yyz' );   create table t1 ( tp int , a int , b int , c varchar (12), index idx (a,b) ) engine=myisam partition by list (tp) ( partition p1 values in (1), partition p2 values in (2), partition p3 values in (3)); insert into t1 select * from t0;   create table t2 (a int , index idx(a)) engine=myisam; insert into t2 values (1), (2), (3), (4), (5); insert into t2 select a+10 from t2; insert into t2 select a+20 from t2;   analyze table t0,t1,t2;   select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1; select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1; The tables t0 and t1 are of the same structure and populated with the same set of rows. The only difference is that t1 is partitioned. We see that the result set for the second query is empty: MariaDB [test]> select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1; Empty set and that the execution employs index condition pushdown MariaDB [test]> explain extended select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1; +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+ | 1 | SIMPLE | t2 | range | idx | idx | 5 | NULL | 2 | 100.00 | Using where; Using index | | 1 | SIMPLE | t1 | ref | idx | idx | 5 | test.t2.a | 5 | 100.00 | Using index condition(BKA); Using join buffer (flat, BKA join); Rowid-ordered scan | +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+ At the same time for the first query the result is correct: MariaDB [test]> select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1; +------+------+------+-------+------+ | tp | a | b | c | a | +------+------+------+-------+------+ | 3 | 4 | 30 | yx | 4 | | 3 | 4 | 30 | yyxxx | 4 | | 3 | 4 | 30 | zzyy | 4 | | 3 | 4 | 30 | zxyy | 4 | | 3 | 4 | 30 | xxyy | 4 | | 3 | 4 | 30 | yyzx | 4 | | 3 | 4 | 30 | zyyy | 4 | | 3 | 4 | 30 | yzy | 4 | | 3 | 4 | 30 | zzzyy | 4 | +------+------+------+-------+------+ 9 rows in set and to get this result the executor uses index condition pushdown as well MariaDB [test]> explain extended select * from t0,t2 where t2.a in (3,4) and t0.a=t2.a and (t0.b / 10) = t2.a-1; +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+ | 1 | SIMPLE | t2 | range | idx | idx | 5 | NULL | 2 | 100.00 | Using where; Using index | | 1 | SIMPLE | t0 | ref | idx | idx | 5 | test.t2.a | 13 | 100.00 | Using index condition(BKA); Using join buffer (flat, BKA join); Rowid-ordered scan | +------+-------------+-------+-------+---------------+------+---------+-----------+------+----------+------------------------------------------------------------------------------------+ The correct result set for the second query we get with the setting set optimizer_switch= 'index_condition_pushdown=off' ; MariaDB [test]> set optimizer_switch='index_condition_pushdown=off';   MariaDB [test]> select * from t1,t2 where t2.a in (3,4) and t1.a=t2.a and (t1.b / 10) = t2.a-1; +------+------+------+-------+------+ | tp | a | b | c | a | +------+------+------+-------+------+ | 3 | 4 | 30 | yx | 4 | | 3 | 4 | 30 | yyxxx | 4 | | 3 | 4 | 30 | zzyy | 4 | | 3 | 4 | 30 | zxyy | 4 | | 3 | 4 | 30 | xxyy | 4 | | 3 | 4 | 30 | yyzx | 4 | | 3 | 4 | 30 | zyyy | 4 | | 3 | 4 | 30 | yzy | 4 | | 3 | 4 | 30 | zzzyy | 4 | +------+------+------+-------+------+ 9 rows in set

          igor, on the question of whether range_info in the sequence of ranges passed from ha_partition to individual partitions could be the same as range_info that SQL layer has passed to ha_partition:

          Here are the places where ha_partition makes use of range_info passed to individual partitions. (I'll explain the reasoning below)

          in ha_partition::handle_ordered_index_scan:

                error= file->multi_range_read_next(&m_range_info[i]);
                ...
                m_stock_range_seq[i]= (((PARTITION_KEY_MULTI_RANGE *)
                                        m_range_info[i])->id);
          

          in ha_partition::handle_ordered_next:

              error= file->multi_range_read_next(&m_range_info[part_id]);
              ...
              if (((PARTITION_KEY_MULTI_RANGE *) m_range_info[part_id])->id !=
                  m_mrr_range_current->id)
          

          The idea here is to get the number of the range we've got the record from. This is used in "Ordered MRR scans".

          An explanation about how "Ordered MRR scans" are done:

          Ordered MRR scan an MRR scan done with HA_MRR_SORTED flag. it has these properties:

          • The ranges come in key order
          • The rows must also be returned in key order.

          ha_partition supports Ordered MRR scans, if the underlying partitions do.

          It does it as follows:

          • Start the (sorted) MRR scans on all partitions, get the first row for each partition.
          • for each partition: check which range the first row is from, and determine the "earliest" range that we've got records from.

          Example: Suppose MRR scan is done on sequence of ranges: { key=1, key=2, key=3, key=4}
           
          - partition p1 returns a row with key=2 
          - partition p2 returns a row with key=4
          - partition p3 returns a row with key=2
          

          Here, the "earliest" range is the one with key=2, and the partitions that have it are p1 and p2.

          • Collect a set of partitions that produce rows from the "earliest" range.
          • Continue to merge rows from them. As soon as partition returns a row from a range that's greater than the "earliest", stop reading rows from it.
          • This continues until we've finished reading all rows from the "earliest" range.
          • Then the process repeats for the next range.
          psergei Sergei Petrunia added a comment - igor , on the question of whether range_info in the sequence of ranges passed from ha_partition to individual partitions could be the same as range_info that SQL layer has passed to ha_partition: Here are the places where ha_partition makes use of range_info passed to individual partitions. (I'll explain the reasoning below) in ha_partition::handle_ordered_index_scan: error= file->multi_range_read_next(&m_range_info[i]); ... m_stock_range_seq[i]= (((PARTITION_KEY_MULTI_RANGE *) m_range_info[i])->id); in ha_partition::handle_ordered_next: error= file->multi_range_read_next(&m_range_info[part_id]); ... if (((PARTITION_KEY_MULTI_RANGE *) m_range_info[part_id])->id != m_mrr_range_current->id) The idea here is to get the number of the range we've got the record from. This is used in "Ordered MRR scans". An explanation about how "Ordered MRR scans" are done: Ordered MRR scan an MRR scan done with HA_MRR_SORTED flag. it has these properties: The ranges come in key order The rows must also be returned in key order. ha_partition supports Ordered MRR scans, if the underlying partitions do. It does it as follows: Start the (sorted) MRR scans on all partitions, get the first row for each partition. for each partition: check which range the first row is from, and determine the "earliest" range that we've got records from. Example: Suppose MRR scan is done on sequence of ranges: { key=1, key=2, key=3, key=4}   - partition p1 returns a row with key=2 - partition p2 returns a row with key=4 - partition p3 returns a row with key=2 Here, the "earliest" range is the one with key=2, and the partitions that have it are p1 and p2. Collect a set of partitions that produce rows from the "earliest" range. Continue to merge rows from them. As soon as partition returns a row from a range that's greater than the "earliest", stop reading rows from it. This continues until we've finished reading all rows from the "earliest" range. Then the process repeats for the next range.

          Now, one may ask a question, why can't we just merge the ordered streams?

          The only explanation I have is that the above code will do less key comparisons.
          Consider a case of two ranges: key=1, key=2 and two partitions, p1 and p2.
          Suppose that only partition p1 has rows for key=1. ha_partition will continue reading the rows from p1 and will avoid comparing each of them with the first row that we got from partition p2. It will compare the range_id instead, which should be faster.

          psergei Sergei Petrunia added a comment - Now, one may ask a question, why can't we just merge the ordered streams? The only explanation I have is that the above code will do less key comparisons. Consider a case of two ranges: key=1, key=2 and two partitions, p1 and p2. Suppose that only partition p1 has rows for key=1. ha_partition will continue reading the rows from p1 and will avoid comparing each of them with the first row that we got from partition p2. It will compare the range_id instead, which should be faster.

          I'll commit a patch shortly with comments for ha_partition class members.

          psergei Sergei Petrunia added a comment - I'll commit a patch shortly with comments for ha_partition class members.
          igor Igor Babaev (Inactive) added a comment - - edited

          The bug was introduced in the commit 8eeb689e9fc57afe19a8dbff354b5f9f167867a9:
          Adding multi_range_read support to partitions.
          This patch allowed to use BKA+MRR to access partition tables. Before the patch such access to partition tables was not supported at all.

          The bug is in the code of the function partition_multi_range_key_skip_index_tuple().
          A similar bug can be seen in the function partition_multi_range_key_skip_record().

          Here's how we can reproduce the bug in partition_multi_range_key_skip_record().
          Create table t0 similar to that used to demonstrate the first bug

          create table t0 (tp int, a int, b int not null, c varchar(12), index idx (a,b));
          

          and populate it with the same data as in the previous test case.
          Now create and populate table t1 with the following stements:

          create table t1 (
            tp int, a int, b int not null, c varchar(12), index idx (a,b)
          )
          partition by list (tp)
          ( partition p1 values in (1),
            partition p2 values in (2),
            partition p3 values in (3));
          insert into t1 select * from t0;
          

          Use the following table t2:

          create table t2 (a int, index idx(a));
          insert into t2 values (1), (2), (3), (4), (5);
          insert into t2 select a+10 from t2;
          insert into t2 select a+20 from t2;
          insert into t2 values (3), (4), (5);
          

          After executing the statement

          analyze table t0,t1,t2;
          

          and changing the setting to use BKA+MRR

          set join_cache_level=6;
          set optimizer_switch='mrr=on';
          

          run the queries

          select * from t2 left join t0 on t2.a=t0.a where t2.a in (3,4) and t0.b is null;
          select * from t2 left join t1 on t2.a=t1.a where t2.a in (3,4) and t1.b is null;
          

          For the first query you have an empty result set and this is expected:

          MariaDB [test]> select * from t2 left join t0 on t2.a=t0.a where t2.a in (3,4) and t0.b is null;
          Empty set (0.004 sec)e
          

          For the second query you have the result set

          MariaDB [test]> select * from t2 left join t1 on t2.a=t1.a where t2.a in (3,4) and t1.b is null;
          +------+------+------+------+------+
          | a    | tp   | a    | b    | c    |
          +------+------+------+------+------+
          |    3 | NULL | NULL | NULL | NULL |
          +------+------+------+------+------+
          1 row in set (0.020 sec)
          

          and this is incorrect.

          igor Igor Babaev (Inactive) added a comment - - edited The bug was introduced in the commit 8eeb689e9fc57afe19a8dbff354b5f9f167867a9: Adding multi_range_read support to partitions. This patch allowed to use BKA+MRR to access partition tables. Before the patch such access to partition tables was not supported at all. The bug is in the code of the function partition_multi_range_key_skip_index_tuple(). A similar bug can be seen in the function partition_multi_range_key_skip_record(). Here's how we can reproduce the bug in partition_multi_range_key_skip_record(). Create table t0 similar to that used to demonstrate the first bug create table t0 (tp int , a int , b int not null , c varchar (12), index idx (a,b)); and populate it with the same data as in the previous test case. Now create and populate table t1 with the following stements: create table t1 ( tp int , a int , b int not null , c varchar (12), index idx (a,b) ) partition by list (tp) ( partition p1 values in (1), partition p2 values in (2), partition p3 values in (3)); insert into t1 select * from t0; Use the following table t2: create table t2 (a int , index idx(a)); insert into t2 values (1), (2), (3), (4), (5); insert into t2 select a+10 from t2; insert into t2 select a+20 from t2; insert into t2 values (3), (4), (5); After executing the statement analyze table t0,t1,t2; and changing the setting to use BKA+MRR set join_cache_level=6; set optimizer_switch= 'mrr=on' ; run the queries select * from t2 left join t0 on t2.a=t0.a where t2.a in (3,4) and t0.b is null ; select * from t2 left join t1 on t2.a=t1.a where t2.a in (3,4) and t1.b is null ; For the first query you have an empty result set and this is expected: MariaDB [test]> select * from t2 left join t0 on t2.a=t0.a where t2.a in (3,4) and t0.b is null; Empty set (0.004 sec)e For the second query you have the result set MariaDB [test]> select * from t2 left join t1 on t2.a=t1.a where t2.a in (3,4) and t1.b is null; +------+------+------+------+------+ | a | tp | a | b | c | +------+------+------+------+------+ | 3 | NULL | NULL | NULL | NULL | +------+------+------+------+------+ 1 row in set (0.020 sec) and this is incorrect.

          A fix for this bug was pushed into 10.3.

          igor Igor Babaev (Inactive) added a comment - A fix for this bug was pushed into 10.3.

          People

            igor Igor Babaev (Inactive)
            valerii Valerii Kravchuk
            Votes:
            3 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.