Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-10927

Crash When Using sort_union Optimization

Details

    • 5.5.54

    Description

      Our servers running MariaDB 10.1.17 in a galera cluster experienced a crash relating to the index merge sort_union optimization. The error caused all three of ou servers crash because the query, after failing on the first server due to the crash, was passed to the next server causing it to crash. I have disable the optimization using `optimizer_switch="index_merge_sort_union=off"` since the error continues to occur without it.

      The crash occurs when the following query is run:

      SELECT SQL_NO_CACHE contact_id FROM users_assignment FORCE INDEX (user_id,company_id) WHERE user_id = '57291' OR (user_id IN(55838,55839,56998,57260,57291,60227,121997,137571,173600,219233,306480,354991,358594,376261,398572,447223,472009,646282,932548,993011,1004609,1021262,1177117,1234288,1283121,1288749,1337771,57291) AND role_id = '1194') OR company_id = '2321'
      

      I have attached the relevant log excerpts from the servers as well as some output from running mysqld in gdb and a SQL file that creates a table with data against which running the sample query above crashes.

      EXPLAIN EXTENDED of the query above:

      +------+-------------+----------------------------+-------------+--------------------+--------------------+---------+------+------+----------+---------------------------------------------------+
      | id   | select_type | table                      | type        | possible_keys      | key                | key_len | ref  | rows | filtered | Extra                                             |
      +------+-------------+----------------------------+-------------+--------------------+--------------------+---------+------+------+----------+---------------------------------------------------+
      |    1 | SIMPLE      | users_assignment_brad_test | index_merge | user_id,company_id | user_id,company_id | 4,4     | NULL | 8193 |   100.00 | Using sort_union(user_id,company_id); Using where |
      +------+-------------+----------------------------+-------------+--------------------+--------------------+---------+------+------+----------+---------------------------------------------------+
      

      Attachments

        1. db1.txt
          5 kB
        2. db2.txt
          1 kB
        3. db3.txt
          5 kB
        4. gdb.txt
          6 kB
        5. index_merge_crash_table.sql
          374 kB
        6. my.cnf
          4 kB
        7. my2.cnf
          4 kB

        Activity

          varun Varun Gupta (Inactive) added a comment - - edited

          In file sql/filesort.cc, function merge_buffers

          [1411]if (!(error= (int) read_to_buffer(from_file, buffpek,
                                                  rec_length)))
                {
                  queue_remove(&queue,0);
                  reuse_freed_buff(&queue, buffpek, rec_length);
          [1416]      }
          

          The assert in queue_remove(file queue.c)

          uchar *queue_remove(register QUEUE *queue, uint idx)
          {
            uchar *element;
            DBUG_ASSERT(idx >= 1 && idx <= queue->elements);
            element= queue->root[idx];
            _downheap(queue, idx, queue->root[queue->elements--]);
            return element;
          }
          

          The assert fails because the queue_remove function is called with value of idx=0.

          varun Varun Gupta (Inactive) added a comment - - edited In file sql/filesort.cc, function merge_buffers [ 1411 ] if (!(error= ( int ) read_to_buffer(from_file, buffpek, rec_length))) { queue_remove(&queue, 0 ); reuse_freed_buff(&queue, buffpek, rec_length); [ 1416 ] } The assert in queue_remove(file queue.c) uchar *queue_remove(register QUEUE *queue, uint idx) { uchar *element; DBUG_ASSERT(idx >= 1 && idx <= queue->elements); element= queue->root[idx]; _downheap(queue, idx, queue->root[queue->elements--]); return element; } The assert fails because the queue_remove function is called with value of idx=0.

          Yes but isn't it odd that queue_remove() has an assert that idx>=1, while filesort.cc has code that ALWAYS calls with idx=0?

          psergei Sergei Petrunia added a comment - Yes but isn't it odd that queue_remove() has an assert that idx>=1, while filesort.cc has code that ALWAYS calls with idx=0?
          psergei Sergei Petrunia added a comment - - edited

          Investigating how we ended up with this mismatch.
          ]
          The call to

          queue_remove(&queue, 0);
          

          in filesort.cc comes from this patch

          commit 76f0b94bb0b2994d639353530c5b251d0f1a204b
          Author:	Sergei Golubchik <sergii@pisem.net>  Wed Oct 19 22:45:18 2011
          Committer:	Sergei Golubchik <sergii@pisem.net>  Wed Oct 19 22:45:18 2011
           
              merge with 5.3
           
           ...
          

          The change made by that patch was

           -        VOID(queue_remove(&queue,0));
          ++        queue_remove(&queue,0);
          

          The old code was introduced by

          commit 69dd773b67a9892561f148964e51d3be606f7c77
          Author: Igor Babaev <igor@askmonty.org>
          Date:   Mon Sep 13 15:22:11 2010 -0700
           
              An implementation of index intersect via a modified Unique class.
              This code is planned to be used for mwl#21.
          

          but when that happened, queue_remove(..., 0) was used in multiple places.
          queue_remove_top didn't exist, and the assert in queue_remove() looked like this:

          DBUG_ASSERT(idx < queue->max_elements);
          

          queue_remove_top was introduced in

          commit ecbcddc03dc298ea1e6c0aa1a120bd0b4b04b3fd
          Author:	Michael Widenius <monty@askmonty.org>  Fri Jul 16 10:33:01 2010
          Committer:	Michael Widenius <monty@askmonty.org>  Fri Jul 16 10:33:01 2010
           
          Improved speed of thr_alarm from O(N) to O(1). thr_alarm is used to handle timeouts and kill of connections.
          Fixed compiler warnings.
          queues.h and queues.c are now based on the UNIREG code and thus made BSD.
          Fix code to use new queue() interface. This mostly affects how you access elements in the queue.
          ...
          

          Also that patch changed:

           uchar *queue_remove(register QUEUE *queue, uint idx)
           {
             uchar *element;
          -  DBUG_ASSERT(idx < queue->max_elements);
          -  element= queue->root[++idx];  /* Intern index starts from 1 */
          -  queue->root[idx]= queue->root[queue->elements--];
          -  _downheap(queue, idx);
          +  DBUG_ASSERT(idx >= 1 && idx <= queue->elements);
          +  element= queue->root[idx];
          +  _downheap(queue, idx, queue->root[queue->elements--]);
             return element;
           }
          

          Ok, it was a change in QUEUE code in 5.5 and a merge of code that used the old calling convention from 5.5

          psergei Sergei Petrunia added a comment - - edited Investigating how we ended up with this mismatch. ] The call to queue_remove(&queue, 0); in filesort.cc comes from this patch commit 76f0b94bb0b2994d639353530c5b251d0f1a204b Author: Sergei Golubchik <sergii@pisem.net> Wed Oct 19 22:45:18 2011 Committer: Sergei Golubchik <sergii@pisem.net> Wed Oct 19 22:45:18 2011   merge with 5.3 ... The change made by that patch was - VOID(queue_remove(&queue,0)); ++ queue_remove(&queue,0); The old code was introduced by commit 69dd773b67a9892561f148964e51d3be606f7c77 Author: Igor Babaev <igor@askmonty.org> Date: Mon Sep 13 15:22:11 2010 -0700   An implementation of index intersect via a modified Unique class. This code is planned to be used for mwl#21. but when that happened, queue_remove(..., 0) was used in multiple places. queue_remove_top didn't exist, and the assert in queue_remove() looked like this: DBUG_ASSERT(idx < queue->max_elements); queue_remove_top was introduced in commit ecbcddc03dc298ea1e6c0aa1a120bd0b4b04b3fd Author: Michael Widenius <monty@askmonty.org> Fri Jul 16 10:33:01 2010 Committer: Michael Widenius <monty@askmonty.org> Fri Jul 16 10:33:01 2010   Improved speed of thr_alarm from O(N) to O(1). thr_alarm is used to handle timeouts and kill of connections. Fixed compiler warnings. queues.h and queues.c are now based on the UNIREG code and thus made BSD. Fix code to use new queue() interface. This mostly affects how you access elements in the queue. ... Also that patch changed: uchar *queue_remove(register QUEUE *queue, uint idx) { uchar *element; - DBUG_ASSERT(idx < queue->max_elements); - element= queue->root[++idx]; /* Intern index starts from 1 */ - queue->root[idx]= queue->root[queue->elements--]; - _downheap(queue, idx); + DBUG_ASSERT(idx >= 1 && idx <= queue->elements); + element= queue->root[idx]; + _downheap(queue, idx, queue->root[queue->elements--]); return element; } Ok, it was a change in QUEUE code in 5.5 and a merge of code that used the old calling convention from 5.5

          Indeed, line where this bug was had no test coverage:

          I make a gcov build, run ./mysql-test-run --gcov --suite=main
          and in sql/CMakeFiles/sql.dir/filesort.cc.gcov I see:

                121: 1396:  if (unique_buff)
                  -: 1397:  {
                  -: 1398:    /* 
                  -: 1399:       Called by Unique::get()
                  -: 1400:       Copy the first argument to unique_buff for unique removal.
                  -: 1401:       Store it also in 'to_file'.
                  -: 1402:    */
                 28: 1403:    buffpek= (BUFFPEK*) queue_top(&queue);
                 28: 1404:    memcpy(unique_buff, buffpek->key, rec_length);
                 28: 1405:    if (min_dupl_count)
                  -: 1406:      memcpy(&dupl_count, unique_buff+dupl_count_ofs, 
                 26: 1407:             sizeof(dupl_count));
                 28: 1408:    buffpek->key+= rec_length;
                 28: 1409:    if (! --buffpek->mem_count)
                  -: 1410:    {
              #####: 1411:      if (!(error= (int) read_to_buffer(from_file, buffpek,
              #####: 1412:                                        rec_length)))
                  -: 1413:      {
              #####: 1414:        queue_remove(&queue,0);
              #####: 1415:        reuse_freed_buff(&queue, buffpek, rec_length);
                  -: 1416:      }
              #####: 1417:      else if (error == -1)
              #####: 1418:        goto err;                        /* purecov: inspected */ 
                  -: 1419:    }
          

          psergei Sergei Petrunia added a comment - Indeed, line where this bug was had no test coverage: I make a gcov build, run ./mysql-test-run --gcov --suite=main and in sql/CMakeFiles/sql.dir/filesort.cc.gcov I see: 121: 1396: if (unique_buff) -: 1397: { -: 1398: /* -: 1399: Called by Unique::get() -: 1400: Copy the first argument to unique_buff for unique removal. -: 1401: Store it also in 'to_file'. -: 1402: */ 28: 1403: buffpek= (BUFFPEK*) queue_top(&queue); 28: 1404: memcpy(unique_buff, buffpek->key, rec_length); 28: 1405: if (min_dupl_count) -: 1406: memcpy(&dupl_count, unique_buff+dupl_count_ofs, 26: 1407: sizeof(dupl_count)); 28: 1408: buffpek->key+= rec_length; 28: 1409: if (! --buffpek->mem_count) -: 1410: { #####: 1411: if (!(error= (int) read_to_buffer(from_file, buffpek, #####: 1412: rec_length))) -: 1413: { #####: 1414: queue_remove(&queue,0); #####: 1415: reuse_freed_buff(&queue, buffpek, rec_length); -: 1416: } #####: 1417: else if (error == -1) #####: 1418: goto err; /* purecov: inspected */ -: 1419: }

          A smaller testcase:

          -- source include/have_innodb.inc
           
          create table t2 (
            pk int(11) NOT NULL AUTO_INCREMENT,
            col1 int(11) NOT NULL,
            col2 int(11) NOT NULL,
            col3 int(11) NOT NULL,
            key2 int(11) NOT NULL,
            col4 int(11) NOT NULL,
            key1 int(11) NOT NULL,
            PRIMARY KEY (pk),
            KEY key1 (key1),
            KEY key2 (key2)
          ) ENGINE=InnoDB AUTO_INCREMENT=12860259 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;
           
          create table ten(a int);
          insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
           
          create table ten_k(a int);
          insert into ten_k select A.a + B.a* 10 + C.a * 100 + D.a*1000 from ten A, ten B, ten C, ten D;
           
          insert into t2 (key1, key2, col1,col2,col3,col4)
          select 
            a,a,
            a,a,a,a
          from ten_k;
          SELECT sum(col1) FROM t2 FORCE INDEX (key1,key2) WHERE  (key1 between 10 and 8191+10) or (key2= 5);
          

          varun, please rename all tables t0,t1,t2.. and use that for the bugfix.

          psergei Sergei Petrunia added a comment - A smaller testcase: -- source include/have_innodb.inc   create table t2 ( pk int(11) NOT NULL AUTO_INCREMENT, col1 int(11) NOT NULL, col2 int(11) NOT NULL, col3 int(11) NOT NULL, key2 int(11) NOT NULL, col4 int(11) NOT NULL, key1 int(11) NOT NULL, PRIMARY KEY (pk), KEY key1 (key1), KEY key2 (key2) ) ENGINE=InnoDB AUTO_INCREMENT=12860259 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;   create table ten(a int); insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);   create table ten_k(a int); insert into ten_k select A.a + B.a* 10 + C.a * 100 + D.a*1000 from ten A, ten B, ten C, ten D;   insert into t2 (key1, key2, col1,col2,col3,col4) select a,a, a,a,a,a from ten_k; SELECT sum(col1) FROM t2 FORCE INDEX (key1,key2) WHERE (key1 between 10 and 8191+10) or (key2= 5); varun , please rename all tables t0,t1,t2.. and use that for the bugfix.

          People

            varun Varun Gupta (Inactive)
            bradjorgensen Brad Jorgensen
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.