Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32489

Change buffer index fails to delete the records

Details

    Description

      While testing bb-10.6-thiru, InnoDB hangs during shutdown and it shows the
      following information in error log file:

      2023-10-10  6:11:53 0 [Note] Completing change buffer merge; 1 page reads initiated; 3 change buffer pages remain
      2023-10-10  6:12:08 0 [Note] Completing change buffer merge; 1 page reads initiated; 3 change buffer pages remain
      2023-10-10  6:12:23 0 [Note] Completing change buffer merge; 1 page reads initiated; 3 change buffer pages remain
      2023-10-10  6:12:38 0 [Note] Completing change buffer merge; 1 page reads initiated; 3 change buffer pages remain
      

      Analysis:
      =========
      During shutdown, InnoDB calls ibuf_merge_or_delete_for_page() for the problematic page (0, 739). But the desired bit for a given
      page in the bitmap page is already set to IBUF_BITMAP_FREE. So we fail to remove the entry (0, 739) from change buffer index. So
      jumped to understand ibuf_delete_recs() where the problematic page (0, 739) was deleted. During that, InnoDB change
      buffer index has only root, leaf pages and there are no internal nodes.

        mtr_t mtr;
      loop:
        btr_pcur_t pcur;
        pcur.btr_cur.page_cur.index= ibuf.index;
        ibuf_mtr_start(&mtr);
        if (btr_pcur_open(&tuple, PAGE_CUR_GE, BTR_MODIFY_LEAF, &pcur, &mtr))
          goto func_exit;
        if (!btr_pcur_is_on_user_rec(&pcur))
        {
          ut_ad(btr_pcur_is_after_last_on_page(&pcur));
          goto func_exit;
        }
       
        for (;;)
        {
          ut_ad(btr_pcur_is_on_user_rec(&pcur));
          const rec_t* ibuf_rec = btr_pcur_get_rec(&pcur);
          if (ibuf_rec_get_space(&mtr, ibuf_rec) != page_id.space()
              || ibuf_rec_get_page_no(&mtr, ibuf_rec) != page_id.page_no())
            break;
          /* Delete the record from ibuf */
          if (ibuf_delete_rec(page_id, &pcur, &tuple, &mtr))
          {
            /* Deletion was pessimistic and mtr was committed:
            we start from the beginning again */
            ut_ad(mtr.has_committed());
            goto loop;
          }
       
          if (btr_pcur_is_after_last_on_page(&pcur))
          {
            ibuf_mtr_commit(&mtr);
            btr_pcur_close(&pcur);
            goto loop;
          }
      

      btr_cur_open() searches with tuple only page_id (0, 739). Root page has this following (..(0, 563)(child 60), (0, 739)(child 63)..) since the mode is
      PAGE_CUR_L for non-leaf node. It leads to child page 60 and deletes the record of (0, 739). once we reached the end of the page, again we do
      open the change buffer index and we end up in page 60. Fail to find the record on page 60. ibuf_delete_recs() fails to delete the entries completely.
      Even page 62 which is next to child page 60 has the record (0, 739)

      Since change buffer index is in 5.5+ format, primary key for the index is

      {space, 0, page_no, counter}

      . But we fail to use the counter field for searching the tuple.

      Thanks to vlad.lesin for helping me in analysing this issue.

      Attachments

        Issue Links

          Activity

            > Since change buffer index is in 5.5+ format, primary key for the index is

            {space, 0, page_no, counter}

            I am not sure about this, because when we get offsets of the record on root page, there are 7 fields. The first 4 fields is {space, 0, page_no, counter}

            , the last field is a child page id. There are 2 more fields in the key. And, if so, https://github.com/MariaDB/server/commit/015ab499696382b0e3b8d70118beefafd328a779 commit just shifts the problem from "counter" field to the next to the "counter" one.

            I.e. what if we have

            {a, 0, b, count, c, ...}

            record on leaf page N and

            {a, 0, b, count, c+1, ...}

            record on page N+1. And we use

            {a, 0, b, count}

            search tuple. We will delete all

            {a, 0, b, count, ...}

            records on page N, and then, when we open cursor again, there will be the same problem as we have now, i.e. the cursor will point to the supremum of page N, and we will miss records on page N+1.

            vlad.lesin Vladislav Lesin added a comment - > Since change buffer index is in 5.5+ format, primary key for the index is {space, 0, page_no, counter} I am not sure about this, because when we get offsets of the record on root page, there are 7 fields. The first 4 fields is {space, 0, page_no, counter} , the last field is a child page id. There are 2 more fields in the key. And, if so, https://github.com/MariaDB/server/commit/015ab499696382b0e3b8d70118beefafd328a779 commit just shifts the problem from "counter" field to the next to the "counter" one. I.e. what if we have {a, 0, b, count, c, ...} record on leaf page N and {a, 0, b, count, c+1, ...} record on page N+1. And we use {a, 0, b, count} search tuple. We will delete all {a, 0, b, count, ...} records on page N, and then, when we open cursor again, there will be the same problem as we have now, i.e. the cursor will point to the supremum of page N, and we will miss records on page N+1.

            Change buffer records may be in the pre-MySQL 5.5 format in case innodb_change_buffering=inserts is being used. We must not assume anything about the format.

            I think that we should perform a forward scan and first delete-mark all records in the same mini-transaction that modifies the change buffer bitmap. Then, start another mini-transaction that will lock the entire change buffer while doing a forward scan and "pessimistic" delete of all matching change buffer records.

            marko Marko Mäkelä added a comment - Change buffer records may be in the pre-MySQL 5.5 format in case innodb_change_buffering=inserts is being used. We must not assume anything about the format. I think that we should perform a forward scan and first delete-mark all records in the same mini-transaction that modifies the change buffer bitmap. Then, start another mini-transaction that will lock the entire change buffer while doing a forward scan and "pessimistic" delete of all matching change buffer records.

            I could analyze the issue from the RR trace. The internal cursor routines and modes (PAGE_CUR_GE) are behaving well and as per design. The IBUF B-tree state is also correct. Also, verified that there is no issue with the format or key comparison.

            The reason why we are not able to delete all the records is in the implementation of ibuf_delete_recs function. When the cursor is positioned in supremum record in previous page, which is required for correct operation of GE and LT mode, we are exiting from the function. Landing in previous non-user record is not any general issue and the correct way is to proceed further to the next leaf page to get back all the records. The easiest way is to use the btr_pcur_open_on_user_rec() interface which takes care of positioning the cursor in first available user record.

            It is somewhat hard to repeat the issue as not all paths use ibuf_delete_recs and we need to simulate the case where IBUF records for that page spans across multiple leaf pages of the IBUF tree. I could repeat the issue with 4k page size and some code tweak. I am attaching the code modifications and the test here. This test can be used to understand the issue better and I don't intend to add it to our mtr test suite as it could be hard to maintain it over time.

            1. Code tweak to force ibuf_delete_recs while allocating page after deleting and re-creating index.

            diff --git a/storage/innobase/fsp/fsp0fsp.cc b/storage/innobase/fsp/fsp0fsp.cc
            index c5d04a19662..2d18edd2211 100644
            --- a/storage/innobase/fsp/fsp0fsp.cc
            +++ b/storage/innobase/fsp/fsp0fsp.cc
            @@ -1215,8 +1215,9 @@ fsp_alloc_free_page(
             		}
             	}
             
            +	auto ret_block = fsp_page_create(space, page_no, init_mtr);
             	fsp_alloc_from_free_frag(block, xdes, descr, free, mtr);
            -	return fsp_page_create(space, page_no, init_mtr);
            +	return ret_block;
             }
            

            2. mtr Test: mysql-test/suite/innodb/t/change_buffer_delete.test

            #
            # MDEV 32489: Change buffer index fails to delete the records
            #
            -- source include/have_innodb.inc
            SET GLOBAL innodb_change_buffering = none;
            SET GLOBAL innodb_file_per_table = OFF;
             
            let $loop=10;
            while ($loop)
            {
              eval CREATE TABLE t$loop(c1 INT AUTO_INCREMENT PRIMARY KEY,c2 CHAR(100))ENGINE=InnoDB;
              eval CREATE INDEX i1 ON t$loop (c2);
             
              eval INSERT INTO t$loop(c2) VALUES('mariadb');
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop LIMIT 4;
             
              dec $loop;
            }
             
            # -- vertical_results
            # select BLOCK_ID, SPACE, PAGE_NUMBER, PAGE_TYPE , TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
             
            SET GLOBAL innodb_change_buffering = all;
             
            --echo # Force change buffering when possible in debug build.
            --disable_query_log
            --error 0,ER_UNKNOWN_SYSTEM_VARIABLE
            SET GLOBAL innodb_change_buffering_debug = 1;
            --enable_query_log
             
            let $loop=10;
            eval INSERT INTO t10(c2) SELECT c2 FROM t8 IGNORE INDEX (i1) LIMIT 1;
            dec $loop;
             
            while ($loop)
            {
              eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop IGNORE INDEX (i1) LIMIT 4;
              dec $loop;
            }
             
            # select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
            # select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;
             
            --disable_query_log
            --error 0,ER_UNKNOWN_SYSTEM_VARIABLE
            SET GLOBAL innodb_change_buffering_debug = default;
            --enable_query_log
             
            SET GLOBAL innodb_change_buffering = default;
            DROP INDEX i1 ON t7;
            CREATE INDEX i1 ON t7 (c2);
            # select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
            # select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;
             
            SET GLOBAL innodb_fast_shutdown=0;
            --source include/restart_mysqld.inc
             
            let $loop=10;
            while ($loop)
            {
              eval DROP TABLE t$loop;
              dec $loop;
            }
            SET GLOBAL innodb_file_per_table = default;
            

            3. Test run with 4k page size

            ./mtr --mem --mysqld=--innodb_page_size=4k innodb.change_buffer_delete
            

            debarun Debarun Banerjee added a comment - I could analyze the issue from the RR trace. The internal cursor routines and modes (PAGE_CUR_GE) are behaving well and as per design. The IBUF B-tree state is also correct. Also, verified that there is no issue with the format or key comparison. The reason why we are not able to delete all the records is in the implementation of ibuf_delete_recs function. When the cursor is positioned in supremum record in previous page, which is required for correct operation of GE and LT mode, we are exiting from the function. Landing in previous non-user record is not any general issue and the correct way is to proceed further to the next leaf page to get back all the records. The easiest way is to use the btr_pcur_open_on_user_rec() interface which takes care of positioning the cursor in first available user record. It is somewhat hard to repeat the issue as not all paths use ibuf_delete_recs and we need to simulate the case where IBUF records for that page spans across multiple leaf pages of the IBUF tree. I could repeat the issue with 4k page size and some code tweak. I am attaching the code modifications and the test here. This test can be used to understand the issue better and I don't intend to add it to our mtr test suite as it could be hard to maintain it over time. 1. Code tweak to force ibuf_delete_recs while allocating page after deleting and re-creating index. diff --git a/storage/innobase/fsp/fsp0fsp.cc b/storage/innobase/fsp/fsp0fsp.cc index c5d04a19662..2d18edd2211 100644 --- a/storage/innobase/fsp/fsp0fsp.cc +++ b/storage/innobase/fsp/fsp0fsp.cc @@ -1215,8 +1215,9 @@ fsp_alloc_free_page( } } + auto ret_block = fsp_page_create(space, page_no, init_mtr); fsp_alloc_from_free_frag(block, xdes, descr, free, mtr); - return fsp_page_create(space, page_no, init_mtr); + return ret_block; } 2. mtr Test: mysql-test/suite/innodb/t/change_buffer_delete.test # # MDEV 32489: Change buffer index fails to delete the records # -- source include/have_innodb.inc SET GLOBAL innodb_change_buffering = none; SET GLOBAL innodb_file_per_table = OFF;   let $loop=10; while ($loop) { eval CREATE TABLE t$loop(c1 INT AUTO_INCREMENT PRIMARY KEY,c2 CHAR(100))ENGINE=InnoDB; eval CREATE INDEX i1 ON t$loop (c2);   eval INSERT INTO t$loop(c2) VALUES('mariadb'); eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop; eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop; eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop; eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop; eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop; eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop LIMIT 4;   dec $loop; }   # -- vertical_results # select BLOCK_ID, SPACE, PAGE_NUMBER, PAGE_TYPE , TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";   SET GLOBAL innodb_change_buffering = all;   --echo # Force change buffering when possible in debug build. --disable_query_log --error 0,ER_UNKNOWN_SYSTEM_VARIABLE SET GLOBAL innodb_change_buffering_debug = 1; --enable_query_log   let $loop=10; eval INSERT INTO t10(c2) SELECT c2 FROM t8 IGNORE INDEX (i1) LIMIT 1; dec $loop;   while ($loop) { eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop IGNORE INDEX (i1) LIMIT 4; dec $loop; }   # select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1"; # select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;   --disable_query_log --error 0,ER_UNKNOWN_SYSTEM_VARIABLE SET GLOBAL innodb_change_buffering_debug = default; --enable_query_log   SET GLOBAL innodb_change_buffering = default; DROP INDEX i1 ON t7; CREATE INDEX i1 ON t7 (c2); # select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1"; # select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;   SET GLOBAL innodb_fast_shutdown=0; --source include/restart_mysqld.inc   let $loop=10; while ($loop) { eval DROP TABLE t$loop; dec $loop; } SET GLOBAL innodb_file_per_table = default; 3. Test run with 4k page size ./mtr --mem --mysqld=--innodb_page_size=4k innodb.change_buffer_delete

            Uploaded patch for review

            debarun Debarun Banerjee added a comment - Uploaded patch for review

            Thank you. I like the general idea of the patch. It should be OK to push after addressing some of my comments.

            marko Marko Mäkelä added a comment - Thank you. I like the general idea of the patch. It should be OK to push after addressing some of my comments.

            I have tried to address the review comments. Please have a look.

            debarun Debarun Banerjee added a comment - I have tried to address the review comments. Please have a look.

            People

              debarun Debarun Banerjee
              thiru Thirunarayanan Balathandayuthapani
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.