MDEV-21899: INSERT into a secondary index with zero-data-length key is not crash-safe

    Description

      Problem found during RQG testing, invoking the RQG reporter "Mariabackup".
      Scenario:
      1. A more or less heavy DDL/DML load is running.
           While it is running, a call to   mariabackup ... --backup ...   is made.
      2. mariabackup ... --prepare ...  is called for the first time.
       
      2020-03-06 19:20:38 0 [Note] InnoDB: Starting final batch to recover 491 pages from redo log.
      2020-03-06 19:20:39 0 [ERROR] InnoDB: Not applying DELETE_ROW_FORMAT_DYNAMIC due to corruption on [page id: space=167, page number=36]
      2020-03-06 19:20:39 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2020-03-06 19:20:39 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1580] with error Data structure corruption
      [00] FATAL ERROR: 2020-03-06 19:20:40 mariabackup: innodb_init() returned 39 (Data structure corruption).
       
      origin/HEAD, origin/10.5 23685378bafa38a74957e8f07e2d56b57e90fa53 2020-03-06T11:06:59+02:00
       
       
      A second variant of the same failure:
      1. Only one session, which runs
          REPLACE INTO <random table which was created earlier> ( _field_no_pk ) VALUES ( <random _mediumint value> ) ;
      2. At some point in time the server is killed (SIGKILL).
      3. An attempt is made to restart the server.
       
      2020-03-23 14:12:49 0 [Note] InnoDB: Completed initialization of buffer pool
      2020-03-23 14:12:49 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
      2020-03-23 14:12:49 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=335923252
      2020-03-23 14:12:49 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 2 row operations to undo
      2020-03-23 14:12:49 0 [Note] InnoDB: Trx id counter is 21150
      2020-03-23 14:12:49 0 [Note] InnoDB: Starting final batch to recover 272 pages from redo log.
      2020-03-23 14:12:49 0 [ERROR] InnoDB: Not applying DELETE_ROW_FORMAT_DYNAMIC due to corruption on [page id: space=5, page number=22]
      2020-03-23 14:12:49 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2020-03-23 14:12:49 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1574] with error Data structure corruption
       
      origin/10.5 02fe99750536aad7c9e75513be47c91c8c568419 2020-03-23T10:45:08+01:00
       
      The bug affects both debug and non-debug builds.
      
      

      Attachments

        1. MDEV-21899.tgz (869 kB)
        2. rqg.log (381 kB)

          Activity

            Marko Mäkelä added a comment:

            In the case that I analyzed, I observed 3 or 4 mini-transactions for the page:

            1. Create the B-tree page (root page of a secondary index, page 36 in the tablespace).
            2. Insert one record to the B-tree page, with the key 0x80000000 (looks like the PRIMARY KEY was an empty string).
            3. Update the PAGE_MAX_TRX_ID and delete-mark the record.
            4. Delete a non-existing record with the non-existing predecessor at byte offset 0x8a.

            We could detect the inconsistency earlier by checking that the predecessor record does not point beyond PAGE_HEAP_TOP. We could also copy some further consistency checks from the INSERT function.
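
            A minimal sketch of such a bound check (hypothetical helper and names; the real check would belong in page_apply_delete_dynamic()):

            #include <cstdint>

            // Hypothetical helper: reject a logged DELETE whose predecessor record
            // offset lies in unallocated space. PAGE_HEAP_TOP is stored big-endian
            // at byte offset FIL_PAGE_DATA + PAGE_HEAP_TOP = 38 + 2 = 0x28.
            static bool delete_prev_offset_is_sane(const uint8_t *frame,
                                                   uint16_t prev_offset)
            {
              const uint16_t heap_top= uint16_t(frame[0x28] << 8 | frame[0x29]);
              return prev_offset < heap_top; // bytes at or above PAGE_HEAP_TOP are unused
            }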

            Until we have a reasonably repeatable test case, we do not know what is causing this. It would seem that some redo log records are missing. A bug in Mariabackup?


            Marko Mäkelä added a comment:

            I think that I finally found the cause. It was rather challenging, so I am providing details for future reference, to aid in similar debugging. The patch is simple, and the cause is MDEV-21724:

            diff --git a/storage/innobase/page/page0cur.cc b/storage/innobase/page/page0cur.cc
            index bce7231059a..ec79450995a 100644
            --- a/storage/innobase/page/page0cur.cc
            +++ b/storage/innobase/page/page0cur.cc
            @@ -1551,7 +1551,6 @@ page_cur_insert_rec_low(
               /* Insert the record, possibly copying from the preceding record. */
               ut_ad(mtr->get_log_mode() == MTR_LOG_ALL);
             
            -  if (data_size)
               {
                 const byte *r= rec;
                 const byte *c= cur->rec;
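
            To see why the removed guard was harmful: the block it gated is the one that compares the new record against the preceding (cursor) record and ultimately emits the redo log bytes for the insert. A minimal sketch (simplified, hypothetical names; not the actual function):

            #include <cstddef>
            #include <cstdint>

            // Hypothetical stand-in for the mini-transaction interface.
            struct mtr_sketch
            {
              void log_insert(const uint8_t *tail, size_t tail_len, size_t common)
              { /* append the record bytes to the mtr log buffer */ }
            };

            // Sketch of the tail of page_cur_insert_rec_low(): log only the bytes
            // in which the new record differs from its predecessor.
            void log_insert_sketch(mtr_sketch *mtr, const uint8_t *rec,
                                   const uint8_t *cur, size_t rec_size,
                                   size_t data_size)
            {
              // BUG (before the fix): everything below was wrapped in
              // "if (data_size)", so a record whose data payload is 0 bytes was
              // applied to the page image but never written to the redo log.
              (void) data_size;
              size_t common= 0;
              while (common < rec_size && rec[common] == cur[common])
                common++; // length of the prefix shared with the preceding record
              mtr->log_insert(rec + common, rec_size - common, common);
            }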
            

            In a provided dataset, there were at least 2 pages for which the record failed to apply: pages 5:9 and 5:12. I concentrated on page 5:9. For that one, I instrumented the crash recovery with gdb as follows:

            break ib::error::~error
            break page_apply_delete_dynamic
            cond 2 block.page.id.m_id==0x500000009
            break 'recv_recover_page(buf_block_t*, mtr_t&, std::_Rb_tree_iterator<std::pair<page_id_t const, page_recv_t> > const&, fil_space_t*, mlog_init_t::init*)'
            cond 3 block.page.id.m_id==0x500000009
            command 3
            set scheduler-locking on
            enable 4
            continue
            break log_phys_t::apply
            disable 4
            continue
            …
            display/x *begin()@len
            display *this
            command 4
            continue
            end
            continue
            

            There were 2 page_apply_delete_dynamic() calls to this page. The second one failed because it pointed to an area of the page that was filled with zero bytes. That is, the next-record loop that locates the record owner via the sparse page directory would exceed n_recs, and the page would correctly be regarded as corrupted.
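
            A sketch of that owner lookup (hypothetical simplified form, assuming the compact record header layout and a 16 KiB page):

            #include <cstdint>

            // Follow next-record links from a record until one that owns a page
            // directory slot is found (its n_owned nibble, the low 4 bits of the
            // byte 5 bytes before the record origin, is nonzero). Walking more
            // than n_recs steps means the chain is broken, e.g. because it leads
            // into a zeroed-out area of the page.
            static bool find_owner_sane(const uint8_t *frame, uint16_t rec_offset,
                                        uint16_t n_recs)
            {
              for (uint16_t steps= 0; steps <= n_recs; steps++)
              {
                const uint8_t *rec= frame + rec_offset;
                if (rec[-5] & 0xf) // this record owns a directory slot
                  return true;
                // next-record pointer: big-endian 2 bytes just before the origin,
                // relative to the current record, modulo the page size
                rec_offset= uint16_t(rec_offset + (rec[-2] << 8 | rec[-1])) & 0x3fff;
              }
              return false; // exceeded n_recs: regard the page as corrupted
            }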

            Armed with the knowledge of the LSN and the redo log record contents, I used the provided rr replay environment. First, I let it run to the end (to the SIGKILL that was originally initiated by RQG) and issued when to see how many events there are in total; it was above 4 million. Then I did roughly the following (omitting lots of steps):

            run 4000000
            break row_upd_sec_index_entry
            cond 1 log_sys.lsn._M_i>176623800
            break mtr_buf_t::close
            cond 2 log_sys.lsn._M_i>176623800
            define find_page_105
              set $id=((unsigned long long)$arg0)<<32|$arg1
              set $chunk=buf_pool->chunks
              set $i=buf_pool->n_chunks
              set $found=0
              while ($i-- && !$found)
                set $j = $chunk->size
                set $b = $chunk->blocks
                while ($j--)
                  if ($b->page.id.m_id==$id)
                    print $b
                    set $found=1
                    loop_break
                  end
                  set $b++
                end
                set $chunk++
              end
            end
             
            document find_page_105
             Buffer pool look-up: find_page_105 space_id page_no
            end
            find_page_105 5 3
            watch -l *(ulonglong*)$1.frame[0x38]@8
            watch -l $1.frame[0x7a]
            display/x $1.frame[0x38]@8
            display log_sys.lsn
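
            The conditions above rely on the packing of page_id_t::m_id in the 10.5 sources of that time: tablespace id in the high 32 bits, page number in the low 32 bits. As a sketch:

            #include <cstdint>

            constexpr uint64_t page_id(uint32_t space, uint32_t page_no)
            {
              return uint64_t{space} << 32 | page_no;
            }
            static_assert(page_id(5, 9) == 0x500000009,
                          "page 5:9, as in the cond expressions above");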
            

            The first hardware watchpoint is for PAGE_MAX_TRX_ID on the page. The page turned out to be a secondary index page.
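
            The watched address follows from the fixed page layout: the index page header starts at FIL_PAGE_DATA = 38, and PAGE_MAX_TRX_ID is at offset 18 within it, hence byte 38 + 18 = 56 = 0x38 of the frame. A sketch of reading it (on a little-endian host; InnoDB stores integers big-endian):

            #include <cstdint>
            #include <cstring>

            static uint64_t page_max_trx_id(const uint8_t *frame)
            {
              uint64_t v;
              std::memcpy(&v, frame + 0x38, 8); // the 8 bytes watched above
              return __builtin_bswap64(v); // convert from big-endian storage
            }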

            The second watchpoint is for a byte that used to be a delete-mark flag that was set back and forth by the test. It was accidentally hit by page_cur_insert_rec_low(). Thanks to the breakpoint on mtr_buf_t::close, which was not called for that insert, I knew that something was wrong. So, I went back in time by issuing

            when
            run (the reported number - a little bit)
            

            and single-stepped through the function page_cur_insert_rec_low(). The condition that my patch above removed is redundant and outright harmful.

            It is a rare corner case that a secondary index record has no data payload at all. It is possible when the PRIMARY KEY consists of variable-length string columns that are all empty, and all secondary index columns are either NULL or the empty string.
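
            A toy model of the payload computation (not InnoDB code): in ROW_FORMAT=DYNAMIC, a NULL column is recorded only as a flag bit in the record header, and an empty string contributes only a length byte to the header, so a (NULL, '') secondary index record carries 0 bytes of payload:

            #include <cstddef>
            #include <optional>
            #include <string>
            #include <vector>

            // Only non-NULL values contribute payload bytes; NULL flags and
            // length bytes live in the record header, not the payload.
            static size_t payload_size(const std::vector<std::optional<std::string>> &cols)
            {
              size_t n= 0;
              for (const auto &c : cols)
                if (c)
                  n+= c->size(); // NULL adds nothing; "" adds 0 bytes
              return n;
            }

            // payload_size({std::nullopt, std::string()}) == 0:
            // the zero-data-length key of this bug.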

            We failed to write the INSERT_REUSE_DYNAMIC redo log record for the following statement:

            REPLACE INTO `table100_innodb_default` ( `col_text_latin1_unique_default_null` ) VALUES ( 5900288 );
            

            This was on the secondary index col_char_latin1_unique, and the value is (NULL,''), as expected.


            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Matthias Leich (mleich)