  1. MariaDB Server
  2. MDEV-35508

Race condition between purge and secondary index INSERT or UPDATE

Details

    Description

      mleich produced an rr replay trace of an MDEV-35049 development branch that shows this same error, but without the involvement of OPTIMIZE TABLE. Initially, I suspected that this failure was specific to that branch, because the branch includes a rewrite of how keys are searched for in B-tree pages.

      ssh sdp
      rr replay /data/results/1732217540/Marko-3/1/rr/latest-trace
      

      This turned out to be a race condition. It took me a while to figure out how to debug it. Eventually, I set a hardware data watchpoint on the clustered index record (DB_ROW_ID=0x4b7) in the buffer pool, specifically on the last 4 bytes of the DB_ROLL_PTR field and the 4 bytes of the problematic col_int_key field, to catch what was going on. I also set a watchpoint on the delete-mark flag of the secondary index record (col_int_key,DB_ROW_ID)=(5,0x4b7).

      The hardware data watchpoint on the clustered index record was hit by the following statements:

      INSERT /*! IGNORE */ INTO table10000_innodb VALUES (5, 5), (5,5), …;
      UPDATE table10000_innodb SET `col_int_key` = 4 /* E_R Thread6 QNO 8 CON_ID 23 */
      UPDATE table10000_innodb SET `col_int_key` = 5 /* E_R Thread7 QNO 13 CON_ID 24 */
      

      During the execution of the second UPDATE (transaction 0x1e, just a little too new to be included in purge_sys.view), the purge of the first UPDATE was blocked in the following call stack:

      10.6-MDEV-35049 36a8b44ebd96ec9a8d449c83248109d5e893f534

      #17 log_free_check () at /data/Server/10.6-MDEV-35049E/storage/innobase/log/log0log.cc:956
      #18 0x000063a8330d1a61 in row_purge_remove_sec_if_poss_tree (node=node@entry=0x63a8354a62b8, index=index@entry=0x7d1998069e78, entry=entry@entry=0x7d19a402df88, page_max_trx_id=page_max_trx_id@entry=0x1e)
          at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:767
      #19 0x000063a8330d2925 in row_purge_remove_sec_if_poss (node=node@entry=0x63a8354a62b8, index=0x7d1998069e78, entry=0x7d19a402df88) at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:991
      #20 0x000063a8330d31c6 in row_purge_upd_exist_or_extern_func (thr=thr@entry=0x63a8354a6218, node=node@entry=0x63a8354a62b8, undo_rec=undo_rec@entry=0x7d19c51abdda ">\b\f\202\267\022")
          at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1157
      #21 0x000063a8330d36a6 in row_purge_record_func (node=node@entry=0x63a8354a62b8, undo_rec=undo_rec@entry=0x7d19c51abdda ">\b\f\202\267\022", thr=thr@entry=0x63a8354a6218, updated_extern=0x0)
          at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1548
      #22 0x000063a8330d3baa in row_purge (node=node@entry=0x63a8354a62b8, undo_rec=undo_rec@entry=0x7d19c51abdda ">\b\f\202\267\022", thr=thr@entry=0x63a8354a6218)
          at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1587
      #23 0x000063a8330d3c09 in row_purge_step (thr=thr@entry=0x63a8354a6218) at /data/Server/10.6-MDEV-35049E/storage/innobase/row/row0purge.cc:1650
      

      While purge was blocked, the second UPDATE both updated the clustered index record and removed the delete-mark from the secondary index record (5,0x4b7), which had been delete-marked by the first UPDATE. Purge then reported an error and hit ut_ad(0), crashing the debug-instrumented build:

      2024-11-21 13:14:38 0 [ERROR] InnoDB: tried to purge non-delete-marked record in index `col_int_key` of table `test`.`table10000_innodb`: tuple: TUPLE (info_bits=0, 2 fields): {[4]    (0x80000005),[6]      (0x0000000004B7)}, record: COMPACT RECORD(info_bits=0, 2 fields): {[4]    (0x80000005),[6]      (0x0000000004B7)}
      

      The problem turns out to be that MDEV-34515 introduced an unsafe optimization: If the PAGE_MAX_TRX_ID did not change between row_purge_remove_sec_if_poss_leaf() and row_purge_remove_sec_if_poss_tree(), a call to row_purge_poss_sec() would be skipped.
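
      To illustrate why an unchanged PAGE_MAX_TRX_ID does not prove that the record is unchanged, here is a minimal sketch in C++. It is a toy model and not InnoDB code: LeafPage, touch_page() and shortcut_misses_change() are hypothetical names, the transaction ID 0x1d for the first UPDATE is assumed, and the premise that transaction 0x1e had already modified the page before the leaf pass is only inferred from page_max_trx_id=0x1e in the stack trace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of a secondary index leaf page.
struct LeafPage {
  std::uint64_t page_max_trx_id = 0;  // models PAGE_MAX_TRX_ID
  bool rec_delete_marked = false;     // delete-mark flag of record (5,0x4b7)
};

// Any modification by a transaction raises PAGE_MAX_TRX_ID to at least its ID.
// A second modification by the same transaction leaves the field unchanged.
inline void touch_page(LeafPage &page, std::uint64_t trx_id) {
  page.page_max_trx_id = std::max(page.page_max_trx_id, trx_id);
}

// Replays the interleaving from the report. Returns true if the
// "PAGE_MAX_TRX_ID unchanged" shortcut would skip the recheck even though
// the record is no longer delete-marked.
inline bool shortcut_misses_change() {
  LeafPage page;

  // First UPDATE (ID 0x1d is an assumption) delete-marks (5,0x4b7).
  touch_page(page, 0x1d);
  page.rec_delete_marked = true;

  // Second UPDATE (0x1e) modifies some other record on the same page.
  touch_page(page, 0x1e);

  // Purge, leaf pass: the record is delete-marked; remember PAGE_MAX_TRX_ID.
  assert(page.rec_delete_marked);
  const std::uint64_t saved = page.page_max_trx_id;

  // Purge waits in log_free_check(). Meanwhile the second UPDATE removes
  // the delete-mark; PAGE_MAX_TRX_ID stays at 0x1e.
  touch_page(page, 0x1e);
  page.rec_delete_marked = false;

  // Purge, tree pass: the shortcut compares only the page-level ID.
  const bool skip_recheck = (page.page_max_trx_id == saved);
  return skip_recheck && !page.rec_delete_marked;
}
```

      In this model shortcut_misses_change() returns true: the page-level field is a maximum, so repeated writes by the same still-active transaction are invisible to the comparison.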

      A more correct condition would be the following: If the PAGE_MAX_TRX_ID was not changed and it did not belong to an active transaction when row_purge_remove_sec_if_poss_leaf() was holding the secondary index leaf page latch, the check would be redundant.
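
      The condition above could be sketched as the following predicate. This is an illustration under stated assumptions, not the actual fix: PurgeViewModel, can_skip_recheck_unsafe() and can_skip_recheck_safer() are hypothetical names, and "did not belong to an active transaction" is approximated by a single cutoff below which every transaction ID is treated as visible to the purge view. The real purge_sys.view is more detailed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the purge read view: every transaction ID below
// low_limit is treated as committed and visible to purge. This is a
// deliberate simplification for the sketch.
struct PurgeViewModel {
  std::uint64_t low_limit;
  bool sees(std::uint64_t trx_id) const { return trx_id < low_limit; }
};

// The shortcut as described in the report: skip the recheck whenever
// PAGE_MAX_TRX_ID is the same in the leaf pass and the tree pass.
inline bool can_skip_recheck_unsafe(std::uint64_t saved,
                                    std::uint64_t current) {
  return saved == current;
}

// The stricter condition: additionally require that the ID seen during the
// leaf pass did not belong to a transaction that could still be active.
inline bool can_skip_recheck_safer(const PurgeViewModel &view,
                                   std::uint64_t saved,
                                   std::uint64_t current) {
  // Model assumption: PAGE_MAX_TRX_ID never decreases between the passes.
  assert(current >= saved);
  return saved == current && view.sees(saved);
}
```

      With a view whose cutoff is 0x1e, so that transaction 0x1e is just too new to be visible, the unsafe predicate accepts (saved, current) = (0x1e, 0x1e) while the stricter one rejects it and forces the recheck.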

      As far as I can tell, the impact of this bug is limited to some error log "spam" and the debug assertion failure. This should not cause any actual corruption.

      Attachments

        1. 10.6_fix.patch
          1 kB
          Debarun Banerjee
        2. 10.6_test_repeat.patch
          4 kB
          Debarun Banerjee

            People

              marko Marko Mäkelä
              marko Marko Mäkelä