MariaDB Server / MDEV-30148

Race condition between non-persistent statistics and purge of InnoDB history

Details

    Description

      mleich provided an rr record trace in which several threads executing btr_estimate_number_of_different_key_vals() were accessing the same page (innodb_page_size=64k) while a purge thread was holding an exclusive page latch and executing a page reorganize:

      Thread 7 (Thread 511063.511299 (mariadbd)):
      #0  0x0000560e2dbac68c in page_offset (ptr=0x7f4f15ba02fa) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/include/page0page.h:216
      #1  page_cur_insert_rec_low (cur=cur@entry=0x7f4f024406a0, rec=<optimized out>, offsets=offsets@entry=0x7f4f024406e0, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/page/page0cur.cc:1479
      #2  0x0000560e2dbc3c44 in page_copy_rec_list_end_no_locks (new_block=new_block@entry=0x7f4f1596fba0, block=block@entry=0x7f4f14d6ed00, rec=<optimized out>, index=index@entry=0x6160007d1a08, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/page/page0page.cc:477
      #3  0x0000560e2ded41a9 in btr_page_reorganize_low (cursor=cursor@entry=0x7f4f02440b40, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/btr/btr0btr.cc:1174
      #4  0x0000560e2ded730f in btr_page_reorganize_block (z_level=<optimized out>, block=block@entry=0x7f4f1596fba0, index=index@entry=0x6160007d1a08, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/btr/btr0btr.cc:1417
      #5  0x0000560e2ded78ce in btr_can_merge_with_page (cursor=cursor@entry=0x7f4f02441980, page_no=page_no@entry=33, merge_block=merge_block@entry=0x7f4f02440d10, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/btr/btr0btr.cc:5071
      #6  0x0000560e2deec714 in btr_compress (cursor=cursor@entry=0x7f4f02441980, adjust=adjust@entry=false, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/btr/btr0btr.cc:3421
      #7  0x0000560e2df25e1f in btr_cur_compress_if_useful (cursor=cursor@entry=0x7f4f02441980, adjust=adjust@entry=false, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/btr/btr0cur.cc:4624
      #8  0x0000560e2df448f4 in btr_cur_pessimistic_delete (err=err@entry=0x7f4f02441890, has_reserved_extents=has_reserved_extents@entry=0, cursor=cursor@entry=0x7f4f02441980, flags=flags@entry=0, rollback=rollback@entry=false, mtr=mtr@entry=0x7f4f02441c50) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/btr/btr0cur.cc:5053
      #9  0x0000560e2dd4b30c in row_purge_remove_sec_if_poss_tree (node=node@entry=0x61a00000cd08, index=index@entry=0x6160007d1a08, entry=entry@entry=0x619000046608) at /data/Server/bb-10.11-MDEV-29694/storage/innobase/row/row0purge.cc:392
      

      The crashing thread would have been blocked if it had tried to acquire any page latch. It is holding only a buffer-fix, which does not prevent the page from being modified:

      #5  0x0000560e2deb6f21 in ut_dbg_assertion_failed (
          expr=expr@entry=0x560e2ede6380 "page_offset(rec) <= page_header_get_field(page, PAGE_HEAP_TOP)", 
          file=file@entry=0x560e2ede6240 "/data/Server/bb-10.11-MDEV-29694/storage/innobase/include/page0page.inl", line=line@entry=310)
          at /data/Server/bb-10.11-MDEV-29694/storage/innobase/ut/ut0dbg.cc:60
      #6  0x0000560e2e0b2eea in page_rec_check (rec=0x7f4f15ba24f3 "")
          at /data/Server/bb-10.11-MDEV-29694/storage/innobase/include/page0page.inl:310
      #7  page_rec_is_supremum (rec=0x7f4f15ba24f3 "")
          at /data/Server/bb-10.11-MDEV-29694/storage/innobase/include/page0page.inl:165
      #8  btr_estimate_number_of_different_key_vals (
          index=index@entry=0x6160007d1a08, bulk_trx_id=<optimized out>)
          at /data/Server/bb-10.11-MDEV-29694/storage/innobase/dict/dict0stats.cc:1378
      #9  0x0000560e2e0b4866 in dict_stats_update_transient_for_index (
          index=index@entry=0x6160007d1a08)
          at /data/Server/bb-10.11-MDEV-29694/storage/innobase/dict/dict0stats.cc:1573
      #20 0x0000560e2c64e5b0 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62b0001ab218, packet=packet@entry=0x6290015cc219 "INSERT IGNORE INTO `oltp3` ( `id`, `k`) VALUES ( NULL, 245760000 ) /* E_R Thread4 QNO 507 CON_ID 19 */ ", packet_length=packet_length@entry=103, blocking=blocking@entry=true) at /data/Server/bb-10.11-MDEV-29694/sql/sql_parse.cc:1894
      

      It turns out that there is a lot of dead or unnecessary code in btr_cur_open_at_rnd_pos(). Each caller only requires a shared latch to be held on the returned leaf page.
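
      The difference between a buffer-fix and a shared page latch can be illustrated with a minimal single-threaded sketch. This is not InnoDB code: Block, PageLatch, BufferFix and the two scenario functions are hypothetical stand-ins invented for this example, and the toy latch only offers try-operations so that the outcome is deterministic. It shows that a reader holding only a buffer-fix does not exclude a writer that wants the exclusive latch (as the purge thread did during the page reorganize), whereas a reader holding a shared latch does.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy reader-writer latch (hypothetical, not InnoDB's latch implementation).
// state > 0: number of shared holders; state == -1: one exclusive holder.
class PageLatch {
  std::atomic<int> state{0};
public:
  bool try_s_lock() {
    int s = state.load();
    while (s >= 0) {
      // On failure, s is reloaded with the current value and we retry.
      if (state.compare_exchange_weak(s, s + 1)) return true;
    }
    return false;  // an exclusive holder exists
  }
  void s_unlock() {
    int prev = state.fetch_sub(1);
    assert(prev > 0);
    (void) prev;
  }
  bool try_x_lock() {
    int expected = 0;
    return state.compare_exchange_strong(expected, -1);
  }
  void x_unlock() { state.store(0); }
};

// Hypothetical stand-in for a buffer pool block.
struct Block {
  PageLatch latch;
  std::atomic<uint32_t> fix_count{0};  // plays the role of the buffer-fix count
  uint16_t heap_top = 120;             // plays the role of PAGE_HEAP_TOP
};

// A buffer-fix only pins the block in memory; it excludes nobody.
struct BufferFix {
  Block &b;
  explicit BufferFix(Block &blk) : b(blk) { b.fix_count.fetch_add(1); }
  ~BufferFix() { b.fix_count.fetch_sub(1); }
};

// Models a writer (such as purge) reorganizing the page.
// Returns true if the exclusive latch was granted and the page was changed.
bool writer_reorganize(Block &b) {
  if (!b.latch.try_x_lock()) return false;
  b.heap_top = 80;  // page contents change under the exclusive latch
  b.latch.x_unlock();
  return true;
}

// Reader protected only by a buffer-fix: the writer is admitted.
bool writer_admitted_during_fix_only_read(Block &b) {
  BufferFix fix(b);
  return writer_reorganize(b);
}

// Reader protected by buffer-fix plus shared latch: the writer is refused.
bool writer_admitted_during_latched_read(Block &b) {
  BufferFix fix(b);
  if (!b.latch.try_s_lock()) return true;  // not expected on a free latch
  bool admitted = writer_reorganize(b);
  b.latch.s_unlock();
  return admitted;
}
```

      In the first scenario the page changes underneath the reader, which corresponds to the kind of inconsistency that the failed page_rec_check() assertion detected. In the second scenario the writer has to wait until the shared latch is released.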

      MDEV-21136 fixed something similar, but the fix did not touch this code.

      We have not tested whether older versions are affected. It is possible or even likely, but the fix would be hard to port, because it would depend on MDEV-29603, which is only in 10.6.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä