Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-14637

Latching order violation during btr_cur_pessimistic_delete()

    Details

      Description

      When MySQL 5.7.1 introduced WL#6326 to reduce contention on the non-leaf levels of B-trees, it introduced a new rw-lock mode SX (not conflicting with S, but conflicting with SX and X) and new rules to go with it.

      A thread that is holding an dict_index_t::lock aka index->lock in SX mode is permitted to acquire non-leaf buf_block_t::lock aka block->lock X or SX mode, in monotonically descending order. That is, once the thread has acquired a block->lock, it is not allowed to acquire a lock on its parent or grandparent pages. Such ‘arbitrary-order access’ is only allowed when the thread acquired the index->lock in X mode upfront.

      A customer encountered a repeatable hang when loading a dump into InnoDB while using multiple innodb_purge_threads (default: 4). The dump makes very heavy use of FOREIGN KEY constraints. By luck, it happened so that two purge worker threads (srv_worker_thread) deadlocked with each other. Both were operating on the index FOR_REF of the InnoDB internal table SYS_FOREIGN. One of them was legitimately holding index->lock S-latch and the root block->lock S-latch. The other srv_worker_thread had acquired index->lock SX-latch, root block->lock SX-latch, and a bunch of other latches, including the fil_space_t::latch for freeing some blocks and some leaf page latches. This other thread was inside 2 nested calls to btr_compress() and it was trying to reacquire the root block->lock in X mode, violating the WL#6326 protocol.

      This violation led to a deadlock, because while S is compatible with SX and a thread can upgrade an SX lock to X when there are no conflicting requests, in this case there was a conflicting S lock held by the other purge worker thread.

      So, the two purge worker threads are now deadlocked with each other. Both are holding dict_operation_lock S-latch to prevent concurrent DROP TABLE operations (it would be better architecture to acquire MDL on the currently processed table name). Their dict_operation_lock would block any future DDL operations, and it would also block the master thread which is from time to time performing InnoDB table cache eviction (a feature introduced in MySQL 5.6; see MySQL Bug#84424). The user experienced a hang during a subsequent CREATE TABLE.

      Below is my detailed analysis of what the purge worker threads (Thread 4 and Thread 5) did, from the core dump that was helpfully provided by the customer.

      Thread 5 is performing a BTR_PURGE_LEAF dive into the index FOR_REF of
      SYS_FOREIGN. It successfully acquired the index S-latch, the root page
      S-latch, and is now trying to acquire an S-latch on the page 0:196633
      that Thread 16 is waiting an SX-latch for. As soon as it would get the
      S-latch, it would find the next child page pointer, which is a pointer
      to a leaf page, and then release the non-leaf page latches and the
      index->lock.

      We do not know yet who is holding the X-latch on the block->lock.

      Thread 4 is trying to acquire an X-latch on the root page of FOR_REF
      of SYS_FOREIGN. It is blocked by Thread 5.

      Also, there is nothing really bad about Thread 16, other than it seems
      to be unnecessarily acquiring a too strong latch (SX instead of S) on
      the block->lock.

      So, we must find out the lock holder:

      (gdb) p/x $13->lock
      $16 = {lock_word = 0xffffffffdfffffff, waiters = 0x1, sx_recursive = 0x0,
      writer_is_wait_ex = 0x0, writer_thread = 0x7fe7e3fff700,

      The writer_thread points to Thread 4 (Thread 0x7fe7e3fff700 (LWP 16824)).

      Thread 4 seems to be holding an X-latch on a level-1 node pointer and
      then trying to acquire the level-2 root page X-latch. This would be
      fine if it were holding the index->lock X-latch. But it is only
      holding the index->lock SX-latch. Let us look at its mini-transaction
      memo:
      (gdb) p/x (mtr_t*)0x7fe7e3ffe500
      $17 = 0x7fe7e3ffe500
      (gdb) p *(void**)$17.m_impl.m_memo.m_first_block.m_data@40
      $19 = {0x7fec5723fdd0, 0x80, 0x7fec13cf5410, 0x4, 0x7fec13cfb3b0, 0x2,
      0x7fea06a89060, 0x8, 0x7fea06a89858, 0x2, 0x7fea06a89060, 0x2,
      0x7fea06a896c0, 0x2, 0x7fec57213520, 0x40, 0x7fec13cf55a8, 0x4,
      0x7fec13cf5410, 0x8, 0x7fec13cfb3b0, 0x8, 0x7fec13cfb3b0, 0x2,
      0x7fea06a89858, 0x2, 0x7fea06a896c0, 0x2, 0x7fec13cf5410, 0x8,
      0x7fec13cfb3b0, 0x8, 0x7fec13cfb3b0, 0x2, 0x7fea06a89858, 0x2,
      0x7fea06a896c0, 0x2, 0x7fec13cf5410, 0x8}

      The first element is the &index->lock of our index=0x7fec5723fc40 (the
      index FOR_REF of SYS_FOREIGN). It is locked as MTR_MEMO_SX_LOCK=0x80.

      The second is MTR_MEMO_PAGE_SX_FIX(4) of the root page (0:47). Looks
      OK so far. It looks a little suspicious OK that the memo contains
      multiple references to this root page, in mode MTR_MEMO_BUF_FIX(8).
      Let us ignore those. The next page is MTR_MEMO_PAGE_X_FIX(2) of our
      interesting page (0:196633). I think that this access is still OK
      according to the WL#6326 rules. (It is certainly wrong of Thread 4 to
      go back to acquire the root page X-latch later.) Also this page
      reoccurs in the memo as MTR_MEMO_BUF_FIX(8). Let me pretty-print the
      memo, ignoring those entries:

      MTR_MEMO_SX_LOCK(index->lock) 0x7fec5723fdd0
      MTR_MEMO_PAGE_SX_FIX(0:47, root, level-2) 0x7fec13cf5410
      MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0
      MTR_MEMO_BUF_FIX(0:70787, leaf page) 0x7fea06a89060
      MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858

      It looks like there was no match on page 0:70787 and we proceeded to
      its FIL_PAGE_NEXT=0x11489=70793.

      The memo goes on with

      MTR_MEMO_PAGE_X_FIX(0:70787, (left sibling) leaf page) 0x7fea06a89060
      MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0

      Latching the left non-leaf sibling should be OK while holding the
      index->lock SX-latch. (Latching parent pages is forbidden.)
      Latching the right sibling should always be OK.

      Then we have this:

      MTR_MEMO_X_LOCK(fil_space->latch) on "innodb_system" 0x7fec57213520

      I think that this is normal for freeing pages. After all, we are
      executing btr_compress(), going to shrink the index tree.

      MTR_MEMO_PAGE_SX_FIX(0:0, allocation bitmap for pages 0..16383) 0x7fec13cf55a8
      MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410
      MTR_MEMO_BUF_FIX(0:196633, level-1) 0x7fec13cfb3b0
      MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0 (ok)
      MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410
      MTR_MEMO_BUF_FIX(0:196633, level-1) 0x7fec13cfb3b0
      MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0 (ok)
      MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410

      It turns out that a questionable condition had been included in MySQL 5.7, to sometimes acquire the index->lock in too weak SX mode, instead of the appropriate X mode.

      Because the decision to acquire the correct X mode lock depends on the transaction history list length and the number of pending read requests, the code might work correctly most of the time. Also, most ‘pessimistic’ operations do not involve multiple recursion steps to btr_compress(), like Thread 4 in our case:

      #8  pfs_rw_lock_x_lock_func
      #9  buf_page_get_gen
      #10 btr_block_get_func
      #11 btr_cur_search_to_nth_level
      #12 btr_page_get_father_node_ptr_func
      #13 btr_compress
      #14 btr_cur_compress_if_useful
      #15 btr_cur_pessimistic_delete
      #16 btr_compress
      #17 btr_cur_compress_if_useful
      #18 btr_cur_pessimistic_delete
      #19 row_purge_remove_sec_if_poss_tree
      #20 row_purge_remove_sec_if_poss
      #21 row_purge_del_mark
      #22 row_purge_record_func
      #23 row_purge
      #24 row_purge_step
      #25 que_thr_step
      #26 que_run_threads_low
      #27 que_run_threads
      #28 srv_task_execute
      #29 srv_worker_thread
      

      There is some code in btr_cur_search_to_nth_level() that tries to acquire all possibly needed sibling page locks in advance. It appears that this logic is not working correctly in this case, because the root page was only locked in SX mode, not X mode.

      While this deadlock occurred during purge, the BTR_LATCH_FOR_DELETE is also used in ROLLBACK. It is possible that this bug has been reported upstream in MySQL Bug#80919. Without a core dump or at least full stack traces of each thread, this cannot be confirmed or denied.

      While diagnosing this issue, I noticed that the page flush code is now acquiring block->lock in SX mode instead of S mode. This is likely because certain fields in index root pages are protected by SX latch, not X latch. The page flushing must ensure that no concurrent modifications are possible even to the fields in index root pages.

      The conservative fix would be to always acquire the index->lock in X mode for BTR_MODIFY_TREE operations. This will very likely degrade performance, because then the only cases for acquiring the index->lock in SX mode would seem to be operations on BLOBs.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                marko Marko Mäkelä
                Reporter:
                marko Marko Mäkelä
              • Votes:
                1 Vote for this issue
                Watchers:
                10 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: