Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-14637

Latching order violation during btr_cur_pessimistic_delete()

Details

    Description

      When MySQL 5.7.1 introduced WL#6326 to reduce contention on the non-leaf levels of B-trees, it introduced a new rw-lock mode SX (not conflicting with S, but conflicting with SX and X) and new rules to go with it.

      A thread that is holding an dict_index_t::lock aka index->lock in SX mode is permitted to acquire non-leaf buf_block_t::lock aka block->lock X or SX mode, in monotonically descending order. That is, once the thread has acquired a block->lock, it is not allowed to acquire a lock on its parent or grandparent pages. Such ‘arbitrary-order access’ is only allowed when the thread acquired the index->lock in X mode upfront.

      A customer encountered a repeatable hang when loading a dump into InnoDB while using multiple innodb_purge_threads (default: 4). The dump makes very heavy use of FOREIGN KEY constraints. By luck, it happened so that two purge worker threads (srv_worker_thread) deadlocked with each other. Both were operating on the index FOR_REF of the InnoDB internal table SYS_FOREIGN. One of them was legitimately holding index->lock S-latch and the root block->lock S-latch. The other srv_worker_thread had acquired index->lock SX-latch, root block->lock SX-latch, and a bunch of other latches, including the fil_space_t::latch for freeing some blocks and some leaf page latches. This other thread was inside 2 nested calls to btr_compress() and it was trying to reacquire the root block->lock in X mode, violating the WL#6326 protocol.

      This violation led to a deadlock, because while S is compatible with SX and a thread can upgrade an SX lock to X when there are no conflicting requests, in this case there was a conflicting S lock held by the other purge worker thread.

      So, the two purge worker threads are now deadlocked with each other. Both are holding dict_operation_lock S-latch to prevent concurrent DROP TABLE operations (it would be better architecture to acquire MDL on the currently processed table name). Their dict_operation_lock would block any future DDL operations, and it would also block the master thread which is from time to time performing InnoDB table cache eviction (a feature introduced in MySQL 5.6; see MySQL Bug#84424). The user experienced a hang during a subsequent CREATE TABLE.

      Below is my detailed analysis of what the purge worker threads (Thread 4 and Thread 5) did, from the core dump that was helpfully provided by the customer.

      Thread 5 is performing a BTR_PURGE_LEAF dive into the index FOR_REF of
      SYS_FOREIGN. It successfully acquired the index S-latch, the root page
      S-latch, and is now trying to acquire an S-latch on the page 0:196633
      that Thread 16 is waiting an SX-latch for. As soon as it would get the
      S-latch, it would find the next child page pointer, which is a pointer
      to a leaf page, and then release the non-leaf page latches and the
      index->lock.

      We do not know yet who is holding the X-latch on the block->lock.

      Thread 4 is trying to acquire an X-latch on the root page of FOR_REF
      of SYS_FOREIGN. It is blocked by Thread 5.

      Also, there is nothing really bad about Thread 16, other than it seems
      to be unnecessarily acquiring a too strong latch (SX instead of S) on
      the block->lock.

      So, we must find out the lock holder:

      (gdb) p/x $13->lock
      $16 = {lock_word = 0xffffffffdfffffff, waiters = 0x1, sx_recursive = 0x0,
      writer_is_wait_ex = 0x0, writer_thread = 0x7fe7e3fff700,

      The writer_thread points to Thread 4 (Thread 0x7fe7e3fff700 (LWP 16824)).

      Thread 4 seems to be holding an X-latch on a level-1 node pointer and
      then trying to acquire the level-2 root page X-latch. This would be
      fine if it were holding the index->lock X-latch. But it is only
      holding the index->lock SX-latch. Let us look at its mini-transaction
      memo:
      (gdb) p/x (mtr_t*)0x7fe7e3ffe500
      $17 = 0x7fe7e3ffe500
      (gdb) p *(void**)$17.m_impl.m_memo.m_first_block.m_data@40
      $19 = {0x7fec5723fdd0, 0x80, 0x7fec13cf5410, 0x4, 0x7fec13cfb3b0, 0x2,
      0x7fea06a89060, 0x8, 0x7fea06a89858, 0x2, 0x7fea06a89060, 0x2,
      0x7fea06a896c0, 0x2, 0x7fec57213520, 0x40, 0x7fec13cf55a8, 0x4,
      0x7fec13cf5410, 0x8, 0x7fec13cfb3b0, 0x8, 0x7fec13cfb3b0, 0x2,
      0x7fea06a89858, 0x2, 0x7fea06a896c0, 0x2, 0x7fec13cf5410, 0x8,
      0x7fec13cfb3b0, 0x8, 0x7fec13cfb3b0, 0x2, 0x7fea06a89858, 0x2,
      0x7fea06a896c0, 0x2, 0x7fec13cf5410, 0x8}

      The first element is the &index->lock of our index=0x7fec5723fc40 (the
      index FOR_REF of SYS_FOREIGN). It is locked as MTR_MEMO_SX_LOCK=0x80.

      The second is MTR_MEMO_PAGE_SX_FIX(4) of the root page (0:47). Looks
      OK so far. It looks a little suspicious OK that the memo contains
      multiple references to this root page, in mode MTR_MEMO_BUF_FIX(8).
      Let us ignore those. The next page is MTR_MEMO_PAGE_X_FIX(2) of our
      interesting page (0:196633). I think that this access is still OK
      according to the WL#6326 rules. (It is certainly wrong of Thread 4 to
      go back to acquire the root page X-latch later.) Also this page
      reoccurs in the memo as MTR_MEMO_BUF_FIX(8). Let me pretty-print the
      memo, ignoring those entries:

      MTR_MEMO_SX_LOCK(index->lock) 0x7fec5723fdd0
      MTR_MEMO_PAGE_SX_FIX(0:47, root, level-2) 0x7fec13cf5410
      MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0
      MTR_MEMO_BUF_FIX(0:70787, leaf page) 0x7fea06a89060
      MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858

      It looks like there was no match on page 0:70787 and we proceeded to
      its FIL_PAGE_NEXT=0x11489=70793.

      The memo goes on with

      MTR_MEMO_PAGE_X_FIX(0:70787, (left sibling) leaf page) 0x7fea06a89060
      MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0

      Latching the left non-leaf sibling should be OK while holding the
      index->lock SX-latch. (Latching parent pages is forbidden.)
      Latching the right sibling should always be OK.

      Then we have this:

      MTR_MEMO_X_LOCK(fil_space->latch) on "innodb_system" 0x7fec57213520

      I think that this is normal for freeing pages. After all, we are
      executing btr_compress(), going to shrink the index tree.

      MTR_MEMO_PAGE_SX_FIX(0:0, allocation bitmap for pages 0..16383) 0x7fec13cf55a8
      MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410
      MTR_MEMO_BUF_FIX(0:196633, level-1) 0x7fec13cfb3b0
      MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0 (ok)
      MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410
      MTR_MEMO_BUF_FIX(0:196633, level-1) 0x7fec13cfb3b0
      MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858 (ok, already X-latched)
      MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0 (ok)
      MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410

      It turns out that a questionable condition had been included in MySQL 5.7, to sometimes acquire the index->lock in too weak SX mode, instead of the appropriate X mode.

      Because the decision to acquire the correct X mode lock depends on the transaction history list length and the number of pending read requests, the code might work correctly most of the time. Also, most ‘pessimistic’ operations do not involve multiple recursion steps to btr_compress(), like Thread 4 in our case:

      #8  pfs_rw_lock_x_lock_func
      #9  buf_page_get_gen
      #10 btr_block_get_func
      #11 btr_cur_search_to_nth_level
      #12 btr_page_get_father_node_ptr_func
      #13 btr_compress
      #14 btr_cur_compress_if_useful
      #15 btr_cur_pessimistic_delete
      #16 btr_compress
      #17 btr_cur_compress_if_useful
      #18 btr_cur_pessimistic_delete
      #19 row_purge_remove_sec_if_poss_tree
      #20 row_purge_remove_sec_if_poss
      #21 row_purge_del_mark
      #22 row_purge_record_func
      #23 row_purge
      #24 row_purge_step
      #25 que_thr_step
      #26 que_run_threads_low
      #27 que_run_threads
      #28 srv_task_execute
      #29 srv_worker_thread
      

      There is some code in btr_cur_search_to_nth_level() that tries to acquire all possibly needed sibling page locks in advance. It appears that this logic is not working correctly in this case, because the root page was only locked in SX mode, not X mode.

      While this deadlock occurred during purge, the BTR_LATCH_FOR_DELETE is also used in ROLLBACK. It is possible that this bug has been reported upstream in MySQL Bug#80919. Without a core dump or at least full stack traces of each thread, this cannot be confirmed or denied.

      While diagnosing this issue, I noticed that the page flush code is now acquiring block->lock in SX mode instead of S mode. This is likely because certain fields in index root pages are protected by SX latch, not X latch. The page flushing must ensure that no concurrent modifications are possible even to the fields in index root pages.

      The conservative fix would be to always acquire the index->lock in X mode for BTR_MODIFY_TREE operations. This will very likely degrade performance, because then the only cases for acquiring the index->lock in SX mode would seem to be operations on BLOBs.

      Attachments

        Issue Links

          Activity

            axel Axel Schwenke added a comment -

            I have run a benchmark, comparing 10.2.11 and this branch for various write-only workload. For 16K page size (default) this branch is about same speed for INSERT and clearly slower (~20%) for UPDATE. With 4K page size it is slower for both INSERT (~5%) and UPDATE (~10%). Spread sheet with numbers attached.

            axel Axel Schwenke added a comment - I have run a benchmark, comparing 10.2.11 and this branch for various write-only workload. For 16K page size (default) this branch is about same speed for INSERT and clearly slower (~20%) for UPDATE. With 4K page size it is slower for both INSERT (~5%) and UPDATE (~10%). Spread sheet with numbers attached.

            The MySQL 5.7.23 release on Friday, July 27, 2018 finally published a fix that was apparently made almost 4 months earlier:
            BUG#26225783 MYSQL CRASH ON CREATE TABLE (REPRODUCEABLE) -> INNODB: A LONG SEMAPHORE WAIT
            Some weeks later, there was a follow-up fix; the fix should inadvertently changed the handling of spatial indexes, which was not affected.

            marko Marko Mäkelä added a comment - The MySQL 5.7.23 release on Friday, July 27, 2018 finally published a fix that was apparently made almost 4 months earlier: BUG#26225783 MYSQL CRASH ON CREATE TABLE (REPRODUCEABLE) -> INNODB: A LONG SEMAPHORE WAIT Some weeks later, there was a follow-up fix ; the fix should inadvertently changed the handling of spatial indexes, which was not affected.

            I reviewed the fix in MySQL 5.7.23. The root cause is not documented there, and neither is there a test case, but it seems to me that this only affects the purge (or rollback) on the tables SYS_FOREIGN and SYS_FOREIGN_COLS. That is, you would need some DDL operation on tables that contain FOREIGN KEY constraints. The purge should be triggered after DROP TABLE, RENAME TABLE, ALTER TABLE…ALGORITHM=COPY at least. Repeating this would require some bad luck (such as two purge threads operating on the same system table, and one of them deciding to merge a non-leaf page).

            The SYS_FOREIGN and SYS_FOREIGN_COLS tables are special in that they define key columns as the InnoDB internal SQL parser type CHAR, which translates to VARCHAR(0) (at most 0 bytes). However, InnoDB happily ignores the maximum size when inserting or updating records (in MariaDB or MySQL, it is the user SQL layer that enforces the limit). The wrong maximum length leads to a miscalculation of node_ptr_max_size.

            It might be possible to encounter this deadlock for other tables as well. The patch is addressing that as well. This could be lead to a hang during rollback or purge.

            marko Marko Mäkelä added a comment - I reviewed the fix in MySQL 5.7.23. The root cause is not documented there, and neither is there a test case, but it seems to me that this only affects the purge (or rollback) on the tables SYS_FOREIGN and SYS_FOREIGN_COLS . That is, you would need some DDL operation on tables that contain FOREIGN KEY constraints. The purge should be triggered after DROP TABLE , RENAME TABLE , ALTER TABLE…ALGORITHM=COPY at least. Repeating this would require some bad luck (such as two purge threads operating on the same system table, and one of them deciding to merge a non-leaf page). The SYS_FOREIGN and SYS_FOREIGN_COLS tables are special in that they define key columns as the InnoDB internal SQL parser type CHAR , which translates to VARCHAR(0) (at most 0 bytes). However, InnoDB happily ignores the maximum size when inserting or updating records (in MariaDB or MySQL, it is the user SQL layer that enforces the limit). The wrong maximum length leads to a miscalculation of node_ptr_max_size . It might be possible to encounter this deadlock for other tables as well. The patch is addressing that as well. This could be lead to a hang during rollback or purge.

            Other tables with insufficient maximum length for keys that are directly written by InnoDB are the persistent statistics tables mysql.innodb_index_stats and mysql.innodb_table_stats. As noted in MDEV-16850, MySQL 5.7.23 increased the table name (with partition or subpartition name) from VARCHAR(64) to VARCHAR(199). I feel that we must do the same, or otherwise MariaDB could hang in a similar way after DDL operations, this time on tables on which persistent statistics are enabled.

            marko Marko Mäkelä added a comment - Other tables with insufficient maximum length for keys that are directly written by InnoDB are the persistent statistics tables mysql.innodb_index_stats and  mysql.innodb_table_stats . As noted in MDEV-16850 , MySQL 5.7.23 increased the table name (with partition or subpartition name) from VARCHAR(64) to VARCHAR(199) . I feel that we must do the same, or otherwise MariaDB could hang in a similar way after DDL operations, this time on tables on which persistent statistics are enabled.
            marko Marko Mäkelä added a comment - - edited

            Increasing the column length would cause a bootstrap failure when using innodb_page_size=4k, as noted in MDEV-16845. MySQL 5.7.23 works around this by disabling the check for the two statistics tables:

            ERROR: 1071 Specified key was too long; max key length is 768 bytes

            In MariaDB, I made ha_innobase::max_supported_key_length() return a longer limit for all tables when using innodb_page_size=4k. This could allow tables to be created such that some INSERT or UPDATE operations could fail due to too long records being inserted into secondary indexes. I think that such failures are already possible, especially related to an InnoDB undo log format limitation when involving column prefix indexes on TEXT columns.

            The MariaDB fix is also tweaking the node_ptr_max_size calculation so that a larger size will be used for table_name in these tables, no matter with which size they were created.

            marko Marko Mäkelä added a comment - - edited Increasing the column length would cause a bootstrap failure when using innodb_page_size=4k , as noted in MDEV-16845 . MySQL 5.7.23 works around this by disabling the check for the two statistics tables: ERROR: 1071 Specified key was too long; max key length is 768 bytes In MariaDB, I made ha_innobase::max_supported_key_length() return a longer limit for all tables when using innodb_page_size=4k . This could allow tables to be created such that some INSERT or UPDATE operations could fail due to too long records being inserted into secondary indexes. I think that such failures are already possible, especially related to an InnoDB undo log format limitation when involving column prefix indexes on TEXT columns. The MariaDB fix is also tweaking the node_ptr_max_size calculation so that a larger size will be used for table_name in these tables, no matter with which size they were created.

            This bug was not completely fixed until MDEV-29835 in MariaDB Server 10.6.13.

            marko Marko Mäkelä added a comment - This bug was not completely fixed until MDEV-29835 in MariaDB Server 10.6.13.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.