[MDEV-14637] Latching order violation during btr_cur_pessimistic_delete() - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2.2, 10.3.0
Fix Version/s: 10.2.17, 10.3.9
Component/s: Storage Engine - InnoDB
Labels:
- upstream

Description

When MySQL 5.7.1 introduced WL#6326 to reduce contention on the non-leaf levels of B-trees, it introduced a new rw-lock mode SX (not conflicting with S, but conflicting with SX and X) and new rules to go with it.

A thread that is holding an dict_index_t::lock aka index->lock in SX mode is permitted to acquire non-leaf buf_block_t::lock aka block->lock X or SX mode, in monotonically descending order. That is, once the thread has acquired a block->lock, it is not allowed to acquire a lock on its parent or grandparent pages. Such ‘arbitrary-order access’ is only allowed when the thread acquired the index->lock in X mode upfront.

A customer encountered a repeatable hang when loading a dump into InnoDB while using multiple innodb_purge_threads (default: 4). The dump makes very heavy use of FOREIGN KEY constraints. By luck, it happened so that two purge worker threads (srv_worker_thread) deadlocked with each other. Both were operating on the index FOR_REF of the InnoDB internal table SYS_FOREIGN. One of them was legitimately holding index->lock S-latch and the root block->lock S-latch. The other srv_worker_thread had acquired index->lock SX-latch, root block->lock SX-latch, and a bunch of other latches, including the fil_space_t::latch for freeing some blocks and some leaf page latches. This other thread was inside 2 nested calls to btr_compress() and it was trying to reacquire the root block->lock in X mode, violating the WL#6326 protocol.

This violation led to a deadlock, because while S is compatible with SX and a thread can upgrade an SX lock to X when there are no conflicting requests, in this case there was a conflicting S lock held by the other purge worker thread.

So, the two purge worker threads are now deadlocked with each other. Both are holding dict_operation_lock S-latch to prevent concurrent DROP TABLE operations (it would be better architecture to acquire MDL on the currently processed table name). Their dict_operation_lock would block any future DDL operations, and it would also block the master thread which is from time to time performing InnoDB table cache eviction (a feature introduced in MySQL 5.6; see MySQL Bug#84424). The user experienced a hang during a subsequent CREATE TABLE.

Below is my detailed analysis of what the purge worker threads (Thread 4 and Thread 5) did, from the core dump that was helpfully provided by the customer.

Thread 5 is performing a BTR_PURGE_LEAF dive into the index FOR_REF of
SYS_FOREIGN. It successfully acquired the index S-latch, the root page
S-latch, and is now trying to acquire an S-latch on the page 0:196633
that Thread 16 is waiting an SX-latch for. As soon as it would get the
S-latch, it would find the next child page pointer, which is a pointer
to a leaf page, and then release the non-leaf page latches and the
index->lock.

We do not know yet who is holding the X-latch on the block->lock.

Thread 4 is trying to acquire an X-latch on the root page of FOR_REF
of SYS_FOREIGN. It is blocked by Thread 5.

Also, there is nothing really bad about Thread 16, other than it seems
to be unnecessarily acquiring a too strong latch (SX instead of S) on
the block->lock.

So, we must find out the lock holder:

(gdb) p/x $13->lock
$16 = {lock_word = 0xffffffffdfffffff, waiters = 0x1, sx_recursive = 0x0,
writer_is_wait_ex = 0x0, writer_thread = 0x7fe7e3fff700,

The writer_thread points to Thread 4 (Thread 0x7fe7e3fff700 (LWP 16824)).

Thread 4 seems to be holding an X-latch on a level-1 node pointer and
then trying to acquire the level-2 root page X-latch. This would be
fine if it were holding the index->lock X-latch. But it is only
holding the index->lock SX-latch. Let us look at its mini-transaction
memo:
(gdb) p/x (mtr_t*)0x7fe7e3ffe500
$17 = 0x7fe7e3ffe500
(gdb) p *(void**)$17.m_impl.m_memo.m_first_block.m_data@40
$19 = {0x7fec5723fdd0, 0x80, 0x7fec13cf5410, 0x4, 0x7fec13cfb3b0, 0x2,
0x7fea06a89060, 0x8, 0x7fea06a89858, 0x2, 0x7fea06a89060, 0x2,
0x7fea06a896c0, 0x2, 0x7fec57213520, 0x40, 0x7fec13cf55a8, 0x4,
0x7fec13cf5410, 0x8, 0x7fec13cfb3b0, 0x8, 0x7fec13cfb3b0, 0x2,
0x7fea06a89858, 0x2, 0x7fea06a896c0, 0x2, 0x7fec13cf5410, 0x8,
0x7fec13cfb3b0, 0x8, 0x7fec13cfb3b0, 0x2, 0x7fea06a89858, 0x2,
0x7fea06a896c0, 0x2, 0x7fec13cf5410, 0x8}

The first element is the &index->lock of our index=0x7fec5723fc40 (the
index FOR_REF of SYS_FOREIGN). It is locked as MTR_MEMO_SX_LOCK=0x80.

The second is MTR_MEMO_PAGE_SX_FIX(4) of the root page (0:47). Looks
OK so far. It looks a little suspicious OK that the memo contains
multiple references to this root page, in mode MTR_MEMO_BUF_FIX(8).
Let us ignore those. The next page is MTR_MEMO_PAGE_X_FIX(2) of our
interesting page (0:196633). I think that this access is still OK
according to the WL#6326 rules. (It is certainly wrong of Thread 4 to
go back to acquire the root page X-latch later.) Also this page
reoccurs in the memo as MTR_MEMO_BUF_FIX(8). Let me pretty-print the
memo, ignoring those entries:

MTR_MEMO_SX_LOCK(index->lock) 0x7fec5723fdd0
MTR_MEMO_PAGE_SX_FIX(0:47, root, level-2) 0x7fec13cf5410
MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0
MTR_MEMO_BUF_FIX(0:70787, leaf page) 0x7fea06a89060
MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858

It looks like there was no match on page 0:70787 and we proceeded to
its FIL_PAGE_NEXT=0x11489=70793.

The memo goes on with

MTR_MEMO_PAGE_X_FIX(0:70787, (left sibling) leaf page) 0x7fea06a89060
MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0

Latching the left non-leaf sibling should be OK while holding the
index->lock SX-latch. (Latching parent pages is forbidden.)
Latching the right sibling should always be OK.

Then we have this:

MTR_MEMO_X_LOCK(fil_space->latch) on "innodb_system" 0x7fec57213520

I think that this is normal for freeing pages. After all, we are
executing btr_compress(), going to shrink the index tree.

MTR_MEMO_PAGE_SX_FIX(0:0, allocation bitmap for pages 0..16383) 0x7fec13cf55a8
MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410
MTR_MEMO_BUF_FIX(0:196633, level-1) 0x7fec13cfb3b0
MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0 (ok, already X-latched)
MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858 (ok, already X-latched)
MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0 (ok)
MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410
MTR_MEMO_BUF_FIX(0:196633, level-1) 0x7fec13cfb3b0
MTR_MEMO_PAGE_X_FIX(0:196633, level-1) 0x7fec13cfb3b0 (ok, already X-latched)
MTR_MEMO_PAGE_X_FIX(0:70793, leaf page) 0x7fea06a89858 (ok, already X-latched)
MTR_MEMO_PAGE_X_FIX(0:70791, (right sibling) leaf page) 0x7fea06a896c0 (ok)
MTR_MEMO_BUF_FIX(0:47, root, level-2) 0x7fec13cf5410

It turns out that a questionable condition had been included in MySQL 5.7, to sometimes acquire the index->lock in too weak SX mode, instead of the appropriate X mode.

Because the decision to acquire the correct X mode lock depends on the transaction history list length and the number of pending read requests, the code might work correctly most of the time. Also, most ‘pessimistic’ operations do not involve multiple recursion steps to btr_compress(), like Thread 4 in our case:

#8  pfs_rw_lock_x_lock_func

#9  buf_page_get_gen

#10 btr_block_get_func

#11 btr_cur_search_to_nth_level

#12 btr_page_get_father_node_ptr_func

#13 btr_compress

#14 btr_cur_compress_if_useful

#15 btr_cur_pessimistic_delete

#16 btr_compress

#17 btr_cur_compress_if_useful

#18 btr_cur_pessimistic_delete

#19 row_purge_remove_sec_if_poss_tree

#20 row_purge_remove_sec_if_poss

#21 row_purge_del_mark

#22 row_purge_record_func

#23 row_purge

#24 row_purge_step

#25 que_thr_step

#26 que_run_threads_low

#27 que_run_threads

#28 srv_task_execute

#29 srv_worker_thread

There is some code in btr_cur_search_to_nth_level() that tries to acquire all possibly needed sibling page locks in advance. It appears that this logic is not working correctly in this case, because the root page was only locked in SX mode, not X mode.

While this deadlock occurred during purge, the BTR_LATCH_FOR_DELETE is also used in ROLLBACK. It is possible that this bug has been reported upstream in MySQL Bug#80919. Without a core dump or at least full stack traces of each thread, this cannot be confirmed or denied.

While diagnosing this issue, I noticed that the page flush code is now acquiring block->lock in SX mode instead of S mode. This is likely because certain fields in index root pages are protected by SX latch, not X latch. The page flushing must ensure that no concurrent modifications are possible even to the fields in index root pages.

The conservative fix would be to always acquire the index->lock in X mode for BTR_MODIFY_TREE operations. This will very likely degrade performance, because then the only cases for acquiring the index->lock in SX mode would seem to be operations on BLOBs.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

MDEV-14637b.ods
90 kB
2017-12-19 11:23

Issue Links

causes

MDEV-25101 Assertion `!strcmp(field->name, "table_name")' failed in btr_node_ptr_max_size after adding index on InnoDB stats table

Closed

relates to

MDEV-20543 Latching order violation during B-tree operations

Closed

MDEV-21509 Possible hang during purge of history, or transaction rollback

Closed

MDEV-24673 btr_cur_pessimistic_delete() fails to shrink the index B-tree

Open

MDEV-29835 Partial server freeze

Closed

MDEV-13360 too long values in mysql.innodb_table_stats.table_name

Stalled

MDEV-13935 INSERT INTO stuck at state "Unlocking tables"

Closed

MDEV-15154 WSREP: BF lock wait long after a TRUNCATE TABLE

Closed

MDEV-16845 Failure upon upgrade from 5.7.23 to 10.2 and 10.3, pagesize 4K

Closed

MDEV-17815 Assertion `!strcmp(index->table->name.m_name, "SYS_FOREIGN") || !strcmp(index->table->name.m_name, "SYS_FOREIGN_COLS")' failed in btr_node_ptr_max_size upon reducing column length

Closed

MDEV-17988 mysql_upgrade failed to revert innodb_index/table_stats tables column datatype length

Closed

MDEV-19216 Assertion `!strcmp(index->table->name.m_name, "SYS_FOREIGN") || !strcmp(index->table->name.m_name, "SYS_FOREIGN_COLS")' failed in btr_node_ptr_max_size

Closed

(7 relates to)

Activity

Ascending order - Click to sort in descending order

Axel Schwenke added a comment - 2017-12-19 11:23

I have run a benchmark, comparing 10.2.11 and this branch for various write-only workload. For 16K page size (default) this branch is about same speed for INSERT and clearly slower (~20%) for UPDATE. With 4K page size it is slower for both INSERT (~5%) and UPDATE (~10%). Spread sheet with numbers attached.

Axel Schwenke added a comment - 2017-12-19 11:23 I have run a benchmark, comparing 10.2.11 and this branch for various write-only workload. For 16K page size (default) this branch is about same speed for INSERT and clearly slower (~20%) for UPDATE. With 4K page size it is slower for both INSERT (~5%) and UPDATE (~10%). Spread sheet with numbers attached.

Marko Mäkelä added a comment - 2018-07-30 06:18

The MySQL 5.7.23 release on Friday, July 27, 2018 finally published a fix that was apparently made almost 4 months earlier:
BUG#26225783 MYSQL CRASH ON CREATE TABLE (REPRODUCEABLE) -> INNODB: A LONG SEMAPHORE WAIT
Some weeks later, there was a follow-up fix; the fix should inadvertently changed the handling of spatial indexes, which was not affected.

Marko Mäkelä added a comment - 2018-07-30 06:18 The MySQL 5.7.23 release on Friday, July 27, 2018 finally published a fix that was apparently made almost 4 months earlier: BUG#26225783 MYSQL CRASH ON CREATE TABLE (REPRODUCEABLE) -> INNODB: A LONG SEMAPHORE WAIT Some weeks later, there was a follow-up fix ; the fix should inadvertently changed the handling of spatial indexes, which was not affected.

Marko Mäkelä added a comment - 2018-07-31 15:27

I reviewed the fix in MySQL 5.7.23. The root cause is not documented there, and neither is there a test case, but it seems to me that this only affects the purge (or rollback) on the tables SYS_FOREIGN and SYS_FOREIGN_COLS. That is, you would need some DDL operation on tables that contain FOREIGN KEY constraints. The purge should be triggered after DROP TABLE, RENAME TABLE, ALTER TABLE…ALGORITHM=COPY at least. Repeating this would require some bad luck (such as two purge threads operating on the same system table, and one of them deciding to merge a non-leaf page).

The SYS_FOREIGN and SYS_FOREIGN_COLS tables are special in that they define key columns as the InnoDB internal SQL parser type CHAR, which translates to VARCHAR(0) (at most 0 bytes). However, InnoDB happily ignores the maximum size when inserting or updating records (in MariaDB or MySQL, it is the user SQL layer that enforces the limit). The wrong maximum length leads to a miscalculation of node_ptr_max_size.

It might be possible to encounter this deadlock for other tables as well. The patch is addressing that as well. This could be lead to a hang during rollback or purge.

Marko Mäkelä added a comment - 2018-07-31 15:27 I reviewed the fix in MySQL 5.7.23. The root cause is not documented there, and neither is there a test case, but it seems to me that this only affects the purge (or rollback) on the tables SYS_FOREIGN and SYS_FOREIGN_COLS . That is, you would need some DDL operation on tables that contain FOREIGN KEY constraints. The purge should be triggered after DROP TABLE , RENAME TABLE , ALTER TABLE…ALGORITHM=COPY at least. Repeating this would require some bad luck (such as two purge threads operating on the same system table, and one of them deciding to merge a non-leaf page). The SYS_FOREIGN and SYS_FOREIGN_COLS tables are special in that they define key columns as the InnoDB internal SQL parser type CHAR , which translates to VARCHAR(0) (at most 0 bytes). However, InnoDB happily ignores the maximum size when inserting or updating records (in MariaDB or MySQL, it is the user SQL layer that enforces the limit). The wrong maximum length leads to a miscalculation of node_ptr_max_size . It might be possible to encounter this deadlock for other tables as well. The patch is addressing that as well. This could be lead to a hang during rollback or purge.

Marko Mäkelä added a comment - 2018-07-31 16:05

Other tables with insufficient maximum length for keys that are directly written by InnoDB are the persistent statistics tables mysql.innodb_index_stats and mysql.innodb_table_stats. As noted in ~~MDEV-16850~~, MySQL 5.7.23 increased the table name (with partition or subpartition name) from VARCHAR(64) to VARCHAR(199). I feel that we must do the same, or otherwise MariaDB could hang in a similar way after DDL operations, this time on tables on which persistent statistics are enabled.

Marko Mäkelä added a comment - 2018-07-31 16:05 Other tables with insufficient maximum length for keys that are directly written by InnoDB are the persistent statistics tables mysql.innodb_index_stats and mysql.innodb_table_stats . As noted in MDEV-16850 , MySQL 5.7.23 increased the table name (with partition or subpartition name) from VARCHAR(64) to VARCHAR(199) . I feel that we must do the same, or otherwise MariaDB could hang in a similar way after DDL operations, this time on tables on which persistent statistics are enabled.

Marko Mäkelä added a comment - 2018-07-31 16:40 - edited

Increasing the column length would cause a bootstrap failure when using innodb_page_size=4k, as noted in ~~MDEV-16845~~. MySQL 5.7.23 works around this by disabling the check for the two statistics tables:

ERROR: 1071 Specified key was too long; max key length is 768 bytes

In MariaDB, I made ha_innobase::max_supported_key_length() return a longer limit for all tables when using innodb_page_size=4k. This could allow tables to be created such that some INSERT or UPDATE operations could fail due to too long records being inserted into secondary indexes. I think that such failures are already possible, especially related to an InnoDB undo log format limitation when involving column prefix indexes on TEXT columns.

The MariaDB fix is also tweaking the node_ptr_max_size calculation so that a larger size will be used for table_name in these tables, no matter with which size they were created.

Marko Mäkelä added a comment - 2018-07-31 16:40 - edited Increasing the column length would cause a bootstrap failure when using innodb_page_size=4k , as noted in MDEV-16845 . MySQL 5.7.23 works around this by disabling the check for the two statistics tables: ERROR: 1071 Specified key was too long; max key length is 768 bytes In MariaDB, I made ha_innobase::max_supported_key_length() return a longer limit for all tables when using innodb_page_size=4k . This could allow tables to be created such that some INSERT or UPDATE operations could fail due to too long records being inserted into secondary indexes. I think that such failures are already possible, especially related to an InnoDB undo log format limitation when involving column prefix indexes on TEXT columns. The MariaDB fix is also tweaking the node_ptr_max_size calculation so that a larger size will be used for table_name in these tables, no matter with which size they were created.

Marko Mäkelä added a comment - 2023-03-17 08:38

This bug was not completely fixed until ~~MDEV-29835~~ in MariaDB Server 10.6.13.

Marko Mäkelä added a comment - 2023-03-17 08:38 This bug was not completely fixed until MDEV-29835 in MariaDB Server 10.6.13.

MariaDB Server

Latching order violation during btr_cur_pessimistic_delete()

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration