[MDEV-29081] trx_t::lock.was_chosen_as_deadlock_victim race in lock_wait_end() Created: 2022-07-11  Updated: 2022-09-26  Resolved: 2022-08-25

Status: Closed
Project: MariaDB Server
Component/s: Data Manipulation - Update, Storage Engine - InnoDB
Affects Version/s: 10.8.3
Fix Version/s: 10.6.10, 10.7.6, 10.8.5, 10.9.3, 10.10.2, 10.11.0

Type: Bug Priority: Critical
Reporter: Zuming Jiang Assignee: Vladislav Lesin
Resolution: Fixed Votes: 0
Labels: crash
Environment:

Ubuntu 20.04


Attachments: File MDEV-29081.test     Text File bug_report.txt     File mariadb_reproduce.cc     File mysql_bk.sql     File reproduce.sh     File stmts.sql     Text File tid.txt    
Issue Links:
Duplicate
duplicates MDEV-26809 Assertion `lock_rec_has_expl(LOCK_X |... Closed
Relates
relates to MDEV-29323 Galera ha_abort_transaction is not ho... Open
relates to MDEV-29515 innodb.deadlock_victim_race is unstable Closed
relates to MDEV-29635 race on trx->lock.wait_lock in deadlo... Closed

 Description   

I used my fuzzing tool to test MariaDB and found a transaction-related bug that can abort the server.

MariaDB installation:
1) cd mariadb-10.8.3
2) mkdir build; cd build
3) cmake .. -DCMAKE_BUILD_TYPE=Debug
4) make -j12 && sudo make install

Test driver compilation:
Note: "mariadb_reproduce" sets up several transactions and executes SQL statements according to /tmp/mysql_bk.sql, /tmp/stmts.sql and /tmp/tid.txt
1) g++ -I/usr/local/mysql/include/ mariadb_reproduce.cc -o mariadb_reproduce -lmysqlclient -g

Reproduce the bug:
1) cp mysql_bk.sql /tmp; cp stmts.sql /tmp; cp tid.txt /tmp
2) export ASAN_OPTIONS=detect_leaks=0
3) /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql &
4) bash reproduce.sh # "reproduce.sh" repeatedly executes "mariadb_reproduce" 100 times. "reproduce.sh" and "mariadb_reproduce" should be in the same directory

I have simplified the content of fuzz.sql, and I hope fuzz.sql can help you reproduce and fix the bug. In addition, I attached the failure report (which includes the stack trace).

This bug appears to be concurrency-related and cannot be reproduced reliably. However, it did occur and crashed the server in my tests. To improve the odds of triggering it, I use "reproduce.sh" to execute "mariadb_reproduce" 100 times. I hope you can reproduce the bug successfully.
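The attached reproduce.sh is only described above, not shown; a minimal sketch of the described loop could look like the following (the run_reproduce_loop name is invented here, and the real script may differ):

```shell
#!/bin/sh
# Hypothetical sketch of what reproduce.sh is described to do: run the
# given command up to 100 times and stop early once it fails, i.e. once
# the server has crashed and the connection is lost.
run_reproduce_loop() {
  attempts=0
  while [ "$attempts" -lt 100 ]; do
    attempts=$((attempts + 1))
    if ! "$@"; then
      echo "failed on attempt $attempts"
      break
    fi
  done
  echo "total attempts: $attempts"
}
```

Usage would be `run_reproduce_loop ./mariadb_reproduce`, with both files in the same directory as the description requires.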



 Comments   
Comment by Alice Sherepa [ 2022-07-11 ]

from the log (bug_report.txt):

mysqld: /home/mysql/mariadb/storage/innobase/lock/lock0lock.cc:5994: bool lock_trx_has_expl_x_lock(const trx_t&, const dict_table_t&,
page_id_t, ulint): Assertion `lock_rec_has_expl(LOCK_X | 1024U, g.cell(), id, heap_no, &trx)' failed.
220711 14:08:02 [ERROR] mysqld got signal 6 ;
 
Server version: 10.8.3-MariaDB-debug
 
mysys/stacktrace.c:212(my_print_stacktrace)[0x558f7165af7e]
sql/signal_handler.cc:226(handle_fatal_signal)[0x558f70da8bd0]
??:0(__restore_rt)[0x7f91f7ba63c0]
??:0(gsignal)[0x7f91f768f03b]
??:0(abort)[0x7f91f766e859]
/lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7f91f766e729]
??:0(__assert_fail)[0x7f91f7680006]
lock/lock0lock.cc:5993(lock_trx_has_expl_x_lock(trx_t const&, dict_table_t const&, page_id_t, unsigned long))[0x558f7126d6ad]
row/row0upd.cc:2661(row_upd_clust_step(upd_node_t*, que_thr_t*))[0x558f713c33d5]
row/row0upd.cc:2790(row_upd(upd_node_t*, que_thr_t*))[0x558f713c3b10]
row/row0upd.cc:2935(row_upd_step(que_thr_t*))[0x558f713c410a]
row/row0mysql.cc:1702(row_update_for_mysql(row_prebuilt_t*))[0x558f7137028a]
handler/ha_innodb.cc:8582(ha_innobase::update_row(unsigned char const*, unsigned char const*))[0x558f711aca56]
sql/handler.cc:7602(handler::ha_update_row(unsigned char const*, unsigned char const*))[0x558f70dc2dc5]
sql/sql_update.cc:1088(mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*))[0x558f70b3ccb7]
sql/sql_parse.cc:4405(mysql_execute_command(THD*, bool))[0x558f709e954e]
sql/sql_parse.cc:8027(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x558f709f57c2]
sql/sql_parse.cc:1896(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x558f709e1cae]
sql/sql_parse.cc:1407(do_command(THD*, bool))[0x558f709e068f]
sql/sql_connect.cc:1418(do_handle_one_connection(CONNECT*, bool))[0x558f70bb9df1]
sql/sql_connect.cc:1314(handle_one_connection)[0x558f70bb9a81]
perfschema/pfs.cc:2203(pfs_spawn_thread)[0x558f710ca123]
??:0(start_thread)[0x7f91f7b9a609]
??:0(clone)[0x7f91f776b163]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f91ac0140d0): update t_khn17c set 
  wkey = 158, 
  c_ci0ked = PI(), 
  c_qdo5gb = t_khn17c.pkey
where t_khn17c.pkey = t_khn17c.pkey
 
Connection ID (thread ID): 377

it seems to be the same bug as MDEV-26809

Comment by Marko Mäkelä [ 2022-07-14 ]

I think that this is best kept open until a fix has been developed and tested.

Comment by Marko Mäkelä [ 2022-07-27 ]

Because the trace in MDEV-26809 seems to have been invalidated by changes to shared libraries, I am unable to debug it.

To be able to compile mariadb_reproduce.cc with the libmariadb-dev that is installed on my Debian system, I had to change the name of the first header file:

#include <mariadb/mysql.h>

After that, the executable can be built with GCC 12.1.0:

g++ mariadb_reproduce.cc -o mariadb_reproduce -lmariadb

I also had to replace the hard-coded path /usr/local/mysql/bin/mysql with the path of a mariadb client executable, and I had to create a symlink from /run/mysqld/mysqld.sock to the socket that my server is using.
With those changes, on my first try of

while ./mariadb_reproduce; do :; done

I was able to reproduce the crash in less than 100 attempts. The crash that I got was:

10.6 772e3f61eb2956143bbdb1d85490bd5aad73d43c

Version: '10.6.9-MariaDB-debug'  socket: '/dev/shm/mariadb.sock'  port: 3306  Source distribution
mariadbd: /mariadb/10.6/storage/innobase/lock/lock0lock.cc:6023: bool lock_trx_has_expl_x_lock(const trx_t&, const dict_table_t&, page_id_t, ulint): Assertion `lock_rec_has_expl(LOCK_X | 1024U, g.cell(), id, heap_no, &trx)' failed.

My subsequent attempts to reproduce this crash failed. I will keep trying, also by running the server under https://rr-project.org so that I will get something debuggable.

Comment by Marko Mäkelä [ 2022-07-27 ]

After I rebuilt the executable using cmake -DPLUGIN_PERFSCHEMA=NO so that rr record would have to deal with fewer conditional branches, the test program became much slower, as if it were frequently waiting for innodb_lock_wait_timeout. After I restarted the server with innodb_lock_wait_timeout=0, the test program became fast again, but failed to reproduce anything.

There does not seem to be much hope to get an rr replay trace from this test case. I will try to produce a core dump again and then try to figure it out from there.

Comment by Marko Mäkelä [ 2022-07-27 ]

I was able to reproduce this a few more times while running the server inside GDB. I had to recompile the server with GCC 11, because the type information emitted by GCC 12.1.0 would often cause GDB 12.1 to crash.

Often, depending on the server build options, the test program would end up frequently waiting for innodb_lock_wait_timeout. In such cases, I restarted the server from scratch to try again. Altogether, I was only able to reproduce the failure a few times during several hours. Before the last-but-one crash, the while loop around the script had been running for about 7 minutes. The very first and the last crash that I checked were instantaneous.

I only debugged the last crash. The assertion in row_upd_clust_step() that invoked lock_trx_has_expl_x_lock() passed heap_no=49 to that call, corresponding to the record pkey=153000 in the clustered index, which consisted of a single page. The record is exclusively locked by another transaction, which at the time of the assertion failure is waiting in lock_wait() for a lock on heap_no=2 (pkey=28000), on which the crashing thread holds an exclusive lock.

I think that this must be reproduced under rr record so that we can determine what is actually going on. While I was typing this, ramesh wrote that he just succeeded at that. I will debug that trace shortly.

Comment by Marko Mäkelä [ 2022-07-27 ]

This appears to be a race condition that involves 2 threads that are concurrently blocked in lock_wait() while a 3rd thread is creating an explicit record lock for a record that is implicitly locked by one of the lock_wait() threads. The rr replay trace that I got is very similar to the GDB session that I analyzed.

I debugged it this far:

ssh galapq
rr replay /test/mtest/10.10/mariadb-10.10.0-linux-x86_64/rr/latest-trace
continue
set $l=(ib_lock_t *) 0x1538dfb8be00
watch -l *(size_t*)&$l.hash.hash[1]
reverse-continue

The record in question is heap_no=50 (pkey=154000,DB_TRX_ID=0x188a). In my GDB session, it was heap_no=49 and pkey=153000. In the rr replay session, this transaction was associated with Thread 10, which is one of the threads that was blocked in lock_wait() when the implicit-to-explicit conversion occurred.

I think that we must check whether those 2 lock_wait() calls are waiting for a lock on the same record, and what exactly happens to the record between that time and the crash.

Comment by Vladislav Lesin [ 2022-07-29 ]

The scenario is the following:

1. Trx 6287 does a semi-consistent read for UPDATE. See the sel_set_rec_lock() call from row_search_mvcc() just after the no_gap_lock label. sel_set_rec_lock() returns DB_LOCK_WAIT, and row_search_mvcc() reads the offsets of the committed version of the record in the row_sel_build_committed_vers_for_mysql() call. Then it checks what happened to the lock it set on the record by calling lock_trx_handle_wait(), which in turn waits for lock_sys.wait_mutex.

2. Trx 6290 does a read for INSERT. It was just woken up in lock_wait() after waiting for some lock to be released and holds lock_sys.wait_mutex, which trx 6287 is waiting for. It performs deadlock checking, chooses trx 6287 as the victim, and then cancels the wait on the lock created in step 1:
2a. It sets victim->lock.was_chosen_as_deadlock_victim= true in Deadlock::report(),
2b. It dequeues the lock from the page, resets the wait state, the transaction's waiting lock, etc., and then invokes lock_wait_end(); see lock_cancel_waiting_and_release() for details,
2c. lock_wait_end() resets trx->lock.was_chosen_as_deadlock_victim and sets trx->error_state= DB_DEADLOCK,
2d. lock_sys.wait_mutex is released.

3. Trx 6287 acquires lock_sys.wait_mutex in lock_trx_handle_wait() and checks trx->lock.was_chosen_as_deadlock_victim; as it is no longer set, it returns DB_SUCCESS to row_search_mvcc() instead of DB_DEADLOCK. row_search_mvcc() continues execution as if the lock had been granted.

4. Trx 6287 hits the assertion failure in row_upd_clust_step() because the X-lock on the record was not set in row_search_mvcc() in step 3.

Suggested fix:
Since trx->lock.was_chosen_as_deadlock_victim is reset in lock_wait_end(), perhaps lock_trx_handle_wait() should check trx->error_state instead of trx->lock.was_chosen_as_deadlock_victim to decide whether to return DB_DEADLOCK?
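Steps 2–3 can be mimicked outside InnoDB with a small single-file analogue (all names here — Trx, wait_mutex, handle_wait_buggy/handle_wait_fixed — are invented for the sketch and are not InnoDB's real declarations): once the resolver has cleared the flag and recorded DB_DEADLOCK, a check that looks only at the flag misses the deadlock, while a check that also consults error_state under the same mutex catches it.

```cpp
#include <mutex>

// Invented miniature of the state involved; not InnoDB's real types.
enum dberr { DB_SUCCESS, DB_DEADLOCK };

struct Trx {
  bool was_chosen_as_deadlock_victim = false;
  dberr error_state = DB_SUCCESS;
};

std::mutex wait_mutex;  // stands in for lock_sys.wait_mutex

// Step 2 (trx 6290's thread): mark the victim, cancel its wait and, as
// lock_wait_end() does, clear the flag while recording DB_DEADLOCK.
// Called here sequentially for determinism; in the server it runs on
// another thread while trx 6287 is blocked on the mutex.
void deadlock_resolver(Trx& victim) {
  std::lock_guard<std::mutex> g(wait_mutex);
  victim.was_chosen_as_deadlock_victim = true;   // 2a: Deadlock::report()
  // ... 2b: the waiting lock is dequeued and released here ...
  victim.was_chosen_as_deadlock_victim = false;  // 2c: lock_wait_end()
  victim.error_state = DB_DEADLOCK;              // 2c, under the mutex
}

// Step 3, buggy shape of lock_trx_handle_wait(): only the flag is
// checked, so the deadlock notification is lost.
dberr handle_wait_buggy(Trx& trx) {
  std::lock_guard<std::mutex> g(wait_mutex);
  return trx.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;
}

// The direction of the suggested fix: also consult error_state, which
// the resolver wrote under the same mutex.
dberr handle_wait_fixed(Trx& trx) {
  std::lock_guard<std::mutex> g(wait_mutex);
  if (trx.was_chosen_as_deadlock_victim || trx.error_state == DB_DEADLOCK)
    return DB_DEADLOCK;
  return DB_SUCCESS;
}
```

Running deadlock_resolver() to completion before the victim's check reproduces step 3's wrong DB_SUCCESS from the buggy variant, while the fixed variant returns DB_DEADLOCK.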

Comment by Vladislav Lesin [ 2022-08-11 ]

At last I developed a test case for it: MDEV-29081.test.

The following two new syncpoints are necessary for the test:

--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -1849,6 +1849,7 @@ dberr_t lock_wait(que_thr_t *thr)
   thd_wait_end(trx->mysql_thd);
 
   trx->error_state= error_state;
+  DEBUG_SYNC_C("lock_wait_suspend_thread_exit");
   return error_state;
 }
 
@@ -5815,6 +5816,7 @@ while holding a clustered index leaf page latch.
                      lock request was released */
 dberr_t lock_trx_handle_wait(trx_t *trx)
 {
+  DEBUG_SYNC_C("lock_trx_handle_wait_enter");
   if (trx->lock.was_chosen_as_deadlock_victim)
     return DB_DEADLOCK;
   if (!trx->lock.wait_lock)

Comment by Vladislav Lesin [ 2022-08-11 ]

The bug affects 10.6+. I am checking 10.5; it can take some time, as the related code is different.

Comment by Vladislav Lesin [ 2022-08-11 ]

I can't backport the test to 10.5 for two reasons:

1. trx->lock.was_chosen_as_deadlock_victim is not reset from the functions which cancel and release locks in 10.5; in 10.6 it is,

2. lock_trx_handle_wait() cancels the wait in 10.5, while in 10.6 it just checks the lock state, and the thread is suspended in row_search_mvcc() if the lock is still in the "waiting" state.

I think 10.5 is not affected because of (1).

UPD: I was wrong about (1). 10.5 resets trx->lock.was_chosen_as_deadlock_victim in lock_wait_release_thread_if_suspended(). So currently I am not sure whether 10.5 is affected or not. The current test will not work anyway due to (2), so 10.5 requires a separate test case.

Comment by Vladislav Lesin [ 2022-08-12 ]

There are two ways to fix it:

1. Reset trx->lock.was_chosen_as_deadlock_victim only on rollback (full or to a savepoint). In particular, don't reset it in lock_wait_end(). It can be set during deadlock checking by another transaction, but if it is set, the transaction should have some way to find out that it was chosen as a deadlock victim. We should be careful here, as trx->lock.was_chosen_as_deadlock_victim is used not only to signal a deadlock, but also to notify a transaction that it is the victim of a Galera abort (see trx_lock_t::set_wsrep_victim() for details).

2. There is also trx->error_state, which is set to DB_DEADLOCK. But it's not atomic and not protected. The corresponding comment says that "ONLY the thread doing the transaction is allowed to set this field". In 10.5, trx->error_state was set to DB_DEADLOCK only before trx->lock.was_chosen_as_deadlock_victim was reset. In 10.6 the latter became atomic (commit 3e45f8e36a80), and the corresponding bit is reset before trx->error_state=DB_DEADLOCK in lock_wait_end(), which can be executed by another thread. So trx->error_state can be changed by the thread executing some other transaction, and there is a window between trx->lock.was_chosen_as_deadlock_victim being reset and trx->error_state=DB_DEADLOCK being set. But I think this will not cause a race condition, because trx->error_state is set under lock_sys.wait_mutex in lock_wait_end(), and we could check trx->error_state in lock_trx_handle_wait() under lock_sys.wait_mutex too.

We could also check both trx->lock.was_chosen_as_deadlock_victim and trx->error_state, because a Galera transaction abort sets the second bit of trx->lock.was_chosen_as_deadlock_victim and sets trx->error_state = DB_DEADLOCK.

But the question is still open to me: why do we need to reset trx->lock.was_chosen_as_deadlock_victim anywhere except on rollback? marko, maybe you know?

Another concern is that we currently have two different ways to notify a transaction that it was chosen as a deadlock victim. This ambiguity is quite confusing, because it's not clear what to check in which case.
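A rough sketch of the combined check from option 2 (all names invented for illustration; per the comment above, in 10.6 the flag is an atomic value after commit 3e45f8e36a80, with a second bit used for the Galera abort in set_wsrep_victim()):

```cpp
#include <atomic>
#include <cstdint>

enum dberr { DB_SUCCESS, DB_DEADLOCK };

// Invented miniature of trx_lock_t: bit 0 marks a deadlock victim,
// bit 1 a Galera/wsrep abort victim, mirroring the layout described above.
struct TrxLock {
  std::atomic<std::uint8_t> was_chosen_as_deadlock_victim{0};
  void set_deadlock_victim() { was_chosen_as_deadlock_victim.fetch_or(1); }
  void set_wsrep_victim()    { was_chosen_as_deadlock_victim.fetch_or(2); }
  void clear()               { was_chosen_as_deadlock_victim.store(0); }
};

struct Trx {
  TrxLock lock;
  dberr error_state = DB_SUCCESS;  // in InnoDB, written under lock_sys.wait_mutex
};

// Combined check: report a deadlock if either notification channel fired --
// any victim bit, or the error_state written by lock_wait_end().
bool trx_is_victim(const Trx& trx) {
  return trx.lock.was_chosen_as_deadlock_victim.load() != 0 ||
         trx.error_state == DB_DEADLOCK;
}
```

With this shape, the window where lock_wait_end() has already cleared the victim bit but error_state is DB_DEADLOCK would still be detected.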

Comment by Marko Mäkelä [ 2022-08-12 ]

vlad.lesin, my intuition says that it should suffice to reset trx->lock.was_chosen_as_deadlock_victim at the end of trx_t::commit_in_memory(). A deadlock should be a hard error; the only way to "resume" the transaction is to re-execute the statements in a new transaction. An example of a soft error would be a lock wait timeout: there is an option to trigger a rollback on timeout, but the user could also keep executing further statements after one statement encountered a lock wait timeout and was rolled back.

In MDEV-24671 and MDEV-24738 I cleaned up the code quite a bit, but I can't remember why I did not remove trx_lock_t::clear_deadlock_victim(). Maybe I just wanted to limit the number of changes to reduce the risk of breaking things.

It now seems to me that the separate flag for the Galera set_wsrep_victim() is unnecessary if we just remove the trx_lock_t::clear_deadlock_victim() and calls to it. Then the trx_lock_t::was_chosen_as_deadlock_victim could be a simple atomic Boolean flag.

Comment by Matthias Leich [ 2022-08-19 ]

Intermediate result of testing:
origin/bb-10.6-MDEV-29081-deadlock-victim fd75a1b755d9216d731d524d28da072b8b27ebde 2022-08-18T16:11:03+03:00
performed well in the standard RQG test battery; no new bad effects.
But a slightly modified test battery found new crashes:
mysqld: storage/innobase/lock/lock0lock.cc:1493: dberr_t lock_rec_lock(bool, unsigned int, const buf_block_t*, ulint, dict_index_t*, que_thr_t*): Assertion `trx->is_wsrep() || !trx->lock.was_chosen_as_deadlock_victim' failed.
All tests hitting that assertion start the server with transaction-isolation=READ-COMMITTED or READ-UNCOMMITTED.
 
pluto:/data/results/1660910976/TBR-1585$ _RR_TRACE_DIR=./1/rr/ rr replay

Comment by Matthias Leich [ 2022-08-23 ]

origin/bb-10.6-MDEV-29081-deadlock-victim f473354c5ca617bfde7306d571052ac4aa9c5a86 2022-08-22T15:13:26+03:00
performed well in RQG testing. The bug mentioned in the previous comment is fixed.
Please be aware that the capabilities for checking the correctness of result sets or error messages are quite limited when using RQG.
The new experimental RQG validator "DmlStability" (TODO-3508) did not find problems.

Generated at Thu Feb 08 10:05:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.