MariaDB Server: MDEV-29081

trx_t::lock.was_chosen_as_deadlock_victim race in lock_wait_end()

Details

    Description

      I used my fuzzing tool to test MariaDB and found a transaction-related bug that can cause the server to abort.

      MariaDB installation:
      1) cd mariadb-10.8.3
      2) mkdir build; cd build
      3) cmake .. -DCMAKE_BUILD_TYPE=Debug
      4) make -j12 && sudo make install

      Test driver compilation:
      Note: "mariadb_reproduce" sets up several transactions and executes SQL statements according to /tmp/mysql_bk.sql, /tmp/stmts.sql and /tmp/tid.txt
      1) g++ -I/usr/local/mysql/include/ mariadb_reproduce.cc -o mariadb_reproduce -lmysqlclient -g

      Reproduce the bug:
      1) cp mysql_bk.sql /tmp; cp stmts.sql /tmp; cp tid.txt /tmp
      2) export ASAN_OPTIONS=detect_leaks=0
      3) /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql &
      4) bash reproduce.sh # "reproduce.sh" repeatedly executes "mariadb_reproduce" 100 times. "reproduce.sh" and "mariadb_reproduce" should be in the same directory

      I have simplified the content of fuzz.sql, and I hope it can help you reproduce and fix the bug. In addition, I attached the failure report (which includes the stack trace).

      This bug seems to be related to concurrency and cannot be reproduced reliably. However, it did occur and crashed the server in my tests. To trigger it more reliably, I use "reproduce.sh" to execute "mariadb_reproduce" 100 times. I hope you can reproduce the bug successfully.

      Attachments

        1. bug_report.txt
          6 kB
          Zuming Jiang
        2. mariadb_reproduce.cc
          14 kB
          Zuming Jiang
        3. MDEV-29081.test
          4 kB
          Vladislav Lesin
        4. mysql_bk.sql
          7 kB
          Zuming Jiang
        5. reproduce.sh
          0.1 kB
          Zuming Jiang
        6. stmts.sql
          18 kB
          Zuming Jiang
        7. tid.txt
          0.1 kB
          Zuming Jiang

        Issue Links

          Activity

            Zuming Jiang Zuming Jiang created issue -
            Zuming Jiang Zuming Jiang made changes -
            Field Original Value New Value
            Attachment bug_report.txt [ 64577 ]
            alice Alice Sherepa added a comment -

            from the log (bug_report.txt):

            mysqld: /home/mysql/mariadb/storage/innobase/lock/lock0lock.cc:5994: bool lock_trx_has_expl_x_lock(const trx_t&, const dict_table_t&,
            page_id_t, ulint): Assertion `lock_rec_has_expl(LOCK_X | 1024U, g.cell(), id, heap_no, &trx)' failed.
            220711 14:08:02 [ERROR] mysqld got signal 6 ;
             
            Server version: 10.8.3-MariaDB-debug
             
            mysys/stacktrace.c:212(my_print_stacktrace)[0x558f7165af7e]
            sql/signal_handler.cc:226(handle_fatal_signal)[0x558f70da8bd0]
            ??:0(__restore_rt)[0x7f91f7ba63c0]
            ??:0(gsignal)[0x7f91f768f03b]
            ??:0(abort)[0x7f91f766e859]
            /lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7f91f766e729]
            ??:0(__assert_fail)[0x7f91f7680006]
            lock/lock0lock.cc:5993(lock_trx_has_expl_x_lock(trx_t const&, dict_table_t const&, page_id_t, unsigned long))[0x558f7126d6ad]
            row/row0upd.cc:2661(row_upd_clust_step(upd_node_t*, que_thr_t*))[0x558f713c33d5]
            row/row0upd.cc:2790(row_upd(upd_node_t*, que_thr_t*))[0x558f713c3b10]
            row/row0upd.cc:2935(row_upd_step(que_thr_t*))[0x558f713c410a]
            row/row0mysql.cc:1702(row_update_for_mysql(row_prebuilt_t*))[0x558f7137028a]
            handler/ha_innodb.cc:8582(ha_innobase::update_row(unsigned char const*, unsigned char const*))[0x558f711aca56]
            sql/handler.cc:7602(handler::ha_update_row(unsigned char const*, unsigned char const*))[0x558f70dc2dc5]
            sql/sql_update.cc:1088(mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*))[0x558f70b3ccb7]
            sql/sql_parse.cc:4405(mysql_execute_command(THD*, bool))[0x558f709e954e]
            sql/sql_parse.cc:8027(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x558f709f57c2]
            sql/sql_parse.cc:1896(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x558f709e1cae]
            sql/sql_parse.cc:1407(do_command(THD*, bool))[0x558f709e068f]
            sql/sql_connect.cc:1418(do_handle_one_connection(CONNECT*, bool))[0x558f70bb9df1]
            sql/sql_connect.cc:1314(handle_one_connection)[0x558f70bb9a81]
            perfschema/pfs.cc:2203(pfs_spawn_thread)[0x558f710ca123]
            ??:0(start_thread)[0x7f91f7b9a609]
            ??:0(clone)[0x7f91f776b163]
             
            Trying to get some variables.
            Some pointers may be invalid and cause the dump to abort.
            Query (0x7f91ac0140d0): update t_khn17c set 
              wkey = 158, 
              c_ci0ked = PI(), 
              c_qdo5gb = t_khn17c.pkey
            where t_khn17c.pkey = t_khn17c.pkey
             
            Connection ID (thread ID): 377
            

            it seems to be the same bug as MDEV-26809

            alice Alice Sherepa made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s N/A [ 14700 ]
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Closed [ 6 ]

            marko Marko Mäkelä added a comment -

            I think that this is best kept open until a fix has been developed and tested.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ]
            Resolution Duplicate [ 3 ]
            Status Closed [ 6 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s N/A [ 14700 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            marko Marko Mäkelä added a comment -

            Because the trace in MDEV-26809 seems to have been invalidated by changes to shared libraries, I am unable to debug it.

            To be able to compile mariadb_reproduce.cc with the libmariadb-dev that is installed on my Debian system, I had to change the name of the first header file:

            #include <mariadb/mysql.h>
            

            After that, the executable can be built with GCC 12.1.0:

            g++ mariadb_reproduce.cc -o mariadb_reproduce -lmariadb
            

            I also had to replace the hard-coded path /usr/local/mysql/bin/mysql with the path of a mariadb client executable, and I had to create a symlink from /run/mysqld/mysqld.sock to the socket that my server is using.
            With those changes, on my first try of

            while ./mariadb_reproduce; do :; done
            

            I was able to reproduce the crash in less than 100 attempts. The crash that I got was:

            10.6 772e3f61eb2956143bbdb1d85490bd5aad73d43c

            Version: '10.6.9-MariaDB-debug'  socket: '/dev/shm/mariadb.sock'  port: 3306  Source distribution
            mariadbd: /mariadb/10.6/storage/innobase/lock/lock0lock.cc:6023: bool lock_trx_has_expl_x_lock(const trx_t&, const dict_table_t&, page_id_t, ulint): Assertion `lock_rec_has_expl(LOCK_X | 1024U, g.cell(), id, heap_no, &trx)' failed.
            

            My subsequent attempts to reproduce this crash failed. I will keep trying, also by running the server under https://rr-project.org so that I will get something debuggable.

            marko Marko Mäkelä added a comment -

            After I rebuilt the server executable using cmake -DPLUGIN_PERFSCHEMA=NO so that rr record would have to deal with fewer conditional branches, the test program seems to have become much slower, as if it were waiting for innodb_lock_wait_timeout very often. After I restarted the server with innodb_lock_wait_timeout=0, the test program became fast again, but failed to reproduce anything.

            There does not seem to be much hope to get an rr replay trace from this test case. I will try to produce a core dump again and then try to figure it out from there.

            marko Marko Mäkelä added a comment -

            I was able to reproduce this a few more times while running the server inside GDB. I had to recompile the server with GCC 11, because the type information emitted by GCC 12.1.0 would often cause GDB 12.1 to crash.

            Often, depending on the server build options, the test program would end up frequently waiting for innodb_lock_wait_timeout. In such cases, I restarted the server from scratch to try again. Altogether, I was only able to reproduce the failure a few times over several hours. For the last-but-one crash, the while loop around the script had been running for about 7 minutes. The very first and the last crash that I checked were instantaneous.

            I only debugged the last crash. The assertion in row_upd_clust_step() that invoked lock_trx_has_expl_x_lock() would pass heap_no=49 to that call, corresponding to the record pkey=153000 in the clustered index, which was a single page. The record is exclusively locked by another transaction, which at the time of the assertion failure is waiting in lock_wait() for a lock on heap_no=2 (pkey=28000), on which the crashing thread holds an exclusive lock.

            I think that this must be reproduced under rr record so that we can determine what is actually going on. While I was typing this, ramesh wrote that he just succeeded at that. I will debug that trace shortly.

            marko Marko Mäkelä added a comment -

            This appears to be a race condition that involves 2 threads that are concurrently blocked in lock_wait() while a 3rd thread is creating an explicit record lock for a record that is implicitly locked by one of the lock_wait() threads. The rr replay trace that I got is very similar to the GDB session that I analyzed.

            I debugged it this far:

            ssh galapq
            rr replay /test/mtest/10.10/mariadb-10.10.0-linux-x86_64/rr/latest-trace
            continue
            set $l=(ib_lock_t *) 0x1538dfb8be00
            watch -l *(size_t*)&$l.hash.hash[1]
            reverse-continue

            The record in question is heap_no=50 (pkey=154000,DB_TRX_ID=0x188a). In my GDB session, it was heap_no=49 and pkey=153000. In the rr replay session, this transaction was associated with Thread 10, which is one of the threads that was blocked in lock_wait() when the implicit-to-explicit conversion occurred.

            I think that we must check whether those 2 lock_wait() calls are waiting for a lock on the same record, and what exactly happens to the record between that time and the crash.

            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Lesin [ vlad.lesin ]
            vlad.lesin Vladislav Lesin added a comment - - edited

            The scenario is the following:

            1. Trx 6287 does a semi-consistent read for UPDATE. See the sel_set_rec_lock() call from row_search_mvcc() just after the no_gap_lock label. Then sel_set_rec_lock() returns DB_LOCK_WAIT, and row_search_mvcc() reads offsets for the committed version of the record in the row_sel_build_committed_vers_for_mysql() call. Then it checks what is going on with the lock it set on the record via the lock_trx_handle_wait() call, which, in turn, waits for lock_sys.wait_mutex.

            2. Trx 6290 does a read for INSERT. It has just been woken up after waiting for some lock to be released in lock_wait() and holds lock_sys.wait_mutex, which trx 6287 is waiting for. It does deadlock checking, chooses trx 6287 as a victim, and then cancels the wait on the lock created in step 1:
            2a. It sets victim->lock.was_chosen_as_deadlock_victim= true in Deadlock::report(),
            2b. It dequeues the lock from the page, resets the wait state, the transaction's waiting lock, etc., and then invokes lock_wait_end(); see lock_cancel_waiting_and_release() for details,
            2c. lock_wait_end() resets trx->lock.was_chosen_as_deadlock_victim and sets trx->error_state= DB_DEADLOCK,
            2d. lock_sys.wait_mutex is released.

            3. Trx 6287 acquires lock_sys.wait_mutex in lock_trx_handle_wait(), checks trx->lock.was_chosen_as_deadlock_victim, and, as it is not set, returns DB_SUCCESS to row_search_mvcc() instead of DB_DEADLOCK. row_search_mvcc() continues execution as if the lock had been granted.

            4. Trx 6287 aborts execution in row_upd_clust_step(), because the X-lock on the record was not set in row_search_mvcc() in step 3.

            Suggested fix:
            If trx->lock.was_chosen_as_deadlock_victim is reset in lock_wait_end(), then, maybe, lock_trx_handle_wait() should check trx->error_state instead of trx->lock.was_chosen_as_deadlock_victim to decide whether to return DB_DEADLOCK?
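
            To make the interleaving easier to follow, here is a minimal stand-alone C++ sketch of the race described in steps 1-4. The names and types (trx_sketch, wait_mutex, dberr_sketch, cancel_victim, handle_wait) are simplified stand-ins chosen for illustration only; they are not the actual InnoDB definitions.

            // Simplified sketch of the race, not the real InnoDB code.
            #include <atomic>
            #include <cstdio>
            #include <mutex>

            enum dberr_sketch { DB_SUCCESS, DB_DEADLOCK };

            struct trx_sketch {
              std::atomic<bool> was_chosen_as_deadlock_victim{false};
              const void *wait_lock= nullptr;        // the waiting lock created in step 1
              dberr_sketch error_state= DB_SUCCESS;
            };

            std::mutex wait_mutex;                   // stands in for lock_sys.wait_mutex

            // What trx 6290 effectively does in steps 2a-2c while holding wait_mutex.
            void cancel_victim(trx_sketch &victim)
            {
              std::lock_guard<std::mutex> g(wait_mutex);
              victim.was_chosen_as_deadlock_victim= true;   // 2a: Deadlock::report()
              victim.wait_lock= nullptr;                    // 2b: waiting lock dequeued
              victim.was_chosen_as_deadlock_victim= false;  // 2c: reset in lock_wait_end()
              victim.error_state= DB_DEADLOCK;              // 2c: error recorded instead
            }

            // What trx 6287 does in step 3: by now the flag is gone, so it sees success.
            dberr_sketch handle_wait(trx_sketch &trx)
            {
              std::lock_guard<std::mutex> g(wait_mutex);
              if (trx.was_chosen_as_deadlock_victim)
                return DB_DEADLOCK;                         // never taken in this scenario
              if (!trx.wait_lock)
                return DB_SUCCESS;                          // the cancelled wait looks "granted"
              return DB_SUCCESS;                            // real code would handle a pending wait here
            }

            int main()
            {
              trx_sketch trx6287;
              trx6287.wait_lock= &trx6287;                  // step 1: a waiting lock exists
              cancel_victim(trx6287);                       // trx 6290 runs first
              // step 3: trx 6287 sees DB_SUCCESS although it was chosen as the victim
              std::printf("handle_wait() = %d (0 = DB_SUCCESS)\n", handle_wait(trx6287));
              return 0;
            }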

            vlad.lesin Vladislav Lesin made changes -
            Attachment MDEV-29081.test [ 65015 ]
            vlad.lesin Vladislav Lesin added a comment - - edited

            At last I have developed a test case for it: MDEV-29081.test.

            The following two new syncpoints are necessary for the test:

            --- a/storage/innobase/lock/lock0lock.cc
            +++ b/storage/innobase/lock/lock0lock.cc
            @@ -1849,6 +1849,7 @@ dberr_t lock_wait(que_thr_t *thr)
               thd_wait_end(trx->mysql_thd);
             
               trx->error_state= error_state;
            +  DEBUG_SYNC_C("lock_wait_suspend_thread_exit");
               return error_state;
             }
             
            @@ -5815,6 +5816,7 @@ while holding a clustered index leaf page latch.
                                  lock request was released */
             dberr_t lock_trx_handle_wait(trx_t *trx)
             {
            +  DEBUG_SYNC_C("lock_trx_handle_wait_enter");
               if (trx->lock.was_chosen_as_deadlock_victim)
                 return DB_DEADLOCK;
               if (!trx->lock.wait_lock)
            

            vlad.lesin Vladislav Lesin made changes -
            Attachment MDEV-29081.test [ 65015 ]
            vlad.lesin Vladislav Lesin made changes -
            Attachment MDEV-29081.test [ 65016 ]
            vlad.lesin Vladislav Lesin made changes -
            Attachment MDEV-29081.test [ 65016 ]
            vlad.lesin Vladislav Lesin made changes -
            Attachment MDEV-29081.test [ 65026 ]

            vlad.lesin Vladislav Lesin added a comment -

            The bug affects 10.6+. I am checking 10.5; it may take some time, as the related code is different.
            vlad.lesin Vladislav Lesin added a comment - - edited

            I can't backport the test to 10.5 for two reasons:

            1. trx->lock.was_chosen_as_deadlock_victim is not reset from the functions which cancel and release locks in 10.5; in 10.6 it is,

            2. lock_trx_handle_wait() cancels waiting in 10.5, while in 10.6 it just checks lock state, and the thread will be suspended in row_search_mvcc() if the lock is still in "waiting" state.

            I think 10.5 is not affected because of (1).

            UPD: I was wrong about (1). 10.5 resets trx->lock.was_chosen_as_deadlock_victim in lock_wait_release_thread_if_suspended(). So currently I am not sure whether 10.5 is affected or not. The current test will not work anyway due to (2), so 10.5 requires a separate test case.

            vlad.lesin Vladislav Lesin added a comment -

            There are two ways to fix it:

            1. Reset trx->lock.was_chosen_as_deadlock_victim only on rollback (full or to a savepoint). In particular, don't reset it in lock_wait_end(). It can be set during deadlock checking by another transaction, but if it is set, the transaction should have some way to check that it was chosen as a deadlock victim. We should be careful here, as trx->lock.was_chosen_as_deadlock_victim is used not only to notify about a deadlock, but also to notify a transaction that it is a victim of a Galera abort (see trx_lock_t::set_wsrep_victim() for details).

            2. There is also trx->error_state, which is set to DB_DEADLOCK. But it's not atomic and not protected. The corresponding comment says that "ONLY the thread doing the transaction is allowed to set this field". In 10.5 trx->error_state was set to DB_DEADLOCK only before trx->lock.was_chosen_as_deadlock_victim was reset. In 10.6 the latter became atomic (commit 3e45f8e36a80), and the corresponding bit is reset before trx->error_state=DB_DEADLOCK is set in lock_wait_end(), which can be executed by another thread. So trx->error_state can be changed by the thread executing some other transaction. And there is a window between resetting trx->lock.was_chosen_as_deadlock_victim and setting trx->error_state=DB_DEADLOCK. But, anyway, I think this will not cause a race condition, because trx->error_state is set under the lock_sys.wait_mutex lock in lock_wait_end(), and we could check trx->error_state in lock_trx_handle_wait() under the lock_sys.wait_mutex lock too.

            We could also check both trx->lock.was_chosen_as_deadlock_victim and trx->error_state, because a Galera transaction abort sets the second bit of trx->lock.was_chosen_as_deadlock_victim and also sets trx->error_state = DB_DEADLOCK.

            But the question is still open to me: why do we need to reset trx->lock.was_chosen_as_deadlock_victim anywhere other than on rollback? marko, maybe you know?

            Another concern is that we currently have two different ways to notify a transaction that it was chosen as a deadlock victim. This ambiguity is quite confusing, because it is not clear what to check in which cases.
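
            A hedged sketch of the second option, reusing the simplified stand-in types from the sketch in the scenario comment above (illustrative only, not the actual patch): consult trx->error_state as well, under the same mutex that lock_wait_end() holds when it writes it.

            // Check error_state under wait_mutex in addition to the flag.
            dberr_sketch handle_wait_option2(trx_sketch &trx)
            {
              std::lock_guard<std::mutex> g(wait_mutex);
              if (trx.was_chosen_as_deadlock_victim)   // covers "flag set, not yet reset"
                return DB_DEADLOCK;
              if (trx.error_state == DB_DEADLOCK)      // covers "flag reset, error recorded" (step 2c)
                return DB_DEADLOCK;
              if (!trx.wait_lock)
                return DB_SUCCESS;
              return DB_SUCCESS;                       // real code would handle a still-pending wait here
            }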

            marko Marko Mäkelä added a comment -

            vlad.lesin, my intuition says that it should suffice to reset trx->lock.was_chosen_as_deadlock_victim at the end of trx_t::commit_in_memory(). A deadlock should be a hard error; the only way to "resume" the transaction is to re-execute the statements in a new transaction. An example of a soft error would be a lock wait timeout. There is an option to trigger a roll back on timeout, but the user could also keep executing further statements after one statement encountered a lock wait timeout and was rolled back.

            In MDEV-24671 and MDEV-24738 I cleaned up the code quite a bit, but I can’t remember why I did not remove the trx_lock_t::clear_deadlock_victim(). Maybe I just wanted to limit the amount of changes to reduce the risk of breaking things.

            It now seems to me that the separate flag for the Galera set_wsrep_victim() is unnecessary if we just remove the trx_lock_t::clear_deadlock_victim() and calls to it. Then the trx_lock_t::was_chosen_as_deadlock_victim could be a simple atomic Boolean flag.
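
            A hedged sketch of that direction, with purely illustrative names (trx_lock_sketch, commit_in_memory_sketch): the flag becomes a plain atomic Boolean that is only cleared when the transaction finishes, so a concurrent lock_wait_end() can no longer hide the deadlock from the victim.

            #include <atomic>

            struct trx_lock_sketch {
              std::atomic<bool> was_chosen_as_deadlock_victim{false};
            };

            // Cleared only here, at the end of commit/rollback, never from lock_wait_end().
            void commit_in_memory_sketch(trx_lock_sketch &lock)
            {
              // ... locks released, transaction state cleaned up ...
              lock.was_chosen_as_deadlock_victim.store(false, std::memory_order_relaxed);
            }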

            vlad.lesin Vladislav Lesin made changes -
            Summary Transaction-related assertions in lock0lock.cc:5994 trx_t::lock.was_chosen_as_deadlock_victim race
            vlad.lesin Vladislav Lesin made changes -
            Summary trx_t::lock.was_chosen_as_deadlock_victim race trx_t::lock.was_chosen_as_deadlock_victim race in lock_wait_end()
            marko Marko Mäkelä made changes -
            mleich Matthias Leich added a comment - - edited

            Intermediate result of testing:
            origin/bb-10.6-MDEV-29081-deadlock-victim fd75a1b755d9216d731d524d28da072b8b27ebde 2022-08-18T16:11:03+03:00
            performed well in the RQG standard test battery. No new bad effects.
            But a slightly modified test battery seems to find new crashes.
            mysqld: storage/innobase/lock/lock0lock.cc:1493: dberr_t lock_rec_lock(bool, unsigned int, const buf_block_t*, ulint, dict_index_t*, que_thr_t*): Assertion `trx->is_wsrep() || !trx->lock.was_chosen_as_deadlock_victim' failed.
            All tests hitting that assert start the server with transaction-isolation=READ-COMMITTED or READ-UNCOMMITTED.
             
            pluto:/data/results/1660910976/TBR-1585$ _RR_TRACE_DIR=./1/rr/ rr replay
            

            mleich Matthias Leich added a comment -

            origin/bb-10.6-MDEV-29081-deadlock-victim f473354c5ca617bfde7306d571052ac4aa9c5a86 2022-08-22T15:13:26+03:00
            performed well in RQG testing. The bug mentioned in the previous comment is fixed.
            Please be aware that the capabilities to check the correctness of result sets or error messages are quite limited when using RQG.
            The new experimental RQG validator "DmlStability" (TODO-3508) did not find problems.

            vlad.lesin Vladislav Lesin made changes -
            Component/s Storage Engine - InnoDB [ 10129 ]
            Fix Version/s 10.6.10 [ 28407 ]
            Fix Version/s 10.7.6 [ 28408 ]
            Fix Version/s 10.8.5 [ 28308 ]
            Fix Version/s 10.9.3 [ 28409 ]
            Fix Version/s 10.10.2 [ 28410 ]
            Fix Version/s 10.11.0 [ 28411 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            People

              Assignee: Vladislav Lesin (vlad.lesin)
              Reporter: Zuming Jiang
              Votes: 0
              Watchers: 7

