[MDEV-23536] Race condition between KILL and transaction commit Created: 2020-08-22  Updated: 2022-09-28  Resolved: 2021-01-08

Status: Closed
Project: MariaDB Server
Component/s: Locking, Server
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9, 10.6.0

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: ASAN, race

Issue Links:
Blocks
blocks MDEV-23328 Server hang due to Galera lock confli... Closed
Relates
relates to MDEV-24624 main.lock_tables_lost_commit failed i... Closed
relates to MDEV-26014 Race condition between KILL and clien... Closed
relates to MDEV-17092 ERROR: AddressSanitizer: use-after-po... Closed
relates to MDEV-22782 SUMMARY: AddressSanitizer: unknown-cr... Closed
relates to MDEV-23328 Server hang due to Galera lock confli... Closed
relates to MDEV-24561 rpl.rpl_semi_sync_uninstall_plugin fa... Closed
relates to MDEV-29368 Assertion `trx->mysql_thd == thd' fai... Closed

 Description   

A race condition may occur between the execution of transaction commit, and an execution of a KILL statement that would attempt to abort that transaction.

MDEV-17092 worked around this race condition by modifying InnoDB code. After that issue was closed, svoj pointed out that this race condition would better be fixed above the storage engine layer:

If you look carefully into the above, you can conclude that thd->free_connection() can be called concurrently with KILL/thd->awake(). Which is the bug. And it is partially fixed in THD::~THD(), that is destructor waits for KILL completion:

He is quoting this code of THD::~THD() in 10.5:

  /*
    Other threads may have a lock on LOCK_thd_kill to ensure that this
    THD is not deleted while they access it. The following mutex_lock
    ensures that no one else is using this THD and it's now safe to delete
  */
  if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data);
  mysql_mutex_lock(&LOCK_thd_kill);
  mysql_mutex_unlock(&LOCK_thd_kill);
  if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data);
 
  if (!free_connection_done)
    free_connection();

And he seems to suggest that the empty critical section should be moved to THD::free_connection(). Note: in 10.2 and 10.3, that code is slightly different:

  mysql_mutex_lock(&LOCK_thd_kill);
  mysql_mutex_unlock(&LOCK_thd_kill);
 
#ifdef WITH_WSREP
  delete wsrep_rgi;
#endif
  if (!free_connection_done)
    free_connection();

Nevertheless, it seems that we might want to do something like

#if MYSQL_VERSION_ID >= 100400
  if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data); 
#endif
  if (!free_connection_done)
    free_connection();
  else
  {
    mysql_mutex_lock(&LOCK_thd_kill);
    mysql_mutex_unlock(&LOCK_thd_kill);
  }
#if MYSQL_VERSION_ID >= 100400
  if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data);
#elif defined WITH_WSREP
  delete wsrep_rgi;
#endif

It might turn out that the else branch is not needed. The empty lock/unlock pair would of course be added to THD::free_connection().

It might also turn out that all the Galera-specific changes need to be done in THD::free_connection(). (In that case, we would likely want to assign wsrep_rgi= NULL).

As part of this fix, the trx_t::free() instrumentation that was modified in MDEV-22782 should be tightened: trx_t::mysql_thd and trx_t::state must be poisoned, because innobase_kill_connection() should no longer be invoked on a freed transaction of a freed connection. This should of course be validated with an RQG run similar to the one that reproduced MDEV-17092.



 Comments   
Comment by Jan Lindström (Inactive) [ 2020-12-18 ]

https://github.com/MariaDB/server/commit/889f90365ef419f73e7b2a2f4d5ce18445d41abc

Testing:

  • Used grammar provided on MDEV-17092
  • Used machine on Sofia: galapq
  • Used additional instrumentation that was used on MDEV-17092

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index 25cbc63ac64..7e67c91b6a4 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -5176,6 +5176,7 @@ static void innobase_kill_query(handlerton*, THD* thd, enum thd_kill_levels)
 
   if (trx_t* trx= thd_to_trx(thd))
   {
+    os_thread_sleep(1000);
     ut_ad(trx->mysql_thd == thd && trx->state == TRX_STATE_ACTIVE);
     lock_mutex_enter();
     trx_mutex_enter(trx);

Comment by Marko Mäkelä [ 2020-12-18 ]

The latest revision looks rather acceptable to me. But, the changes to THD should be reviewed by someone else. Please also port this to 10.3 and 10.6 for testing. I think that this needs multiple hours of RQG+ASAN testing in each branch.

Comment by Marko Mäkelä [ 2020-12-19 ]

jplindst, thank you, those look reasonable to me, but I am not eligible to review the THD changes.

I think that mleich should test the above branches with RQG and ASAN, and maybe someone else should perform additional testing with Galera enabled. I think that the stress testing should be primarily conducted without rr record, with a few rr record --chaos runs thrown in for extra double-checking.

Comment by Jan Lindström (Inactive) [ 2020-12-21 ]
Comment by Matthias Leich [ 2020-12-21 ]

One run with debug+asan+rr
(rr) bt
#0  0x0000000070000002 in ?? ()
#1  0x00007f5ff89153b6 in _raw_syscall () at /build/rr-S0CLEN/rr-5.3.0/src/preload/raw_syscall.S:120
#2  0x00007f5ff89112ff in traced_raw_syscall (call=call@entry=0x7f5ff1149fa0) at ./src/preload/syscallbuf.c:229
#3  0x00007f5ff8912978 in sys_fcntl (call=<optimized out>) at ./src/preload/syscallbuf.c:1291
#4  syscall_hook_internal (call=0x7f5ff1149fa0) at ./src/preload/syscallbuf.c:2855
#5  syscall_hook (call=0x7f5ff1149fa0) at ./src/preload/syscallbuf.c:2987
#6  0x00007f5ff89111da in _syscall_hook_trampoline () at /build/rr-S0CLEN/rr-5.3.0/src/preload/syscall_hook.S:282
#7  0x00007f5ff891120a in __morestack () at /build/rr-S0CLEN/rr-5.3.0/src/preload/syscall_hook.S:417
#8  0x00007f5ff8911263 in _syscall_hook_trampoline_89_c2_f7_da () at /build/rr-S0CLEN/rr-5.3.0/src/preload/syscall_hook.S:463
#9  0x000032eb6e832f0c in __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#10 0x000055ff0b1b22da in my_write_core (sig=6) at /Server/bb-10.2-MDEV-23536/mysys/stacktrace.c:382
#11 0x000055ff09f10759 in handle_fatal_signal (sig=6) at /Server/bb-10.2-MDEV-23536/sql/signal_handler.cc:343
#12 <signal handler called>
#13 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#14 0x00001bd5462ad859 in __GI_abort () at abort.c:79
#15 0x000020f94143d6a2 in ?? () from /lib/x86_64-linux-gnu/libasan.so.5
#16 0x000020f94144824c in ?? () from /lib/x86_64-linux-gnu/libasan.so.5
#17 0x000020f9414298ec in ?? () from /lib/x86_64-linux-gnu/libasan.so.5
#18 0x000020f941429363 in ?? () from /lib/x86_64-linux-gnu/libasan.so.5
#19 0x000020f94142a1ab in __asan_report_load8 () from /lib/x86_64-linux-gnu/libasan.so.5
#20 0x000055ff0a43a5c7 in innobase_kill_query (thd=0x62a000240208) at /Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180
#21 0x000055ff09f185f9 in kill_handlerton (thd=0x62a000240208, plugin=0x602000119398, level=0x11092f893620) at /Server/bb-10.2-MDEV-23536/sql/handler.cc:843
#22 0x000055ff0991a175 in plugin_foreach_with_mask (thd=0x62a000240208, func=0x55ff09f184c3 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x11092f893620) at /Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
#23 0x000055ff09f186fc in ha_kill_query (thd=0x62a000240208, level=THD_ABORT_ASAP) at /Server/bb-10.2-MDEV-23536/sql/handler.cc:850
#24 0x000055ff09817ea8 in THD::awake (this=0x62a000240208, state_to_set=KILL_CONNECTION_HARD) at /Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1752
#25 0x000055ff098fc888 in kill_one_thread (thd=0x62a00020a208, id=839, kill_signal=KILL_CONNECTION_HARD, type=KILL_TYPE_ID) at /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:8928
#26 0x000055ff098fd391 in sql_kill (thd=0x62a00020a208, id=839, state=KILL_CONNECTION_HARD, type=KILL_TYPE_ID) at /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:9034
#27 0x000055ff098e4ba7 in mysql_execute_command (thd=0x62a00020a208) at /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:5245
#28 0x000055ff098f5514 in mysql_parse (thd=0x62a00020a208, rawbuf=0x62b0000d9228 "KILL CONNECTION @kill_id /* E_R Thread8 QNO 940 CON_ID 851 */", length=61, parser_state=0x11092f894ce0, is_com_multi=false, is_next_command=false)
    at /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:7762
#29 0x000055ff098ce64d in dispatch_command (command=COM_QUERY, thd=0x62a00020a208, packet=0x62900136f209 " KILL CONNECTION @kill_id /* E_R Thread8 QNO 940 CON_ID 851 */ ", packet_length=63, is_com_multi=false, is_next_command=false)
    at /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1827
#30 0x000055ff098cb40c in do_command (thd=0x62a00020a208) at /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1381
#31 0x000055ff09c52d67 in do_handle_one_connection (connect=0x6080000010a8) at /Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1336
#32 0x000055ff09c5262a in handle_one_connection (arg=0x6080000010a8) at /Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1241
#33 0x000032eb6e82a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#34 0x00001bd5463aa293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr)
 
[rr 1822861 504429][rr 1822861 504433]==1822861==ERROR: AddressSanitizer: use-after-poison on address 0x6cb256bc5f68 at pc 0x55ff0a43a5c7 bp 0x11092f8932d0 sp 0x11092f8932c0
[rr 1822861 504436][rr 1822861 504438]READ of size 8 at 0x6cb256bc5f68 thread T35
[rr 1822861 504545]2020-12-21  2:47:14 46785501316864 [Warning] Aborted connection 867 to db: 'test' user: 'root' host: 'localhost' (CLOSE_CONNECTION)
[rr 1822861 504693]2020-12-21  2:47:14 31681701005056 [Warning] Aborted connection 868 to db: 'test' user: 'root' host: 'localhost' (CLOSE_CONNECTION)
[rr 1822861 519481]    #0 0x55ff0a43a5c6 in innobase_kill_query /Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180
[rr 1822861 520055]    #1 0x55ff09f185f8 in kill_handlerton /Server/bb-10.2-MDEV-23536/sql/handler.cc:843
[rr 1822861 520298]    #2 0x55ff0991a174 in plugin_foreach_with_mask(THD*, char (*)(THD*, st_plugin_int**, void*), int, unsigned int, void*) /Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
[rr 1822861 520300]    #3 0x55ff09f186fb in ha_kill_query(THD*, thd_kill_levels) /Server/bb-10.2-MDEV-23536/sql/handler.cc:850
[rr 1822861 520906]    #4 0x55ff09817ea7 in THD::awake(killed_state) /Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1752
[rr 1822861 521488]    #5 0x55ff098fc887 in kill_one_thread(THD*, long long, killed_state, killed_type) /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:8928
[rr 1822861 521497]    #6 0x55ff098fd390 in sql_kill /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:9034
[rr 1822861 521499]    #7 0x55ff098e4ba6 in mysql_execute_command(THD*) /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:5245
[rr 1822861 521520]    #8 0x55ff098f5513 in mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool) /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:7762
[rr 1822861 521530]    #9 0x55ff098ce64c in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool) /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1827
[rr 1822861 521532]    #10 0x55ff098cb40b in do_command(THD*) /Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1381
[rr 1822861 521960]    #11 0x55ff09c52d66 in do_handle_one_connection(CONNECT*) /Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1336
[rr 1822861 521962]    #12 0x55ff09c52629 in handle_one_connection /Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1241
[rr 1822861 522006]    #13 0x32eb6e82a608 in start_thread /build/glibc-ZN95T4/glibc-2.31/nptl/pthread_create.c:477
[rr 1822861 522008]    #14 0x1bd5463aa292 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x122292)
[rr 1822861 522010]
[rr 1822861 522020]0x6cb256bc5f68 is located 38760 bytes inside of 4194304-byte region [0x6cb256bbc800,0x6cb256fbc800)
[rr 1822861 522026]allocated by thread T0 here:
[rr 1822861 522085]    #0 0x20f94141fdc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
[rr 1822861 522215]    #1 0x55ff0a8c2589 in Pool<trx_t, TrxFactory, TrxPoolLock>::Pool(unsigned long) /Server/bb-10.2-MDEV-23536/storage/innobase/include/ut0pool.h:66
[rr 1822861 522217]    #2 0x55ff0a8be939 in PoolManager<Pool<trx_t, TrxFactory, TrxPoolLock>, TrxPoolManagerLock>::add_pool(unsigned long) /Server/bb-10.2-MDEV-23536/storage/innobase/include/ut0pool.h:299
[rr 1822861 522219]    #3 0x55ff0a8be118 in PoolManager<Pool<trx_t, TrxFactory, TrxPoolLock>, TrxPoolManagerLock>::create() /Server/bb-10.2-MDEV-23536/storage/innobase/include/ut0pool.h:327
[rr 1822861 522261]    #4 0x55ff0a8bc012 in PoolManager<Pool<trx_t, TrxFactory, TrxPoolLock>, TrxPoolManagerLock>::PoolManager(unsigned long) /Server/bb-10.2-MDEV-23536/storage/innobase/include/ut0pool.h:211
[rr 1822861 522263]    #5 0x55ff0a8a73cf in trx_pool_init() /Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:303
[rr 1822861 522708]    #6 0x55ff0a7eebf2 in srv_boot() /Server/bb-10.2-MDEV-23536/storage/innobase/srv/srv0srv.cc:1172
[rr 1822861 523163]    #7 0x55ff0a80640a in innobase_start_or_create_for_mysql() /Server/bb-10.2-MDEV-23536/storage/innobase/srv/srv0start.cc:1693
[rr 1822861 523165]    #8 0x55ff0a43699a in innobase_init /Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:4290
[rr 1822861 523167]    #9 0x55ff09f16ac4 in ha_initialize_handlerton(st_plugin_int*) /Server/bb-10.2-MDEV-23536/sql/handler.cc:555
[rr 1822861 523176]    #10 0x55ff099126ee in plugin_initialize /Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:1417
[rr 1822861 523178]    #11 0x55ff09914425 in plugin_init(int*, char**, int) /Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:1698
[rr 1822861 523653]    #12 0x55ff096847bc in init_server_components /Server/bb-10.2-MDEV-23536/sql/mysqld.cc:5392
[rr 1822861 523655]    #13 0x55ff096867e5 in mysqld_main(int, char**) /Server/bb-10.2-MDEV-23536/sql/mysqld.cc:5990
[rr 1822861 523657]    #14 0x55ff0966ec4c in main /Server/bb-10.2-MDEV-23536/sql/main.cc:25
[rr 1822861 523659]    #15 0x1bd5462af0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
[rr 1822861 523661]
[rr 1822861 523669]Thread T35 created by T0 here:
[rr 1822861 523733]    #0 0x20f94134c805 in pthread_create (/lib/x86_64-linux-gnu/libasan.so.5+0x3a805)
[rr 1822861 523735]    #1 0x55ff0b201c8a in spawn_thread_noop /Server/bb-10.2-MDEV-23536/mysys/psi_noop.c:187
[rr 1822861 523737]    #2 0x55ff09670393 in inline_mysql_thread_create /Server/bb-10.2-MDEV-23536/include/mysql/psi/mysql_thread.h:1246
[rr 1822861 523739]    #3 0x55ff09687e59 in create_thread_to_handle_connection(CONNECT*) /Server/bb-10.2-MDEV-23536/sql/mysqld.cc:6567
[rr 1822861 523822]    #4 0x55ff096885f4 in create_new_thread /Server/bb-10.2-MDEV-23536/sql/mysqld.cc:6637
[rr 1822861 523824]    #5 0x55ff09689786 in handle_connections_sockets() /Server/bb-10.2-MDEV-23536/sql/mysqld.cc:6895
[rr 1822861 523826]    #6 0x55ff096871af in mysqld_main(int, char**) /Server/bb-10.2-MDEV-23536/sql/mysqld.cc:6186
[rr 1822861 523828]    #7 0x55ff0966ec4c in main /Server/bb-10.2-MDEV-23536/sql/main.cc:25
[rr 1822861 523883]    #8 0x1bd5462af0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
[rr 1822861 523885]
[rr 1822861 523897]SUMMARY: AddressSanitizer: use-after-poison /Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180 in innobase_kill_query
[rr 1822861 524331]Shadow bytes around the buggy address:
  0x0d96cad70b90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70ba0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70bb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70bc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70bd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0d96cad70be0: 00 00 00 00 00 00 00 00 00 00 00 00 00[00]00 00
  0x0d96cad70bf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70c20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0d96cad70c30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
 
sdp:/RQG2/storage/1608547588/REPLAY1/dev/shm/vardir/1608547588/12/1/rr
_RR_TRACE_DIR="." rr replay --mark-stdio
 
Also without "rr" a replay goes fast.

Comment by Matthias Leich [ 2020-12-21 ]

A replay of the
SUMMARY: AddressSanitizer: use-after-poison /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180 in innobase_kill_query
rr:/home/mleich/RQG/storage/1608557943/VARIANT1/dev/shm/vardir/1608557943/15/1/rr
_RR_TRACE_DIR="." rr replay --mark-stdio

Comment by Jan Lindström (Inactive) [ 2020-12-28 ]

If I correctly understand the rr replay race happens between these two threads:

Firstly, there is KILL CONNECTION thread:

#18 0x000055cc62501de9 in innobase_kill_query (thd=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180
#19 0x000055cc61fe444b in kill_handlerton (thd=0x62a0000ba208, plugin=0x602000060b58, level=0x6d0f794df5a0)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:843
#20 0x000055cc619eaae9 in plugin_foreach_with_mask (thd=0x62a0000ba208, 
    func=0x55cc61fe4315 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x6d0f794df5a0)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
#21 0x000055cc61fe454e in ha_kill_query (thd=0x62a0000ba208, level=THD_ABORT_ASAP)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:850
#22 0x000055cc618e9468 in THD::awake (this=0x62a0000ba208, state_to_set=KILL_CONNECTION_HARD)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1752
#23 0x000055cc619cd3e4 in kill_one_thread (thd=0x62a00027c208, id=578, kill_signal=KILL_CONNECTION_HARD, 
    type=KILL_TYPE_ID) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:8928
--Type <RET> for more, q to quit, c to continue without paging--
#24 0x000055cc619cded8 in sql_kill (thd=0x62a00027c208, id=578, state=KILL_CONNECTION_HARD, type=KILL_TYPE_ID)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:9034
#25 0x000055cc619b57fb in mysql_execute_command (thd=0x62a00027c208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:5245
#26 0x000055cc619c60ea in mysql_parse (thd=0x62a00027c208, 
    rawbuf=0x62b00016c228 "KILL CONNECTION @kill_id /* E_R Thread8 QNO 571 CON_ID 598 */", length=61, 
    parser_state=0x6d0f794e0c60, is_com_multi=false, is_next_command=false)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:7762
#27 0x000055cc6199f2c7 in dispatch_command (command=COM_QUERY, thd=0x62a00027c208, 
    packet=0x62900300c209 " KILL CONNECTION @kill_id /* E_R Thread8 QNO 571 CON_ID 598 */ ", packet_length=63, 
    is_com_multi=false, is_next_command=false) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1827
#28 0x000055cc6199c0a3 in do_command (thd=0x62a00027c208) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1381
#29 0x000055cc61d21229 in do_handle_one_connection (connect=0x60800003c9a8)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1336
#30 0x000055cc61d20aec in handle_one_connection (arg=0x60800003c9a8)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1241
#31 0x00007ffd358ce609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#32 0x00004bad50df2103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

And thread doing transaction commit:

(rr) c
Continuing.
[rr 1120600 317603]READ of size 8 at 0x5cf337fd3a40 thread T45
[Switching to Thread 1120600.1123140]
 
Thread 37 hit Breakpoint 2, trx_t::free (this=0x54064f26f730)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:365
365	{
(rr) where
#0  trx_t::free (this=0x54064f26f730) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:365
#1  0x000055cc6296b871 in trx_free_for_background (trx=0x5cf337fd5e38)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:554
#2  0x000055cc6296c617 in trx_free_for_mysql (trx=0x5cf337fd5e38)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:709
#3  0x000055cc6256015b in ha_innobase::delete_table (this=0x62b00010abe8, name=0x54064f26f400 "./test/t1", 
    sqlcom=SQLCOM_DROP_TABLE) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:13406
#4  0x000055cc6252de21 in ha_innobase::delete_table (this=0x62b00010abe8, name=0x54064f26f400 "./test/t1")
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:13435
#5  0x000055cc61ffc1b5 in handler::ha_delete_table (this=0x62b00010abe8, name=0x54064f26f400 "./test/t1")
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:4480
#6  0x000055cc61fec7e1 in ha_delete_table (thd=0x62a000210208, table_type=0x614000001248, 
    path=0x54064f26f400 "./test/t1", db=0x62b00010a998 "test", alias=0x62b00010a328 "t1", generate_warning=true)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:2442
#7  0x000055cc61bb720f in mysql_rm_table_no_locks (thd=0x62a000210208, tables=0x62b00010a378, if_exists=true, 
    drop_temporary=false, drop_view=false, dont_log_query=false, dont_free_locks=false)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_table.cc:2447
#8  0x000055cc61bb55f9 in mysql_rm_table (thd=0x62a000210208, tables=0x62b00010a378, if_exists=1 '\001', 
    drop_temporary=0 '\000') at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_table.cc:2089
#9  0x000055cc619b02f7 in mysql_execute_command (thd=0x62a000210208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:4532
#10 0x000055cc619c60ea in mysql_parse (thd=0x62a000210208, 
    rawbuf=0x62b00010a228 "DROP TABLE IF EXISTS t1 /* E_R Thread7 QNO 578 CON_ID 586 */", length=60, 
    parser_state=0x54064f270c60, is_com_multi=false, is_next_command=false)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:7762
#11 0x000055cc6199f2c7 in dispatch_command (command=COM_QUERY, thd=0x62a000210208, 
    packet=0x629002eb8209 "DROP TABLE IF EXISTS t1 /* E_R Thread7 QNO 578 CON_ID 586 */ ", packet_length=61, 
    is_com_multi=false, is_next_command=false) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1827
#12 0x000055cc6199c0a3 in do_command (thd=0x62a000210208) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1381
#13 0x000055cc61d21229 in do_handle_one_connection (connect=0x608000001028)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1336
#14 0x000055cc61d20aec in handle_one_connection (arg=0x608000001028)
--Type <RET> for more, q to quit, c to continue without paging--
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1241
#15 0x00007ffd358ce609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00004bad50df2103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

KILL CONNECTION is holding THD::LOCK_thd_data but that does not help as trx->free() will poison trx::mysql_thd we are trying to reference on innobase_kill_query.

Comment by Marko Mäkelä [ 2020-12-28 ]

jplindst, you did not include full stack traces or any timestamp (when). Let me try to debug the trace for you:

(rr) continue
Thread 3 received signal SIGABRT, Aborted.
[Switching to Thread 1120600.1129198]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(rr) awatch *(byte*)0x0b9ee6ff2748
Hardware access (read/write) watchpoint 1: *(byte*)0x0b9ee6ff2748
(rr) when
Current event: 341241
(rr) rc
Continuing.
(rr) rc
Continuing.
[Switching to Thread 1120600.1123144]
 
Thread 39 hit Hardware access (read/write) watchpoint 1: *(byte*)0x0b9ee6ff2748
 
Value = 0 '\000'
0x000055cc624f733f in innobase_trx_init (thd=0x62a00022e208, 
    trx=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:2738
2738		DBUG_ASSERT(thd == trx->mysql_thd);
(rr) when
Current event: 338120

The trace is somewhat tricky, because at this point of execution (as well as according to the AddressSanitizer output), the memory is actually marked as accessible. The memory was apparently reused for an internal transaction. We must find the point of time when the memory was actually marked as free. It is before the following where the memory was marked allocated:

(rr) watch -l *(byte*)0x0b9ee6ff2748
Hardware watchpoint 2: -location *(byte*)0x0b9ee6ff2748
(rr) disable 1
(rr) rc
Continuing.
Thread 39 hit Hardware watchpoint 2: -location *(byte*)0x0b9ee6ff2748
 
Old value = 0 '\000'
New value = 247 '\367'
__memset_avx2_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:204
204	../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
1: /x *(byte*)0x0b9ee6ff2748 = 0xf7
(rr) when
Current event: 318391
(rr) bt
#0  __memset_avx2_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:204
#1  0x00007acd2bbb3410 in __asan_unpoison_memory_region ()
   from /lib/x86_64-linux-gnu/libasan.so.5
#2  0x000055cc62969fd4 in trx_allocate_for_background ()
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:325
#3  0x000055cc6296a94f in trx_allocate_for_mysql ()
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:483
#4  0x000055cc624f76c3 in innobase_trx_allocate (thd=0x62a00022e208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:2772
#5  0x000055cc624f78af in check_trx_exists (thd=0x62a00022e208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:2797
#6  0x000055cc6255f839 in ha_innobase::delete_table (this=0x62b0000fcbe8, 
    name=0x3b6145a69400 "./test/t1", sqlcom=SQLCOM_DROP_TABLE)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:13272
(rr) rc
Continuing.
[Switching to Thread 1120600.1125736]
 
Thread 45 hit Hardware watchpoint 2: -location *(byte*)0x0b9ee6ff2748
 
Old value = 247 '\367'
New value = 0 '\000'
__memset_avx2_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:244
244	in ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
1: /x *(byte*)0x0b9ee6ff2748 = 0x0
(rr) when
Current event: 318259
(rr) bt
#0  __memset_avx2_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:244
#1  0x00007acd2bbb3022 in __asan_poison_memory_region ()
   from /lib/x86_64-linux-gnu/libasan.so.5
#2  0x000055cc6297db1c in trx_t::free (this=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:422
#3  0x000055cc6296b871 in trx_free_for_background (trx=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:554
#4  0x000055cc6296c617 in trx_free_for_mysql (trx=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:709
#5  0x000055cc6255f3c7 in ha_innobase::create (this=0x61c00024c0a8, 
    name=0x6ca10d073020 "./test/t1", form=0x6ca10d06fdd0, 
    create_info=0x6ca10d073670, file_per_table=true, trx=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:13084
#6  0x000055cc6252cc1e in ha_innobase::create (this=0x61c00024c0a8, 
    name=0x6ca10d073020 "./test/t1", form=0x6ca10d06fdd0, 
    create_info=0x6ca10d073670)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:13115

Even this one looks a little suspicious, because I do not think that the internal transactions of DDL operations would be registered with THD. Let us set one more breakpoint and continue the execution in the forward direction:

(rr) break __asan_report_load8
Breakpoint 3 at 0x7acd2bbb8170
(rr) run 317530
(rr) display/x *(byte*)0x0b9ee6ff2748
(rr) c
Continuing.
[Switching to Thread 1120600.1121344]
 
Thread 33 hit Hardware watchpoint 2: -location *(byte*)0x0b9ee6ff2748
 
Old value = 0 '\000'
New value = 247 '\367'
__memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:246
246	../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
1: /x *(byte*)0x0b9ee6ff2748 = 0xf7
Current event: 317537
(rr) bt
#0  __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:246
#1  0x00007acd2bbb3022 in __asan_poison_memory_region () from /lib/x86_64-linux-gnu/libasan.so.5
#2  0x000055cc6297db1c in trx_t::free (this=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:422
#3  0x000055cc6296b871 in trx_free_for_background (trx=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:554
#4  0x000055cc6296c617 in trx_free_for_mysql (trx=0x5cf337fd2b20)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:709
#5  0x000055cc62501b86 in innobase_close_connection (hton=0x614000001248, thd=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5151
#6  0x000055cc61fe42c0 in closecon_handlerton (thd=0x62a0000ba208, plugin=0x602000055678, unused=0x0)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:818
#7  0x000055cc619eaae9 in plugin_foreach_with_mask (thd=0x62a0000ba208, 
    func=0x55cc61fe41ed <closecon_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=10, arg=0x0)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
#8  0x000055cc61fe4312 in ha_close_connection (thd=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:831
#9  0x000055cc618e723e in THD::free_connection (this=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1477
#10 0x000055cc6174f047 in unlink_thd (thd=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/mysqld.cc:2969
#11 0x000055cc6174fa8f in one_thread_per_connection_end (thd=0x62a0000ba208, put_in_cache=true)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/mysqld.cc:3110
#12 0x000055cc61d21515 in do_handle_one_connection (connect=0x608000000aa8)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1355
#13 0x000055cc61d20aec in handle_one_connection (arg=0x608000000aa8)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1241
#14 0x00007ffd358ce609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00004bad50df2103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This is where the transaction object was freed. Note the THD::free_connection() in the call stack. The following is the first invalid access to the freed object. There could be multiple such accesses in the trace:

(rr) ena
(rr) c
Continuing.
[Switching to Thread 1120600.1129198]
 
Thread 45 hit Hardware access (read/write) watchpoint 1: *(byte*)0x0b9ee6ff2748
 
Value = 247 '\367'
0x000055cc62501ddd in innobase_kill_query (thd=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180
5180	    ut_ad(trx->mysql_thd == thd);
1: /x *(byte*)0x0b9ee6ff2748 = 0xf7
(rr) when
Current event: 317555
(rr) bt
#0  0x000055cc62501ddd in innobase_kill_query (thd=0x62a0000ba208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5180
#1  0x000055cc61fe444b in kill_handlerton (thd=0x62a0000ba208, plugin=0x602000060b58, 
    level=0x6d0f794df5a0) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:843
#2  0x000055cc619eaae9 in plugin_foreach_with_mask (thd=0x62a0000ba208, 
    func=0x55cc61fe4315 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, 
    arg=0x6d0f794df5a0) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
#3  0x000055cc61fe454e in ha_kill_query (thd=0x62a0000ba208, level=THD_ABORT_ASAP)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:850
#4  0x000055cc618e9468 in THD::awake (this=0x62a0000ba208, state_to_set=KILL_CONNECTION_HARD)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1752
#5  0x000055cc619cd3e4 in kill_one_thread (thd=0x62a00027c208, id=578, 
    kill_signal=KILL_CONNECTION_HARD, type=KILL_TYPE_ID)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:8928
#6  0x000055cc619cded8 in sql_kill (thd=0x62a00027c208, id=578, state=KILL_CONNECTION_HARD, 
    type=KILL_TYPE_ID) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:9034
#7  0x000055cc619b57fb in mysql_execute_command (thd=0x62a00027c208)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:5245
#8  0x000055cc619c60ea in mysql_parse (thd=0x62a00027c208, 
    rawbuf=0x62b00016c228 "KILL CONNECTION @kill_id /* E_R Thread8 QNO 571 CON_ID 598 */", length=61, 
    parser_state=0x6d0f794e0c60, is_com_multi=false, is_next_command=false)
    at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:7762

It looks like a race between a client disconnect and KILL to me.

To fix this, you must ensure that both THD::awake() and THD::free_connection() are being protected by a common mutex, to ensure the mutual exclusion of these two threads. Maybe there is some state field in THD that is not being properly protected or checked in THD::awake() to avoid the call. Obviously, THD::awake() should be a no-op after the client was disconnected and the transaction was aborted and freed.

I see the following code in THD::awake():

  mysql_mutex_assert_owner(&LOCK_thd_data);
  mysql_mutex_lock(&LOCK_thd_kill);
  set_killed_no_mutex(state_to_set);
  if (state_to_set != NOT_KILLED)
    ha_kill_query(this, thd_kill_level(this));

Using one more watch, I see that THD::killed is not being modified by the free_connection() call at all.

Here is one more stack trace that I think should point out the problem:

Thread 33 hit Hardware watchpoint 2: -location *(byte*)0x0b9ee6ff2748
 
Old value = 0 '\000'
New value = 247 '\367'
__memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:246
246	../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
1: /x *(byte*)0x0b9ee6ff2748 = 0xf7
Current event: 317537
(rr) thr a 45 bt
 
Thread 45 (Thread 1120600.1129198):
#0  0x00004bad50db0361 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x6d0f794df1f0, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00004bad50db5eb7 in __GI___nanosleep (requested_time=<optimized out>, remaining=<optimized out>) at nanosleep.c:27
#2  0x000055cc626d3a8a in os_thread_sleep (tm=1000) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/os/os0thread.cc:231
#3  0x000055cc62501dbf in innobase_kill_query (thd=0x62a0000ba208) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5179
#4  0x000055cc61fe444b in kill_handlerton (thd=0x62a0000ba208, plugin=0x602000060b58, level=0x6d0f794df5a0) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:843
#5  0x000055cc619eaae9 in plugin_foreach_with_mask (thd=0x62a0000ba208, func=0x55cc61fe4315 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x6d0f794df5a0) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
#6  0x000055cc61fe454e in ha_kill_query (thd=0x62a0000ba208, level=THD_ABORT_ASAP) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:850
#7  0x000055cc618e9468 in THD::awake (this=0x62a0000ba208, state_to_set=KILL_CONNECTION_HARD) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1752
#8  0x000055cc619cd3e4 in kill_one_thread (thd=0x62a00027c208, id=578, kill_signal=KILL_CONNECTION_HARD, type=KILL_TYPE_ID) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:8928
#9  0x000055cc619cded8 in sql_kill (thd=0x62a00027c208, id=578, state=KILL_CONNECTION_HARD, type=KILL_TYPE_ID) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:9034
#10 0x000055cc619b57fb in mysql_execute_command (thd=0x62a00027c208) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:5245
#11 0x000055cc619c60ea in mysql_parse (thd=0x62a00027c208, rawbuf=0x62b00016c228 "KILL CONNECTION @kill_id /* E_R Thread8 QNO 571 CON_ID 598 */", length=61, parser_state=0x6d0f794e0c60, is_com_multi=false, is_next_command=false) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:7762
#12 0x000055cc6199f2c7 in dispatch_command (command=COM_QUERY, thd=0x62a00027c208, packet=0x62900300c209 " KILL CONNECTION @kill_id /* E_R Thread8 QNO 571 CON_ID 598 */ ", packet_length=63, is_com_multi=false, is_next_command=false) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_parse.cc:1827
(rr) thr a 33 bt
Thread 33 (Thread 1120600.1121344):
#0  __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:246
#1  0x00007acd2bbb3022 in __asan_poison_memory_region () from /lib/x86_64-linux-gnu/libasan.so.5
#2  0x000055cc6297db1c in trx_t::free (this=0x5cf337fd2b20) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:422
#3  0x000055cc6296b871 in trx_free_for_background (trx=0x5cf337fd2b20) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:554
#4  0x000055cc6296c617 in trx_free_for_mysql (trx=0x5cf337fd2b20) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/trx/trx0trx.cc:709
#5  0x000055cc62501b86 in innobase_close_connection (hton=0x614000001248, thd=0x62a0000ba208) at /home/mleich/Server/bb-10.2-MDEV-23536/storage/innobase/handler/ha_innodb.cc:5151
#6  0x000055cc61fe42c0 in closecon_handlerton (thd=0x62a0000ba208, plugin=0x602000055678, unused=0x0) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:818
#7  0x000055cc619eaae9 in plugin_foreach_with_mask (thd=0x62a0000ba208, func=0x55cc61fe41ed <closecon_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=10, arg=0x0) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_plugin.cc:2411
#8  0x000055cc61fe4312 in ha_close_connection (thd=0x62a0000ba208) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/handler.cc:831
#9  0x000055cc618e723e in THD::free_connection (this=0x62a0000ba208) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_class.cc:1477
#10 0x000055cc6174f047 in unlink_thd (thd=0x62a0000ba208) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/mysqld.cc:2969
#11 0x000055cc6174fa8f in one_thread_per_connection_end (thd=0x62a0000ba208, put_in_cache=true) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/mysqld.cc:3110
#12 0x000055cc61d21515 in do_handle_one_connection (connect=0x608000000aa8) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1355
#13 0x000055cc61d20aec in handle_one_connection (arg=0x608000000aa8) at /home/mleich/Server/bb-10.2-MDEV-23536/sql/sql_connect.cc:1241
#14 0x00007ffd358ce609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00004bad50df2103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This trace shows that the client disconnect was triggered during the THD::awake() execution.

In the THD::free_connection() I see the following:

1450	  bool is_wsrep= WSREP(this);
1451	  if (is_wsrep) mysql_mutex_lock(&LOCK_thd_data);
1459	  mysql_mutex_lock(&LOCK_thd_kill);
1460	  mysql_mutex_unlock(&LOCK_thd_kill);
1465	  if (is_wsrep) mysql_mutex_unlock(&LOCK_thd_data);
1477	  ha_close_connection(this);

So, this exact race condition is caused by the fact that the ha_close_connection() call is not being protected by a mutex that is protecting THD::awake(). I think that we must hold THD::LOCK:thd_data across that call, no matter whether Galera is disabled at run-time or compilation time.

Comment by Jan Lindström (Inactive) [ 2020-12-29 ]
Comment by Jan Lindström (Inactive) [ 2020-12-30 ]

mleich Can you test again, I think 10.2 first from branch bb-10.2-MDEV-23536

Comment by Matthias Leich [ 2021-01-06 ]

bb-10.2-MDEV-23536 behaved well in RQG testing.
The corresponding test was no more capable to replay any of the error patterns
#       [ 'MDEV-17092' , 'SUMMARY: AddressSanitizer: use-after-poison .{1,150}lock0lock.cc.{1,30} in lock_trx_handle_wait_low' ],
#       # 10.2 HEAD 1d12bff42c7dd73b2a0158d6780267467b9e0871 asan 
#       [ 'VARIANT1' , 'SUMMARY: AddressSanitizer: use-after-poison .{1,150}ha_innodb.cc.{1,30} in innobase_kill_query' ],
#       # 10.2 HEAD 1d12bff42c7dd73b2a0158d6780267467b9e0871 debug 
#       # mysqld: /Server/10.2_old/storage/innobase/handler/ha_innodb.cc:5250: void innobase_kill_query(handlerton*, THD*, thd_kill_levels): Assertion `trx->mysql_thd == thd' failed.
#       [ 'VARIANT2' , 'mysqld: .{1,150}ha_innodb.cc.{1,30} void innobase_kill_query.{1,150}: Assertion .trx->mysql_thd == thd. failed.' ],
 
In addition the test battery for broad range functional coverage did not show new unknown issues.

Comment by Marko Mäkelä [ 2021-01-07 ]

jplindst, I posted a comment about the 10.6 version, which is specific to the "Galera is enabled" scenario. If you find the comment valid and you decide to change something, I’d like to review that.

mleich, thank you for testing the 10.2 version. I think that we will need tests of the 10.3 and 10.6 versions as well, because the locking inside InnoDB has been significantly changed in both.

Comment by Jan Lindström (Inactive) [ 2021-01-07 ]

I will address 10.4+ Galera only issue later, so 10.6 can be tested Galera disabled.

Comment by Jan Lindström (Inactive) [ 2021-01-08 ]

To help merge 10.4: https://github.com/MariaDB/server/commit/545cdf433146782c088062ba3a15d50eb3ddcfdb

Comment by Matthias Leich [ 2021-01-08 ]

The RQG testing with the replay test on bb-10.3-MDEV-23536 + the os_thread_sleep patch showed two new issues.
 
pluto:/data/Results/1610048719/000148.log
 mysqld: sql/sql_table.cc:2428: int mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool, bool, bool): Assertion `thd->mdl_context.is_lock_owner(MDL_key::TABLE, table->db.str, table->table_name.str, MDL_SHARED)' failed.
 
pluto:/data/Results/1610048719/002598.log  (there are more of such results)
  # 2021-01-07T20:52:16 [2879386] ==2926968==LeakSanitizer has encountered a fatal error.
 # 2021-01-07T20:52:16 [2879386] |in/mysqld(_ZN8I_P_ListI10MDL_ticket16I_P_List_adapterIS0_XadL_ZNS0_12next_in_lockEEEXadL_ZNS0_12prev_in_lockEEEE21I_P_List_null_counter23I_P_List_fast_push_backIS0_EE6removeEPS0_+0x140)[0x561e0c85aa4c]
 # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN8MDL_lock11Ticket_list13remove_ticketEP10MDL_ticket+0x27)[0x561e0c85144f]
# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN8MDL_lock13remove_ticketEP7LF_PINSMS_NS_11Ticket_listEP10MDL_ticket+0x54)[0x561e0c851e8a]
# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context12release_lockE17enum_mdl_durationP10MDL_ticket+0x23b)[0x561e0c856787]
# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context27release_locks_stored_beforeE17enum_mdl_durationP10MDL_ticket+0x208)[0x561e0c856bce]
# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context27release_transactional_locksEv+0x1f0)[0x561e0c857a6c]
# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN3THD27release_transactional_locksEv+0x72)[0x561e0c274476]
# 2021-01-07T20:52:16 [2879386] | [bin/mysqld(_Z21mysql_execute_commandP3THD+0x14e8c)[0x561e0c48de3b]
# 2021-01-07T20:52:16 [2879386] | sql/sql_plist.h:129(I_P_List<MDL_ticket, I_P_List_adapter<MDL_ticket, &MDL_ticket::next_in_lock, &MDL_ticket::prev_in_lock>, I_P_List_null_counter, I_P_List_fast_push_back<MDL_ticket> >::remove(MDL_ticket*))[0x561e0c499674]
# 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:1232(MDL_lock::Ticket_list::remove_ticket(MDL_ticket*))[0x561e0c4704b8]
# 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:2632(MDL_context::release_lock(enum_mdl_duration, MDL_ticket*))[0x561e0c46cfef]
# 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:2853(MDL_context::release_transactional_locks())[0x561e0c837872]
# 2021-01-07T20:52:16 [2879386] | /Server_bin/bb-10.3-MDEV-23536_asan/bin/mysqld(handle_one_connection+0x59)[0x561e0c83712c]
...
  during DROP TABLE IF EXISTS ...
 
I tried the replay test on actual 10.3 + the os_thread_sleep patch too.
Issues like above did not show up. But there was a significant higher fraction of replay tests failing.
They all showed some ASAN failure during KILL CONNECTION.
 
Given the fact that
- a higher fraction of replay tests failed on actual 10.3
- bb-10.3-MDEV-23536 performed well in the test battery for broad range coverage of functionality
bb-10.3-MDEV-23536 seems to be better than the actual 10.3.
Maybe the ASAN failure observed on actual 10.3 just prevents to hit the two new issues observed
on bb-10.3-MDEV-23536.
 
RQG testing on bb-10.6-MDEV-23536 + the os_thread_sleep patch
All replay tests passed.
bb-10.6-MDEV-23536 also performed well in the test battery for broad range coverage of functionality.
 
I vote for merge+push.

Comment by Roel Van de Paar [ 2021-01-12 ]

I also did a run against this branch. Here are the things I am seeing, let me know if anything looks related and I can reduce:

One line per issue seen. First field (| separated) can be assert message, then signal, then first 4 frames.

!table->fts|SIGABRT|dict_make_room_in_cache|srv_master_evict_from_table_cache|srv_master_do_idle_tasks|srv_master_callback
[REF_NEXT_COMMENT] SIGSEGV|my_scan_weight_utf8mb3_general_ci|my_strnncollsp_utf8mb3_general_ci|charset_info_st::strnncollsp|sortcmp_lex_string
SIGSEGV|my_strndup|Event_queue_element_for_exec::init|Event_queue::get_top_for_execution_if_time|Event_scheduler::run
[REF_NEXT_COMMENT] binlog_table_maps == 0 || locked_tables_mode == LTM_LOCK_TABLES|SIGABRT|THD::reset_for_next_command|mysql_parse|dispatch_command|do_command
[REF_NEXT_COMMENT] thd->transaction->stmt.is_empty()|SIGABRT|trans_check|trans_rollback|THD::cleanup|unlink_thd
_ZN13MYSQL_BIN_LOG7cleanupEv   (limited stack info, error log only)

I've cleaned up this list to exclude all bugs seen in normal 6.0 runs.

Comment by Marko Mäkelä [ 2021-01-18 ]

Roel, which exact branch did you test? The last 3 assertions look potentially related to this change; the first 3 assertion failures look unrelated.

As far as I can tell, serg pushed a revised fix to 10.2. It has not been merged to later branches yet.

Comment by Roel Van de Paar [ 2021-01-21 ]

marko Thank you!

bb-10.6-MDEV-23536 at rev 5b25f651a961e06d377f2f8ff78ac38dc40f3fb3

Further findings:

SIGSEGV|my_scan_weight_utf8mb3_general_ci|my_strnncollsp_utf8mb3_general_ci|charset_info_st::strnncollsp|sortcmp_lex_string
binlog_table_maps == 0 || locked_tables_mode == LTM_LOCK_TABLES|SIGABRT|THD::reset_for_next_command|mysql_parse|dispatch_command|do_command
thd->transaction->stmt.is_empty()|SIGABRT|trans_check|trans_rollback|THD::cleanup|unlink_thd

> Were found in normal 6.0 also: MDEV-22915, MDEV-22915, MDEV-24706

Generated at Thu Feb 08 09:23:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.