Details
Description
A race condition may occur between the execution of transaction commit, and an execution of a KILL statement that would attempt to abort that transaction.
MDEV-17092 worked around this race condition by modifying InnoDB code. After that issue was closed, svoj pointed out that this race condition would better be fixed above the storage engine layer:
If you look carefully into the above, you can conclude that thd->free_connection() can be called concurrently with KILL/thd->awake(). Which is the bug. And it is partially fixed in THD::~THD(), that is destructor waits for KILL completion:
He is quoting this code of THD::~THD() in 10.5:
/* |
Other threads may have a lock on LOCK_thd_kill to ensure that this
|
THD is not deleted while they access it. The following mutex_lock
|
ensures that no one else is using this THD and it's now safe to delete
|
*/
|
if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data); |
mysql_mutex_lock(&LOCK_thd_kill);
|
mysql_mutex_unlock(&LOCK_thd_kill);
|
if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data); |
|
if (!free_connection_done) |
free_connection();
|
And he seems to suggest that the empty critical section should be moved to THD::free_connection(). Note: in 10.2 and 10.3, that code is slightly different:
mysql_mutex_lock(&LOCK_thd_kill);
|
mysql_mutex_unlock(&LOCK_thd_kill);
|
|
#ifdef WITH_WSREP
|
delete wsrep_rgi; |
#endif
|
if (!free_connection_done) |
free_connection();
|
Nevertheless, it seems that we might want to do something like
#if MYSQL_VERSION_ID >= 100400
|
if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data); |
#endif
|
if (!free_connection_done) |
free_connection();
|
else |
{
|
mysql_mutex_lock(&LOCK_thd_kill);
|
mysql_mutex_unlock(&LOCK_thd_kill);
|
}
|
#if MYSQL_VERSION_ID >= 100400
|
if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data); |
#elif defined WITH_WSREP
|
delete wsrep_rgi; |
#endif |
It might turn out that the else branch is not needed. The empty lock/unlock pair would of course be added to THD::free_connection().
It might also turn out that all the Galera-specific changes need to be done in THD::free_connection(). (In that case, we would likely want to assign wsrep_rgi= NULL).
As part of this fix, the trx_t::free() instrumentation that was modified in MDEV-22782 should be tightened: trx_t::mysql_thd and trx_t::state must be poisoned, because innobase_kill_connection() should no longer be invoked on a freed transaction of a freed connection. This should of course be validated with an RQG run similar to the one that reproduced MDEV-17092.
Attachments
Issue Links
- blocks
-
MDEV-23328 Server hang due to Galera lock conflict resolution
- Closed
- relates to
-
MDEV-24624 main.lock_tables_lost_commit failed in buildbot with timeout, Trying to lock mutex at sql/sql_base.cc, line 958, when the mutex was already locked
- Closed
-
MDEV-26014 Race condition between KILL and client disconnect
- Closed
-
MDEV-17092 ERROR: AddressSanitizer: use-after-poison around lock_trx_handle_wait_low
- Closed
-
MDEV-22782 SUMMARY: AddressSanitizer: unknown-crash storage/innobase/trx/trx0trx.cc:566 in trx_t::commit_state()
- Closed
-
MDEV-23328 Server hang due to Galera lock conflict resolution
- Closed
-
MDEV-24561 rpl.rpl_semi_sync_uninstall_plugin fails in bb with "Found wrong usage of mutex 'LOCK_thd_kill' and 'LOCK_show_status'"
- Closed
-
MDEV-29368 Assertion `trx->mysql_thd == thd' failed in innobase_kill_query from process_timers/timer_handler and use-after-poison in innobase_kill_query
- Closed