Details
-
Bug
-
Status: Closed (View Workflow)
-
Blocker
-
Resolution: Fixed
-
10.6
Description
In MDEV-24789, we are minimizing the use of lock_sys.latch. It turns out that when the acquisition of an exclusive lock_sys.latch in innobase_kill_query() is replaced with an acquisition of a shared lock_sys.latch, a number of tests would occasionally hang:
- rpl.rpl_parallel_optimistic
- rpl.rpl_parallel_optimistic_xa_lsu_off
- rpl.rpl_parallel_optimistic_nobinlog
It seems that we can work around this bug by making innobase_kill_query() acquire an exclusive lock_sys.latch instead of a shared one. This work-around will obviously hurt performance, and I would think that it is merely reducing the probability of such hangs, instead of fixing them altogether. Until this bug is fixed, we can invoke the work-around whenever thd_need_wait_reports() holds.
Note: thd_need_wait_reports() holds even when no replication is being used, and only the option log_bin is enabled. That condition seems to be necessary, because without it, the test binlog.rpl_parallel_optimistic would hang (fall back to innodb_lock_wait_timeout).
Attachments
Issue Links
- is caused by
-
MDEV-24789 Performance regression after MDEV-24671
-
- Closed
-
- relates to
-
MDEV-24789 Performance regression after MDEV-24671
-
- Closed
-
-
MDEV-24948 thd_need_wait_reports() hurts performance
-
- Open
-
To repeat this, apply the following patch:
diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index 223783a11bc..b5a8d3b7933 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -5444,6 +5444,7 @@ void lock_sys_t::cancel(trx_t *trx, lock_t *lock)
/** Cancel a waiting lock request (if any) when killing a transaction */
void lock_sys_t::cancel(trx_t *trx)
{
+#undef HAVE_REPLICATION
#ifdef HAVE_REPLICATION /* Work around MDEV-25016 (FIXME: remove this!) */
/* Parallel replication tests would occasionally hang if we did not
and run the following:
On my system, I can observe that the CPU load will reduce as more and more tests will hang. Finally, one of the tests will time out. Without the patch, the tests will complete.