Details
Description
In an environment running Galera Cluster with 6 MariaDB nodes, 1 arbitrator node, some replicas, and a ProxySQL instance, a network issue triggered a state transfer on two nodes.
Afterwards, for reasons we have not identified, almost all transactions hang in one of two states:
- "starting" state, on the COMMIT statement or on "".
- "acquiring total order isolation" state, on the "KILL CONNECTION" statement (the "KILL CONNECTION" was requested by ProxySQL).
We tried to restart the service, but it hung while stopping. ProxySQL detected this node as down and switched the traffic to another node.
The backtrace suggests a kind of "pthread_cond_wait() deadlock": threads are blocked in lock.wait(), called from the enter() function of the commit monitor, during the commit order critical section.
Unfortunately, we have not found a way to reproduce the problem.
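To illustrate the kind of hang the backtrace points at, here is a minimal Python sketch of an ordered monitor in which each sequence number may enter only after the previous one has left. It is a toy model, not Galera's actual commit monitor: the class `OrderedMonitor`, the function `demo`, and the 0.5 second timeout are all hypothetical names and values chosen for illustration. The assumption being illustrated is that if one sequence number never leaves the critical section, every later waiter stays blocked in the condition wait.

```python
import threading


class OrderedMonitor:
    """Toy commit-order monitor: seqno N may enter only after N-1 has left."""

    def __init__(self):
        self._cond = threading.Condition()
        self._last_left = 0  # highest seqno that has left the critical section

    def enter(self, seqno, timeout=None):
        # Block until the previous seqno has left.
        # Returns False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._last_left == seqno - 1, timeout=timeout
            )

    def leave(self, seqno):
        # Record that seqno is done and wake all waiters.
        with self._cond:
            self._last_left = seqno
            self._cond.notify_all()


def demo():
    mon = OrderedMonitor()
    results = {}

    def committer(seqno):
        entered = mon.enter(seqno, timeout=0.5)
        results[seqno] = entered
        if entered:
            mon.leave(seqno)

    # Hypothetical scenario: seqno 2 never arrives, so 3 and 4 cannot enter.
    threads = [threading.Thread(target=committer, args=(s,)) for s in (1, 3, 4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(demo())
```

The timeout exists only so the sketch terminates. With `timeout=None`, the waiters for sequence numbers 3 and 4 would block indefinitely, which resembles the threads stuck in lock.wait() in the backtrace.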
Issue Links
- blocks
  - MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim in trx0trx.h:1065 (Closed)
- causes
  - MDEV-29346 update_rows_log_event hung causing galera cluster failure (Closed)
  - MDEV-30372 Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed (Closed)
- includes
  - MDEV-31075 KILL QUERY maintains nodes data consistency but breaks GTID sequence (Closed)
- relates to
  - MDEV-28472 BF lock wait long for trx - Assertion `mode_ == m_local || transaction_.is_streaming()' failed (Closed)
  - MDEV-29323 Galera ha_abort_transaction is not honored if there are no InnoDB lock conflicts (Open)