Description
In an environment running Galera Cluster with 6 MariaDB nodes, 1 arbitrator node, some replicas, and a ProxySQL instance, a network issue triggered a state transfer on two nodes.
Afterwards, for reasons we could not determine, almost all transactions hung in one of these states:
- "starting" state on the commit statement or on "".
- "acquiring total order isolation" on the "KILL CONNECTION" statement (the "KILL CONNECTION" was requested by ProxySQL).
We tried to restart the service, but it hung while stopping. ProxySQL detected this node as down and switched the traffic to another node.
The backtrace suggests a kind of "pthread_cond_wait() deadlock": the threads block in lock.wait(), called from the enter() function of the commit monitor, during the commit order critical section.
Unfortunately, we have not found a way to reproduce the problem.
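To illustrate the kind of wait seen in the backtrace, here is a minimal Python sketch of an ordered monitor. It is a simplified model and not Galera's actual code: the class name CommitOrderMonitor, the timeout parameter, and the demo() helper are assumptions made for illustration only. The idea is that sequence number N may enter the critical section only after N-1 has left, so if one transaction never leaves, every later one waits on the condition variable indefinitely.

```python
import threading


class CommitOrderMonitor:
    """Toy ordered monitor: seqno N may enter only after seqno N-1 has left.

    Hypothetical model for illustration; not Galera's implementation.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._last_left = 0  # highest seqno that has left the critical section

    def enter(self, seqno, timeout=None):
        """Wait until it is seqno's turn. Returns False if the wait timed out.

        With timeout=None this blocks forever if the predecessor never leaves,
        which is the hang modelled here.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._last_left + 1 == seqno, timeout=timeout
            )

    def leave(self, seqno):
        """Record that seqno has left and wake all waiters."""
        with self._cond:
            self._last_left = seqno
            self._cond.notify_all()


def demo():
    monitor = CommitOrderMonitor()
    first = monitor.enter(1, timeout=0.1)   # seqno 1 may enter at once
    stuck = monitor.enter(2, timeout=0.1)   # seqno 1 has not left, so this times out
    monitor.leave(1)                        # seqno 1 leaves the critical section
    after = monitor.enter(2, timeout=0.1)   # now seqno 2 may enter
    return first, stuck, after


if __name__ == "__main__":
    print(demo())
```

In this model the second enter() only returns because a timeout is passed. A wait without a timeout, as pthread_cond_wait() is, would stay blocked until some other thread calls leave() for the preceding sequence number.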
Issue Links
- blocks
  - MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim in trx0trx.h:1065 (Closed)
- causes
  - MDEV-29346 update_rows_log_event hung causing galera cluster failure (Closed)
  - MDEV-30372 Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed (Closed)
- includes
  - MDEV-31075 KILL QUERY maintains nodes data consistency but breaks GTID sequence (Closed)
- relates to
  - MDEV-28472 BF lock wait long for trx - Assertion `mode_ == m_local || transaction_.is_streaming()' failed (Closed)
  - MDEV-29323 Galera ha_abort_transaction is not honored if there are no InnoDB lock conflicts (Open)
I have just come across this issue while trying to move a DB cluster from a Percona cluster to MariaDB using logical backups.
After the applications had been running for a while, I ended up with hundreds of processes stuck in the starting commit state. A redacted sample of the process list is attached as process-list-sample.txt.
I have restarted the cluster and enabled wsrep debug to try to get some additional information about what is happening when it locks up in this state.
Version information is:
OS: Debian 11
Kernel: 5.10.0-16-amd64 #1 SMP Debian 5.10.127-1
MariaDB: 10.5.15-0+deb11u1
Galera: 26.4.11-0+deb11u1
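
For anyone triaging a similar hang, a small Python sketch like the one below can tally a process list by command and state, which makes it easy to see how many sessions are stuck where. It assumes the rows are already available as dictionaries keyed by the SHOW PROCESSLIST column names (for example from a dictionary-style cursor). The function name summarize_states and the sample rows are invented for illustration and are not taken from process-list-sample.txt.

```python
from collections import Counter


def summarize_states(rows):
    """Count sessions per (Command, State) pair, most frequent first."""
    counts = Counter(
        (row.get("Command") or "", row.get("State") or "") for row in rows
    )
    return counts.most_common()


# Invented rows for illustration only; not real data from the attachment.
sample = [
    {"Command": "Query", "State": "starting", "Info": "COMMIT"},
    {"Command": "Query", "State": "starting", "Info": "COMMIT"},
    {
        "Command": "Query",
        "State": "acquiring total order isolation",
        "Info": "KILL CONNECTION 123",
    },
    {"Command": "Sleep", "State": "", "Info": None},
]

summary = summarize_states(sample)

if __name__ == "__main__":
    for (command, state), count in summary:
        print(f"{count:5d}  {command:<8} {state}")
```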