[MDEV-29293] MariaDB stuck on starting commit state (waiting on commit order critical section) Created: 2022-08-11 Updated: 2023-11-27 Resolved: 2023-05-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.5.15, 10.6.12 |
| Fix Version/s: | 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | William Welter | Assignee: | Julius Goryavsky |
| Resolution: | Fixed | Votes: | 5 |
| Labels: | None | ||
| Environment: |
Galera version: 26.4.6-buster |
|
| Attachments: |
|
| Issue Links: |
|
| Description |
|
In an environment running Galera Cluster with 6 MariaDB nodes, 1 arbitrator node, some replicas and a ProxySQL, after a network issue that triggered a state transfer on two nodes, the cluster became stuck with sessions piling up in the "starting commit" state.
Looking at the backtrace, it appears we have a kind of "pthread_cond_wait() deadlock", executed by lock.wait() in the enter() function of the commit monitor during the commit order critical section. Unfortunately, we did not find a way to reproduce the problem. |
| Comments |
| Comment by Carl Dobson [ 2022-09-01 ] |
|
I have just come across this issue when trying to move a DB cluster from a Percona cluster into MariaDB using logical backups. After the applications had been running for a while, I ended up with hundreds of processes stuck in the "starting commit" state; attached is a redacted sample of the process list (process-list-sample.txt). I have restarted the cluster and enabled wsrep debug to try to get some additional information about what is happening when it locks up into this state. Version information is: |
| Comment by Khai Ping [ 2022-10-14 ] |
|
I am seeing this in my cluster as well. There will be a system user that is stuck in the "committing" state. How could this lead to the whole Galera cluster getting stuck/hung? |
| Comment by Seppo Jaakola [ 2023-02-24 ] |
|
A probably similar hang was reproduced by using a conflicting sysbench load and DDL (TOI mode replication); no ProxySQL was involved in the test scenario. |
| Comment by Seppo Jaakola [ 2023-03-01 ] |
|
We can now reliably reproduce the cluster hang, which is due to a deadlock between KILL CONNECTION execution and a replication applier performing a victim abort (for the connection which is the target of the KILL command). However, the stack traces of this hang are different from the stack traces attached to this MDEV. If the attached stack traces were recorded before the problem had actually started, then we are dealing with matching problems. |
| Comment by Seppo Jaakola [ 2023-04-12 ] |
|
Review cycle and related fixes are still ongoing. The pull request and reviews for the PR can be tracked here: https://github.com/codership/mariadb-server/pull/293 |
| Comment by Khai Ping [ 2023-04-13 ] |
|
@seppo, can this happen on 10.6.5 as well? My cluster is on 10.6.5 |
| Comment by Teemu Ollakka [ 2023-04-18 ] |
|
Pull request opened here https://jira.mariadb.org/browse/MDEV-29293. |
| Comment by Marko Mäkelä [ 2023-04-18 ] |
|
I see that it has previously been claimed that this bug does not affect MariaDB Server 10.6 or later. Please clarify what should be done on the merge to 10.6. If it is anything other than a null-merge (discarding the changes), we need to review and test the 10.6 version as well. Am I right that this is basically yet another attempt at fixing
| Comment by Marko Mäkelä [ 2023-04-24 ] |
|
These changes seem to cause the test perfschema.nesting to fail. I reviewed the InnoDB changes in the 10.6 version of this (PR#2609), and I think that there is some room for race conditions. |
| Comment by Jan Lindström [ 2023-05-03 ] |
|
The latest version of the PR addresses Marko's review comments and the test failure. However, Marko reviewed only the 10.6 and InnoDB changes, so a review of the SQL layer is still needed. |
| Comment by Oleksandr Byelkin [ 2023-05-15 ] |
|
Looks OK to me. |
| Comment by Julius Goryavsky [ 2023-05-22 ] |
|
Fix merged with head revision: |