[MDEV-31427] MariaDB replication server's SQL Thread stuck at 'Waiting for prior transaction to commit' Created: 2023-06-07 Updated: 2023-06-20 Resolved: 2023-06-08
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.6.10 |
| Fix Version/s: | 10.6.11 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mohamed Ismail | Assignee: | Kristian Nielsen |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | replication | ||
| Environment: | hw.system.vendor=HP |
| Attachments: | |
| Issue Links: | |
| Description |
|
Replication often hangs in a one-master, one-replica setup. MariaDB: v10.6.10

Processlist:
Id User Host db Command Time State Info Progress

Slave status:
Slave_IO_State: Waiting for master to send event

Relay log events:
show relaylog events in 'mysql-1-relay-bin.004631' from 201088966 limit 20;

Replica my.cnf:
binlog-format=ROW |
| Comments |
| Comment by Kristian Nielsen [ 2023-06-07 ] |
|
Thanks a lot for the excellent bug report with very good and detailed information! I did a quick initial analysis. What we see is that one replication worker thread is stuck trying to commit, probably the transaction with GTID 0-2-696762572. The remaining 47 worker threads are waiting for the first thread, each with a transaction ready to commit in order one after the other. So from the replication side all looks fine. What is really strange is where the first thread is stuck. It is inside InnoDB trying to write the commit record to the redo log, and it is hanging inside pthread_cond_signal(). That is not something that would normally be expected to be able to hang:
The __lll_lock_wait() sounds like some internal glibc locking. I'll try to find source for pthread_cond_signal() and see what this could be.
I wonder if it is some glibc bug, some MariaDB misuse of pthreads, or something else that could cause such a hang. If the problem is an after-effect of some corruption inside pthread_cond_signal() data structures, the real problem may have occurred earlier and be hard to identify from the hung state...
| Comment by Kristian Nielsen [ 2023-06-07 ] |
|
If I understand correctly, the source code to the pthread_cond_signal() used in RHEL 7.9 should be this: https://elixir.bootlin.com/glibc/glibc-2.17.90/source/nptl/pthread_cond_signal.c

This is a relatively simple function, and there is just one internal mutex taken, presumably the one we hang on. Unfortunately this may make it hard to track down the real problem.

It is interesting that you are able to reproduce reliably (if infrequently). It would be valuable to reproduce this on a slave running some kind of memory debugging (Valgrind / AddressSanitizer / ...), but that is probably quite hard to pull off since, as you wrote, it only reproduces every couple of days.

One datapoint that would be very useful as a start would be stack traces from a couple of other occurrences of the hang, to see if they get stuck in the same place on that same condition variable. That would suggest a problem in that specific code area (InnoDB group_commit_lock::release), which may make it a little easier to track down...
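One way to capture those stack traces when the hang recurs is a non-interactive gdb attach (a command sketch only; the process name `mariadbd` and the output filename are assumptions, and it requires gdb plus debug symbols on the server host):

```shell
# Attach to the hung server, dump all thread backtraces, then detach.
# Attaching briefly stops the process; on an already-hung replica that
# is usually acceptable.
gdb -p "$(pidof mariadbd)" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all backtrace full' \
    > mariadbd-hang-stacks.txt 2>&1
```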
| Comment by Kristian Nielsen [ 2023-06-08 ] |
|
I looked a bit deeper, and I think this is a duplicate of . I also now see that the hang is inside the async completion for the client in the thread pool. The patch for . So I think I'll close this as a duplicate of .

 - Kristian.
| Comment by Kristian Nielsen [ 2023-06-08 ] |
|
Looks like a duplicate of
| Comment by Mohamed Ismail [ 2023-06-08 ] |
|
Thank you for your efforts. We will upgrade to the latest release, v10.6.14.
| Comment by Mohamed Ismail [ 2023-06-08 ] |
|
@knielsen, we would like to know if there is any workaround for this issue until we can plan an upgrade.
| Comment by Kristian Nielsen [ 2023-06-08 ] |
|
mariadbuser, you can try setting --thread-handling=one-thread-per-connection.
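As a my.cnf sketch of that workaround (the `[mysqld]` section name is standard; a server restart is required, since thread_handling cannot be changed at runtime):

```ini
[mysqld]
# Workaround suggested above: bypass the pool-of-threads scheduler
# implicated in the hang by using one thread per client connection.
thread_handling = one-thread-per-connection
```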
| Comment by Andrei Elkin [ 2023-06-09 ] |
|
knielsen, I also looked at this case as presented on our mailing list, to reply with a guess of .
While you must have looked into it deeply, perhaps you can confirm my guess?

Andrei
| Comment by Kristian Nielsen [ 2023-06-09 ] |
|
Elkin, I agree the issue on the mailing list looks precisely like this issue. I wrote a reply on the list.

 - Kristian.