[MDEV-30161] InnoDB lock wait timeout watchdog Created: 2022-12-05 Updated: 2023-02-14 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera, Storage Engine - InnoDB |
| Affects Version/s: | 10.6 |
| Fix Version/s: | 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jan Lindström (Inactive) | Assignee: | Seppo Jaakola |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Description |
|
Transactions executed by appliers have already passed certification in the cluster, so they must be applied and committed successfully. These transactions are called brute force (BF) transactions. Normally, when a BF transaction requests a conflicting lock held by a local transaction, the local transaction is aborted and rolled back. However, BF aborting of local transactions may occasionally fail due to race conditions or unforeseen behavior of the lock manager, which can leave appliers waiting for locks indefinitely. There have been bug reports containing the error log message "WSREP: BF lock wait long for trx:" (e.g. …).

To resolve indefinite applier waits, a short applier lock wait timeout is introduced. However, instead of giving up on the lock wait, a background thread retries the BF abort on behalf of the applier that is waiting for the lock. The applier lock wait timeout is controlled by the variable innodb_wsrep_applier_lock_wait_timeout, with a default value of five seconds. If the value is zero, the background BF aborting is disabled. In the Galera suite MTR test configuration, innodb_wsrep_applier_lock_wait_timeout is set to zero to avoid non-deterministic behavior. |
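As a sketch, the new setting could be configured like this (the variable name and the semantics of zero come from this MDEV; the file layout and chosen value are illustrative):

```ini
# my.cnf fragment (illustrative). Appliers that have waited on a lock
# longer than this many seconds get their BF abort retried by the
# background watchdog thread; 0 disables the watchdog entirely,
# as done in the Galera MTR suite configuration.
[mysqld]
innodb_wsrep_applier_lock_wait_timeout = 5
```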
| Comments |
| Comment by Marko Mäkelä [ 2022-12-10 ] |
|
Before we can proceed with this, I think that we must thoroughly understand the reason of the hang. I checked /qa-dev/BF_abort_3_threads/node2/core that ramesh produced. The crashing Thread 1 (due to a lock wait timeout in a brute force applier) is blocked by a lock held by Thread 11, Thread 20, or Thread 21. I found these threads by following a hash bucket chain that matched the lock:
Threads 11, 20, and 21 are all waiting for something inside Galera, deep inside wsrep_before_commit(). I think that we must find out why the transaction abort mechanism does not work as intended. For that, an rr record trace of this failure would be helpful. |
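The hash bucket walk done by hand in gdb above can be sketched in simplified form. This is a minimal model, not the real InnoDB structures (those live in lock0lock and use lock_sys's hash tables); all names here are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified model of an InnoDB-style record lock in a hash bucket chain.
// All locks on the same page hash to the same bucket, so walking the chain
// and filtering by (space, page) finds every lock on that page.
struct Lock {
  uint64_t trx_id;      // owning transaction
  uint32_t space, page; // page the record lock covers
  bool is_waiting;      // a waiting lock request, as opposed to a granted lock
  Lock* hash_next;      // next lock in the same hash bucket chain
};

// Collect the transactions holding granted locks on a given page --
// the set of potential blockers of a waiting BF transaction.
std::vector<uint64_t> holders_on_page(Lock* bucket,
                                      uint32_t space, uint32_t page) {
  std::vector<uint64_t> out;
  for (Lock* l = bucket; l; l = l->hash_next)
    if (l->space == space && l->page == page && !l->is_waiting)
      out.push_back(l->trx_id);
  return out;
}
```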
| Comment by Marko Mäkelä [ 2022-12-12 ] |
|
As far as I understand, the actual problem is that the Galera conflict resolution does not work. The proposed addition of a configuration parameter might work around it, but it is unclear to me how it would work. I think that we first need to understand the actual problem. ramesh was able to reproduce the problem on a cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo build using rr record. I found the root cause in the trace:
This will stop the execution at the point where a high-priority transaction starts to wait for a lock. Here are some commands and their output that I used for gathering more detail:
At the time of the lock wait, we have these record locks in the system. I checked each hash bucket chain to find all conflicting locks. Only one hash bucket chain matched this page number (0xa9 in tablespace 8), and each element in the bucket chain is for this page:
I single-stepped through lock_wait_wsrep(), and it constructed the following collections:
I compared that to the transaction pointers in the chain:
In this chain, we find exactly the 4 victim transaction pointers, plus the high-priority transaction 0xe317fe13280 that is trying to kill them. Nothing was missed. If I continue the execution to the crash, I will see that the list starts with 0xe317fe1b680 and 0xe317fe13280. Why was 0xe317fe1b680 not killed? Let us check which thread is handling it:
By searching the output of thread apply all backtrace for this pointer, I will find one thread. It was Thread 31 in my execution:
We will see that this thread will remain blocked (and blocking the high-priority transaction) until the very end of the execution:
To fix this bug, lock_wait_wsrep_kill() needs to wake gu_cond_wait_SYS() from its sleep, to let the transaction be rolled back. Possibly, an error needs to be propagated all the way up to wsrep_before_commit(). |
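The fix direction described above can be sketched with standard condition variables. This is only an illustration of the wake-up mechanism: the struct and function names below are hypothetical stand-ins for gu_cond_wait_SYS() and the killer's side of lock_wait_wsrep_kill(), not the real Galera API:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal sketch: the victim sleeps on a condition variable (standing in
// for gu_cond_wait_SYS()), and the BF killer must set an abort flag and
// signal the condition so the victim wakes up and can roll back.
struct VictimWait {
  std::mutex m;
  std::condition_variable cv;
  bool must_abort = false;

  // Called by the victim thread. Returns true when woken for a BF abort,
  // so the error can be propagated up toward wsrep_before_commit().
  bool wait_for_commit_order() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return must_abort; });
    return must_abort;
  }

  // Called on behalf of the high-priority transaction: wake the victim.
  void bf_abort() {
    { std::lock_guard<std::mutex> lk(m); must_abort = true; }
    cv.notify_all();
  }
};
```

The predicate form of cv.wait() makes the wake-up safe even if bf_abort() runs before the victim starts waiting.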
| Comment by Seppo Jaakola [ 2022-12-13 ] |
|
marko, thanks for the detailed gdb session, much appreciated. The bottom line is indeed that one of the blocking victims was never aborted, or the abort was not successful. The purpose of this MDEV, however, is not to implement replication conflict resolution. Conflicts should be resolved by the high-priority operator (usually the replication applier) when the lock conflict is first encountered. This conflict resolution broke at some point, and work to remedy the damage started a couple of months ago; fixes have been submitted as 3 separate pull requests as part of MDEV-29684. The actual purpose of this MDEV, on the other hand, is to provide replication monitoring within the MariaDB server: diagnostics for troubleshooting purposes, and releasing blocked replication appliers if eternal hanging is observed. |
| Comment by Jan Lindström (Inactive) [ 2022-12-19 ] |
|
seppo We applied your PRs on branch bb-10.6-
|
| Comment by Jan Lindström (Inactive) [ 2022-12-20 ] |
|
At lock_wait_wsrep(), the BF transaction has the following victims:
where the first value is the thread_id and the second is trx->id. Of these, the interesting one is naturally the thread with id 7. |
| Comment by Seppo Jaakola [ 2022-12-21 ] |
|
Note that this PR does not fix any possible issues in cluster conflict resolution. It just implements a watchdog thread, which monitors the progress of replication applying, logging situations of delayed applying and potentially resolving a stuck applier state when BF aborting was not successful. |
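The watchdog's periodic decision step could look roughly like the sketch below. This is an illustration of the described behavior only, not the actual patch; the struct and function names are hypothetical:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Illustrative model of one watchdog pass: an applier that has waited on a
// lock longer than innodb_wsrep_applier_lock_wait_timeout gets its BF abort
// retried by the background thread; a timeout of 0 disables the watchdog.
struct Waiter {
  uint64_t trx_id;              // waiting applier transaction
  std::chrono::seconds waited;  // how long it has been blocked
};

std::vector<uint64_t> pick_retry_candidates(const std::vector<Waiter>& waiters,
                                            std::chrono::seconds timeout) {
  std::vector<uint64_t> retry;
  if (timeout.count() == 0)
    return retry;  // watchdog disabled, never intervene
  for (const auto& w : waiters)
    if (w.waited >= timeout)
      retry.push_back(w.trx_id);  // retry the BF abort for this applier
  return retry;
}
```

A real watchdog thread would run such a check on a timer and then re-invoke the BF abort path for each candidate, logging the delayed appliers it finds.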