[MDEV-23328] Server hang due to Galera lock conflict resolution Created: 2020-07-29 Updated: 2022-12-16 Resolved: 2021-01-24 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.1, 10.2, 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.2.37, 10.3.28, 10.4.18, 10.5.9, 10.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Marko Mäkelä | Assignee: | Sergei Golubchik |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | hang, not-10.6 | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
When a SQL KILL statement requests a transaction to be aborted at the same time as that transaction is chosen as a victim in the Galera transaction certification process, the server can hang. There have been earlier attempts to fix this problem. A suggested fix for It seems possible that something related to this caused |
| Comments |
| Comment by Jan Lindström (Inactive) [ 2020-07-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The problem in the current implementation can be repeated with the following very high-level test case, which requires at least a 2-node Galera cluster:
The problem is that both the KILL transaction and the applier transaction try to kill the same victim transaction while holding the necessary mutexes. However, the mutexes are taken in a different order in the KILL path than in the applier path, leading to a mutex deadlock. To analyze the bug further, let's use 10.4 and start from the KILL path. Here the user has issued KILL <thread>, where the thread id is the same as the victim transaction's thread id. The interesting part starts in sql_parse.cc, in function kill_one_thread().
If we find the thread we are trying to kill, it is protected from a concurrent kill by the LOCK_thd_kill mutex, and the THD's data members are protected from concurrent modification by the LOCK_thd_data mutex. Thus the user cannot issue another concurrent KILL for this thread; such a KILL statement would have to wait and would then no longer find the thread. However, there is another code path that can kill the same thread. In the problematic scheduling we assume that the user KILL is executed up to the point GAP1 and then pauses. Let's now consider the other code path. For this we need a conflicting transaction that is paused at node_2.
This paused UPDATE is the victim transaction that both the user KILL and the other code path will try to kill. Let's now move to node_2 of the Galera cluster.
We can assume that on node_1 there are no conflicting transactions accessing t1. Thus this transaction passes its certification, and its write set is sent to node_2 to be executed by the applier. On node_2 the update to row (1,0) meets a conflicting lock request, i.e. an exclusive lock. This is detected in lock0lock.cc, in function lock_rec_other_has_conflicting(), when we try to create the lock request for the applier transaction.
At the point where we have found the conflicting lock request, we are holding lock_sys->mutex, which is a global mutex, and the victim transaction's trx_mutex. The lock_sys mutex ensures that no new lock request can be created and that the lock request we found cannot be released. The trx_mutex prevents the victim from committing or aborting. The applier transaction is a BF (brute force) transaction, and we do not want to wait long, so we try to kill the victim transaction to get this record lock released. The killing is done in ha_innodb.cc, in function wsrep_innobase_kill_one_trx().
At the point GAP2 we are still holding the global lock_sys->mutex and the victim transaction's trx_mutex, but these do not prevent the other way of killing the victim thread, i.e. the user KILL statement, from executing. An unlucky CPU schedule on node_2 can first execute the applier (BF) transaction up to GAP2 and pause it. Then it can execute the user KILL transaction up to GAP1 and pause it. Finally, it tries to execute the applier transaction up to GAP3, but it can't: the KILL transaction is holding victim_thd->LOCK_thd_data, and the applier is requesting it, so the applier must wait.
Execution of the user KILL statement in turn pauses when it requests lock_sys->mutex, because that mutex is already held by the applier transaction. Now both the user KILL transaction and the applier transaction have been paused on a mutex wait, each waiting for the other.
This is a mutex deadlock. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-07-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The design for avoiding the mutex deadlock: we will still use thd->wsrep_aborter, but accessed with atomic operations, so that
We need to change both the kill_one_thread() function and the wsrep_innobase_kill_one_trx() function.
The check of wsrep_aborter must be done with an atomic operation with strong consistency: if it is 0 (free), we set our current bf_thd->thread_id as a marker that execution of the kill operation has started. If it is not 0, the executing thread may continue only if bf_thd->thread_id is the same as the wsrep_aborter value.
Here, again, the check of wsrep_aborter must be done with an atomic operation with strong consistency: if it is 0 (free), we set our current bf_thd->thread_id as a marker that execution of the kill operation has started; if it is not 0, the executing thread may continue only if bf_thd->thread_id is the same as the wsrep_aborter value. How does this high-level description of the algorithm avoid the mutex deadlock presented earlier? Recall that the reason for the mutex deadlock was that both the KILL transaction and the applier transaction were trying to lock victim_thd->LOCK_thd_data and lock_sys->mutex. For the user KILL statement we cannot avoid locking victim_thd->LOCK_thd_data, BUT we can avoid the request for lock_sys->mutex in innobase_kill_query(). This is done by the atomic operation on wsrep_aborter before we even enter victim_thd->awake_no_mutex():
Similarly, on the applier side, wsrep_innobase_kill_one_trx() performs the same atomic check before it even requests the victim_thd->LOCK_thd_data mutex. The idea is to control which thread may continue the kill operation, so that only one thread can do it: the first thread that manages to set the wsrep_aborter variable.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-07-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
serg I need some guidance. Firstly, I hope my description of the problem is understandable and detailed enough. Open questions are:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-30 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
jplindst, how are you addressing the comment that svoj made in
He is suggesting to revert the changes that were made in I have a theoretical background on model checking and verification of parallel and distributed systems. The common line of thought is that the more global state is being accessed, the bigger the potential state space is. I believe that it was a mistake to explode the reachable state space further by introducing THD::wsrep_aborter in Concurrency is hard, and there should be no excuse against using freely available tools such as Spin. It has been successfully applied on telecommunication protocols and in space technology. It was even applied on the InnoDB rw_lock_t implementation at one time in the past. I do not think that anyone can meaningfully assess correctness without carefully checking that the implementation conforms to an unambiguous specification, and without checking that the specification itself is correct. When it comes to checking the specification, I think that we must employ exhaustive state space enumeration. Model checking in terms of temporal logic formulae is not needed; we are only interested in the absence of deadlocks or violations of safety invariants. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-07-30 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I can't address the problem that thd->free_connection() can be called concurrently with KILL/thd->awake(). In my opinion that is technical debt that requires its own ticket; that part does not belong to this ticket. I do agree that this technical debt should be fixed, but I'm not sure that I would be the correct assignee to fix it, and the fix might be so big that we should target it only for 10.6. This technical debt belongs to MariaDB to fix; I let serg decide this part, and ralf.gebhardt@mariadb.com FYI. I agree that introducing THD::wsrep_aborter will further explode the reachable state space, and that one could model a simplified version of this problem using formal tools. I'm not familiar with Spin or any other formal-methods tool, and I do not have a theoretical background or experience in model checking and verification of parallel and distributed systems. On this ticket I do understand the current problem and its cause, and I have a test case that reproduces the problem every time. With this complex code there is the possibility of other bugs. I agree that it may not be possible to meaningfully assess the correctness of my suggested fix by just looking at the changes, but at the moment we do not have anything better. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-07-30 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Attached a patch (kill_test.diff) that creates a test case to repeat the currently known problem, and a simple experimental test program (atomics.cc) that uses the compare_exchange_strong method. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-08-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Attached Spin models and verification results. In mdev-pre-21910.pl we have a very simplified model of the original problem, and as expected Spin finds the mutex deadlock; see mdev-pre-21010-spin.txt. In mdev-23328.pl we have a very simplified model of the proposed fix; the verification result is in mdev-23328-spin.txt. There are no deadlocks in this verification. I know that this is a very simplified model, but it does contain the same mutexes as the actual problem. What is missing is the communication between the thread starting a victim kill and the victim thread, but in my understanding that is not the problem in this bug. From InnoDB's point of view, Galera lock conflict resolution is similar to a lock deadlock: both involve at least two threads, one of which is selected as a victim, and that victim is then aborted. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-08-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I filed | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-09-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Current status:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-10-08 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Having as little shared data as possible should make things easier to understand. Likewise, a one-shot flag (which is reset at object creation and cannot be reset once it has been set) is much easier to reason about than a flag that can be reset and set multiple times. Is there an alternative solution that would allow us to remove the field THD::wsrep_aborter that was added in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-10-08 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry, I missed the note in the page that I linked to:
It seems to refer to P0371R1, which states:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2020-10-20 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The scenario Jan posted is a valid one. With a particular scheduling, a manual KILL command and BF aborting may still end up in a mutex deadlock. This anomaly can be avoided by changing the manual KILL processing to do the victim marking up front, before the actual THD::awake call. This victim marking must happen without holding THD::LOCK_thd_data and while holding the victim's InnoDB trx mutex. In this model both manual KILL and BF aborting go through the same mutex locking protocol. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-11-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-11-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This review request is for changes on server code. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-11-12 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko Can you review InnoDB changes? 10.2 and 10.3 are similar and then 10.4 and 10.5 are similar. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by MikaH [ 2020-12-02 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Is this proceeding? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-12-08 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I would like to see the SAFE_MUTEX warnings that would demonstrate the problem and the effectiveness of the solution. In the bb-10.6- It seems to me that both | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Using bb-10.6-
User KILL command hangs here:
The actual SAFE_MUTEX warning looks as follows (it would be good if it showed a more meaningful name, not just "mutex"):
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Rebased versions 14.12.2020:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
monty Can you please review server changes. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The function wsrep_thd_set_wsrep_aborter(), which seems to be at the core of the fix, is not documented by any source code comment and is not even mentioned in the commit message. It appears to be comparing and copying data members of up to two THD objects. How the data members are protected is not documented, neither in source code comments nor in the form of mysql_mutex_assert_owner() assertions. Introducing more complex data structures might fix the hang, but it could also shift or transform the problem and make it harder to diagnose. Do we really have to extend the KILL interface with a Galera-specific mode? It seems to me that the changes to innobase_kill_query() will thwart the | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Added the requested source code comments and assertions, and improved the commit message. Currently I do not know of any other working solution for Galera 4; for Galera 3 we could use a background killing method similar to the one synchronous replication uses, but that change is significantly bigger. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I updated and rebased commits on list posted today. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-12-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Most of my comments were addressed. Unfortunately, no effort to fix the race condition in innobase_kill_query() was made. I think that the proper course of action would be to fix | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2021-01-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
serg Can you review https://github.com/MariaDB/server/commit/59b71d9cfa05f6b1ad3ff5f90433d141a1947582 . This does address Marko's above comment now that | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-01-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
suggestion: introduce THD::LOCK_wsrep_data that will protect all THD::wsrep* members. It will allow us to fix the mutex order violation | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2021-01-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
serg I would not go with that solution; we already had one, but it was refactored away during the 10.4 Galera 4 review, as requested by Svoj. Furthermore, using THD::LOCK_thd_data to protect the THD::wsrep* variables prevents a concurrent thread from manually killing (KILL QUERY|CONNECTION) while another thread accesses the THD::wsrep* variables. I agree that the current implementation prevents the mutex deadlock but does not solve the mutex ordering violation; thus a better solution is required. I looked at 10.2 and at what asynchronous replication does when it needs to kill a victim. In lock0lock.cc, function DeadlockChecker::search(), we hold lock_sys->mutex and call thd_rpl_deadlock_check(). This is in sql_class.cc, and there we take LOCK_slave_background, which looks like a global mutex. A background thread then handles these victims: in handle_slave_background it takes that same LOCK_slave_background and victim_thread->LOCK_thd_data, and calls victim_thd->awake(). It seems to me that we do not hold LOCK_slave_background during THD::awake(). I had a similar approach for 10.2 and it worked fine, but I had problems with 10.4. Do you see some other, better solution, or should I try this background kill again? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2021-01-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Something like https://github.com/MariaDB/server/commit/1b6bd701b30f8297db38e605692aa91aff92a06d | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-01-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
background kill alone doesn't help unless you move the complete wsrep_innobase_kill_one_trx and the complete wsrep_abort_transaction into the background. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2021-01-21 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
OK to push from me; there is just a minor formatting error in wsrep_innobase_kill_one_trx | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-01-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
approved and pushed commit, for the record, https://github.com/MariaDB/server/commit/29bbcac0ee with two preparatory commits | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-02-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In 10.6, this conflicted with On my merge, I also added some condition on trx->state == TRX_STATE_COMMITTED_IN_MEMORY in order to avoid unnecessary work. If a transaction has already been committed or rolled back, it will release its locks in lock_release() and let the waiting thread(s) continue execution. I think that it could make sense to backport these changes to earlier versions as well. In 10.5, I noticed some "long semaphore wait" messages in one of the tests. I believe that this is due to an unnecessary race between the killer thread and the brute-force thread (which keeps requesting the same victim transaction to be killed over and over again, while creating more and more lock requests for something that could have been handled with implicit locking). The CPU usage that I experienced was probably explained by this:
One might imagine that timeout=-1 means ‘infinite’. But, to this version of the galera-4 library, it apparently meant ‘1 second in the past’, causing an immediate return from pthread_cond_timedwait(). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-02-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
My observation about wsrep_sync_wait() turned out to be incorrect. There is a short timeout. The -1 refers to repl.causal_read_timeout from wsrep_provider_options. I checked with perf top that the wsrep_sync_wait() thread was generating very few samples, possibly corresponding to waking up once every few seconds. The CPU was consumed by high-priority ("brute-force") applier thread until I removed the offending lock_cancel_waiting_and_release(lock) call, as noted in my previous comment. I submitted | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Iosif Peterfi [ 2021-03-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This fix in 10.4.18 made the cluster freeze every few days with one node lagging behind. When the lagging node is restarted the cluster activity is resumed. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Walter Doekes [ 2021-03-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Interesting. I appear to have an issue with the new bg_wsrep_kill_trx (in 10.3.28) as well, in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2021-03-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, let's get the facts first. Did you use KILL [QUERY|CONNECTION] from the command line or from an application? If not, then the problem can't be similar to this MDEV. If your cluster freezes, can you please provide a stack dump in a separate MDEV. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Florian Bezdeka [ 2021-03-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
After updating from 10.4.17 to 10.4.18 I have run into the "complete frozen cluster" issue twice now. After reading the changelog I assume the fix for this MDEV introduces a regression. The "follow up" MDEV might be | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Florian Bezdeka [ 2021-08-28 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Why was this issue closed? I can't find a fix for 10.4.x... | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-12-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In 10.6.0, this was fixed in a simpler way by | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Stephan Vos [ 2022-09-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
OK, so to check: has this issue been fixed correctly in 10.5.9, or is it still a problem? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Khai Ping [ 2022-09-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am still seeing this in 10.6.9. Reported in MDEV-29346.