[MDEV-25190] Semaphore wait has lasted > 600 seconds; stuck on bg_wsrep_kill_trx Created: 2021-03-18 Updated: 2021-05-15 Resolved: 2021-04-07 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.3.28 |
| Fix Version/s: | 10.3.29 |
| Type: | Bug | Priority: | Major |
| Reporter: | Walter Doekes | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Description |
|
Hi, we've had a bunch of deadlocks (+sigabrt) now that resulted in these logs:
Relevant versions:
I've compared two core dumps:
See the attached dump1.txt and dump2.txt for closer inspection. The thread that appears to unjustly be holding the lock (68 and 97 respectively) has this BT:
Is this a known issue? Is there any additional info I can provide? Cheers, |
| Comments |
| Comment by Walter Doekes [ 2021-03-19 ] |
|
Also of interest:
I poked around some in the dump. Thread 68 in dump 1 is here:
That thread is what the deadlock checker (dump1:thread170) was waiting for:
If I check the changes between 10.3.25 and 10.3.28, I notice that commit 29bbcac adds the bg_wsrep_kill_trx code, where there is only one, not two locks:
But then this commit with the obscurely named message "merge" adds the second lock – adding a lock that we now appear to block on:
.. although it does rhyme with "correct lock order" comment in the first commit:
So.. that would still be the correct order. I'm probably completely off track about now. Why is lock_mutex_enter() causing a conditional wait? And who is supposed to kick it to life? |
| Comment by Jan Lindström (Inactive) [ 2021-03-23 ] |
|
Thank you for your analysis. There is clearly a bug here. In my opinion it should be:
Now we have victim_thread->LOCK_thd_kill |
| Comment by Jan Lindström (Inactive) [ 2021-04-07 ] |
|
There was a clear problem in sql_class.cc in function thd_need_ordering_with as it had
Here the last parameter means that we would lock the other_thd->LOCK_thd_data mutex. This could lead to a mutex deadlock, as we are already holding lock_sys->mutex. The correct ordering of these mutexes is THD::LOCK_thd_data before lock_sys->mutex, not the other way around. This has now been fixed. |
| Comment by Walter Doekes [ 2021-04-07 ] |
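To illustrate the ordering rule described in this comment, here is a minimal standalone C++ sketch. It is not MariaDB code: `FakeThd`, `FakeLockSys`, `ordered_access` and `run_ordered_workers` are hypothetical names invented for the example, and plain `std::mutex` objects stand in for THD::LOCK_thd_data and lock_sys->mutex. It only shows the general idea that a deadlock between two mutexes is avoided when every code path acquires them in the same order.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins; these are NOT the real MariaDB objects.
struct FakeThd {
    std::mutex LOCK_thd_data;  // stands in for THD::LOCK_thd_data
    long touched = 0;          // protected by LOCK_thd_data
};

struct FakeLockSys {
    std::mutex mutex;          // stands in for lock_sys->mutex
    long checks = 0;           // protected by mutex
};

// Every code path takes LOCK_thd_data first and the lock_sys mutex second.
// Taking them in the opposite order on another path is what allows two
// threads to each hold one mutex while waiting forever for the other.
void ordered_access(FakeThd& thd, FakeLockSys& lock_sys) {
    std::lock_guard<std::mutex> thd_guard(thd.LOCK_thd_data);
    std::lock_guard<std::mutex> sys_guard(lock_sys.mutex);
    ++thd.touched;
    ++lock_sys.checks;
}

// Runs two threads that both follow the same lock order and returns the
// total number of completed critical sections.
long run_ordered_workers(int iterations) {
    FakeThd thd;
    FakeLockSys lock_sys;
    auto worker = [&] {
        for (int i = 0; i < iterations; ++i) {
            ordered_access(thd, lock_sys);
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    // Both counters were only ever updated while holding both mutexes.
    assert(thd.touched == lock_sys.checks);
    return lock_sys.checks;
}
```

Calling `run_ordered_workers(1000)` from a test harness completes without hanging because neither thread can hold the second mutex while waiting for the first. When two mutexes genuinely have to be taken together and no fixed order can be guaranteed, standard C++ also offers `std::lock` and `std::scoped_lock`, which acquire several mutexes with a deadlock-avoidance algorithm.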
|
Excellent! We're anxiously awaiting the commit and the new release to show up. Tack! |
| Comment by Walter Doekes [ 2021-05-14 ] |
|
Dear jplindst, I didn't notice any commits in this area at the time, and now I still don't see any changes to sql_class.cc, nor do I see anything related in any changelog. Is this really fixed? |
| Comment by Walter Doekes [ 2021-05-14 ] |
|
Oh, so this was already fixed by 45e33e05e2529e456fc4ce28f9f32fbe1a546526 for That was not obvious to me. Never mind. All clear |
| Comment by Jan Lindström (Inactive) [ 2021-05-15 ] |
|
wdoekes: Based on extensive QA, not all possible problematic cases are fixed in 10.2 and 10.3; see https://jira.mariadb.org/browse/MDEV-25609 |