[MDEV-31448] Killing a replica thread awaiting its GCO can hang/crash a parallel replica Created: 2023-06-09 Updated: 2023-08-25 Resolved: 2023-07-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0 |
| Fix Version/s: | 10.4.31 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Brandon Nesterenko | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | parallelslave | ||
| Description |
|
If a replica thread that is waiting for its GCO to start is killed after other transactions have already started their commit phase, the parallel replica will hang (non-debug builds) or crash on an assertion (debug builds) if the actively committing transactions then fail with an error. This is because the killed replica thread performs GCO cleanup on the previous GCO before that GCO has finished, leading to the same outcome as

For example, consider a replica with three worker threads and three transactions, T1, T2, and T3, grouped into two GCOs as GCO1 {T1, T2} and GCO2 {T3}. T1 is executing, T2 is ready and queued for group commit, and T3 is waiting for its GCO to start. If T3 is killed, it will perform GCO cleanup on GCO1 even though T1 and T2 are still active.

The following MTR test shows this:
|
| Comments |
| Comment by Kristian Nielsen [ 2023-06-11 ] |
|
The fix for this (in the I have implemented what I think is a correct fix here, in my knielsen_faster_stop_slave branch: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave |
| Comment by Andrei Elkin [ 2023-06-13 ] |
|
knielsen, a test that catches a GCO assertion in finish_event_group() is attached. |
| Comment by Andrei Elkin [ 2023-06-15 ] |
|
A commit with the fix has been pushed to bb-10.5-andrei. |
| Comment by Kristian Nielsen [ 2023-06-15 ] |
|
As we discussed on Zulip, there are two different approaches to choose between here:

1. Force each event group to wait for the prior event group to commit or fail, even in case of a kill. This is the approach that Andrei's patch takes.
2. Allow a killed wait_for_prior_commit() to abort the thread out-of-order, and fix finish_event_group() so that it works correctly in the error case where it may complete out-of-order with respect to earlier transactions. I have pushed a patch for this to my branch knielsen_faster_stop_slave, which passes the various tests: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave

I think both approaches have merit; I am still deciding which I prefer. |
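The out-of-order completion problem behind the second approach can be illustrated with a toy model. This is only a sketch in Python, not MariaDB code and not either of the patches: the `Gco` class and the `finish_naive`, `finish_guarded`, and `scenario` functions are hypothetical names invented for illustration, and the real finish_event_group() logic may differ. It replays the T1/T2/T3 scenario from the description, with T3 finishing first because it was killed.

```python
from dataclasses import dataclass, field


@dataclass
class Gco:
    """Toy stand-in for a group commit ordering (hypothetical, not the server struct)."""
    gco_id: int
    members: set                                  # transactions assigned to this GCO
    finished: set = field(default_factory=set)    # transactions that have completed
    cleaned_up: bool = False

    def all_done(self):
        return self.finished == self.members


def finish_naive(gcos, index, txn):
    """Buggy shape: a finishing transaction cleans up every earlier GCO unconditionally."""
    gcos[index].finished.add(txn)
    for earlier in gcos[:index]:
        earlier.cleaned_up = True


def finish_guarded(gcos, index, txn):
    """Guarded shape: clean up GCOs strictly in order, and only once fully finished."""
    gcos[index].finished.add(txn)
    for gco in gcos:
        if not gco.all_done():
            break                                 # an earlier GCO is still active, stop here
        gco.cleaned_up = True


def scenario(finish):
    """Replay the example: GCO1 {T1, T2}, GCO2 {T3}, with T3 killed first."""
    gcos = [Gco(1, {"T1", "T2"}), Gco(2, {"T3"})]
    finish(gcos, 1, "T3")                         # killed T3 completes out-of-order
    gco1_cleaned_early = gcos[0].cleaned_up       # True means T1/T2 lost their GCO
    finish(gcos, 0, "T1")
    finish(gcos, 0, "T2")
    return gco1_cleaned_early, [g.cleaned_up for g in gcos]


if __name__ == "__main__":
    print("naive:  ", scenario(finish_naive))
    print("guarded:", scenario(finish_guarded))
```

In the naive variant GCO1 is marked as cleaned up while T1 and T2 are still active, which corresponds to the situation the description reports. In the guarded variant the killed T3 may finish early, but cleanup is deferred until every earlier GCO has completed.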
| Comment by Kristian Nielsen [ 2023-07-12 ] |
|
Pushed to 10.4 |