[MDEV-13915] STOP SLAVE takes very long time on a busy system Created: 2017-09-27 Updated: 2023-09-11 Resolved: 2023-07-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0, 10.1, 10.2.8, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8 |
| Fix Version/s: | 10.8.8, 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Michael Xu | Assignee: | Brandon Nesterenko |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | None |
| Environment: | CentOS 7.4 x86_64 |
| Description |
|
slave machine configuration:
Is that normal? |
| Comments |
| Comment by Elena Stepanova [ 2017-09-30 ] |
|
Indeed, parallel replication can make STOP SLAVE take much longer if sizable transactions are running. The test case below demonstrates that. It updates many rows on the master in a table without a PK, so that RBR is really slow, then waits until the updates start running on the slave, executes STOP SLAVE, waits until it has finished, and checks what has happened to the contents of the table on the slave. If the test is run without parallel replication, STOP SLAVE finishes very fast, and the contents of the table remain unchanged – that is, the updates on the slave are interrupted and not committed (rolled back). I expect it to be a design choice of parallel replication; hopefully Elkin will check it and confirm (or not). Note: the test case below is for reproducing only, do not put it into the regression suite!
|
| Comment by Angelique Sklavounos (Inactive) [ 2022-02-09 ] |
|
I ran the test case above against 10.8.2-debug (commit 12cd3dc78d2a58a15377000a7a8adb92d4fa74fb) and could see the same 10x increase in time with
as with 10.2.43-debug (commit 941bc7053616d5ca8c9e6538828c4a65802e8c3d). |
| Comment by Brandon Nesterenko [ 2023-03-06 ] |
|
Hi woosang, sure. It won't be done in time for the release this week, but we are planning on having it done in the next release (planned for June 8). Do they need a custom build to have that sooner? |
| Comment by Brandon Nesterenko [ 2023-03-08 ] |
|
Hi Andrei! This is ready for review: PR-2534. |
| Comment by Andrei Elkin [ 2023-03-13 ] |
|
A round of review is done. Waiting for a new patch. |
| Comment by Chanjong Yu [ 2023-04-03 ] |
|
The customer is asking whether we have tested this fix with the master in the same state they encountered on Feb 2nd at 13:33:07. IMHO, these are available options. Please refer to the SR (CS0537360) woosangcc |
| Comment by Brandon Nesterenko [ 2023-04-03 ] |
|
Hi chanjong.yu! I'm still working on the fix, but generally speaking, it should work with the version of their master, as it is a change to the replica side only. |
| Comment by Brandon Nesterenko [ 2023-04-21 ] |
|
Adjustments for the first round of review have been made in PR 2534. |
| Comment by Andrei Elkin [ 2023-05-29 ] |
|
The overall patch looks pretty good. |
| Comment by Brandon Nesterenko [ 2023-06-06 ] |
|
Pushed into 10.4 as 0a99d457b. Merge conflicts were observed while manually cherry-picking into 10.5 and 10.9; the fixes are in branches bb-10.5-MDEV-13915-mergefix and bb-10.9-MDEV-13915-mergefix, respectively. |
| Comment by Kristian Nielsen [ 2023-06-06 ] |
|
Please don't do this. And certainly not in a GA release! Edited: Ok, apparently we do that for non-parallel, but only when reading the next event. |
| Comment by Andrei Elkin [ 2023-06-06 ] |
|
knielsen, salve. First, my apologies that you were not explicitly notified about the patch for review. It's not terminally late now, I hope, provided you have time. It turns out the 10.4 push was our mistake in the version that the support case needs. We are then free to play it safe and revert it from 10.4. As for the technical part of the patch: yes, its idea is to make the sequential and parallel STOP:s work consistently, The patch is being reverted in 10.4 and is now being rebased on top of 10.5. bnestere and I can discuss it interactively on zulip if you need that. Cheers, |
| Comment by Brandon Nesterenko [ 2023-06-06 ] |
|
Re-opening to address additional findings |
| Comment by Kristian Nielsen [ 2023-06-10 ] |
|
Something that is missing from the discussion here is that the main reason STOP SLAVE is slow in parallel replication is not that it doesn't roll back running transactions. The main problem is that in many cases parallel replication will replicate all queued events (@@slave_parallel_max_queued). I think this is a left-over from when only conservative mode existed. The current STOP SLAVE mechanism is seen in do_gco_wait(): it continues until the current GCO is completed (wait_count > entry->stop_count). But in optimistic mode, the GCO can be very large, potentially all queued events, so the stop is delayed longer than needed. I think a much simpler solution is to fix this, so that stop_count is initialised to largest_started_sub_id and compared against rgi->gtid_sub_id. This will not roll back an existing long-running transaction, but I think that's actually good. Forcing an immediate stop will cause massive rollback when many threads are configured (Jean Francois Gagné tested using > 1000 threads), which seems undesirable. And forcing a stop does not guarantee a fast stop anyway: a long-running statement will not be aborted. |
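The difference between the two stop rules described above can be illustrated with a toy model. This is only a sketch in Python, not the server's C++ code: the `QueuedTxn` class and both helper functions are hypothetical, and the only names taken from the comment are the concepts of a GCO, `largest_started_sub_id`, and a per-transaction sub_id (`rgi->gtid_sub_id`). It assumes a single replication domain and one large GCO, as can happen in optimistic mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedTxn:
    """Hypothetical stand-in for one queued event group on the replica."""
    sub_id: int  # increasing id, modelling rgi->gtid_sub_id
    gco: int     # which group-commit-ordering (GCO) bucket it belongs to


def runs_after_stop_gco(txns, current_gco):
    """GCO-based rule: everything up to the end of the current GCO still runs."""
    return [t.sub_id for t in txns if t.gco <= current_gco]


def runs_after_stop_sub_id(txns, largest_started_sub_id):
    """Proposed rule: only transactions that had already started still run."""
    return [t.sub_id for t in txns if t.sub_id <= largest_started_sub_id]


# Assumed scenario: 1000 queued transactions share one GCO,
# and 4 of them had started when STOP SLAVE was issued.
queue = [QueuedTxn(sub_id=i, gco=1) for i in range(1, 1001)]

old_rule = runs_after_stop_gco(queue, current_gco=1)
new_rule = runs_after_stop_sub_id(queue, largest_started_sub_id=4)

print(len(old_rule), len(new_rule))  # 1000 4
```

Under these assumptions the GCO-based rule drains the whole queue before stopping, while the sub_id-based rule only lets the four already-started transactions finish, and nothing is rolled back in either case.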
| Comment by Kristian Nielsen [ 2023-06-10 ] |
|
I implemented a much simpler fix for this in the branch knielsen_faster_stop_slave: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave |
| Comment by Kristian Nielsen [ 2023-06-11 ] |
|
And a better fix for the https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave |
| Comment by Kristian Nielsen [ 2023-06-21 ] |
|
I have now pushed another patch to the branch knielsen_faster_stop_slave: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave This implements a STOP SLAVE FORCE option, which can be used to optionally force a quick slave stop by rolling back all active transactions on the next event or row operation, if no non-transactional event groups are blocking it. I included the test cases from Brandon's pull request; they pass with this patch (replacing STOP SLAVE with STOP SLAVE FORCE). I think this is the way to go if we want this rollback functionality, leaving the option to not roll back a lot of work needlessly. I also think this is a new feature and should only go to a development version (11.1?). Comments welcome, of course. The simple fix earlier on my branch (without forcing rollback) should be sufficient to solve the user's/customer's problem. I think that can go in 10.5 if we want (or even 10.4, it should be safe). Some of the other bug fixes on the branch related to stop may also be appropriate for some earlier branch, suggestions welcome. |
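One possible reading of the STOP SLAVE FORCE condition above, as a toy Python sketch. It is not taken from the patch: `ActiveGroup`, `plan_force_stop`, and the GTID-like labels are hypothetical, and what the server actually does when a non-transactional event group is in progress is not specified here, so the sketch only reports it as "blocked".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveGroup:
    """Hypothetical description of an event group a worker is applying."""
    gtid: str            # illustrative label only
    transactional: bool  # False if it changed a non-transactional table


def plan_force_stop(active_groups):
    """Return ("rollback", groups) if every active group can be rolled back,
    else ("blocked", blockers) naming the non-transactional groups."""
    blockers = [g.gtid for g in active_groups if not g.transactional]
    if blockers:
        # Non-transactional changes cannot be undone by a rollback.
        return ("blocked", blockers)
    return ("rollback", [g.gtid for g in active_groups])


all_transactional = [ActiveGroup("0-1-10", True), ActiveGroup("0-1-11", True)]
mixed = all_transactional + [ActiveGroup("0-1-12", False)]

print(plan_force_stop(all_transactional))  # ('rollback', ['0-1-10', '0-1-11'])
print(plan_force_stop(mixed))              # ('blocked', ['0-1-12'])
```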
| Comment by Kristian Nielsen [ 2023-07-12 ] |
|
I have pushed the fix for slow STOP SLAVE to 10.4 (the STOP SLAVE FORCE feature can go to a development branch later if we wish). - Kristian. |