[MDEV-13915] STOP SLAVE takes very long time on a busy system - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.2.8, 10.0(EOL), 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL)
Fix Version/s: 10.8.8, 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1
Component/s: Replication
Labels: None
Environment: CentOS 7.4 x86_64

Description

150     system user             NULL    Slave_IO        66361   Waiting for master to send event        NULL    0.000
152     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
153     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
154     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
155     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
156     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
157     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
158     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
159     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
160     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
162     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
161     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
163     system user             NULL    Slave_worker    0       Write_rows_log_event::write_row(-1)     NULL    0.000
151     system user             NULL    Slave_SQL       1090    Reading event from the relay log        NULL    0.000
2415    root    127.0.0.1:42392 NULL    Query   835     Killing slave   STOP SLAVE      0.000

Slave machine configuration:

sync_master_info          = 500000
sync_relay_log            = 100000
sync_relay_log_info       = 500000
slave_parallel_max_queued = 67108864
slave_parallel_mode       = optimistic
slave_parallel_threads    = 12

Is that normal?
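
For a sense of scale, the queue settings in the configuration above can be turned into a rough upper bound on how much event data may be waiting when STOP SLAVE is issued. This is a minimal sketch, assuming that slave_parallel_max_queued is a per-worker-thread limit in bytes (an assumption to verify against the MariaDB documentation for your version); the function name is hypothetical.

```python
# Values taken from the reported slave configuration.
SLAVE_PARALLEL_MAX_QUEUED = 67108864  # bytes
SLAVE_PARALLEL_THREADS = 12


def max_queued_bytes(per_thread_limit: int, threads: int) -> int:
    """Upper bound on queued event data, ASSUMING the limit is per worker thread."""
    return per_thread_limit * threads


total = max_queued_bytes(SLAVE_PARALLEL_MAX_QUEUED, SLAVE_PARALLEL_THREADS)
print(f"{total} bytes = {total // 2**20} MiB")
```

Under that assumption, up to 768 MiB of events could already be queued, which gives an idea of how much work a stop may have to wait for if everything queued is applied first.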

Issue Links

blocks: MDEV-30458 Consolidate Serial Replica to Parallel Replica with 1 Worker Thread (Open)

Activity

Kristian Nielsen added a comment - 2023-06-10 18:42

Something that is missing from the discussion here is that the main reason STOP SLAVE is slow in parallel replication is not because it doesn't roll back running transactions. The main problem is that in many cases parallel replication will replicate all queued events (@@slave_parallel_max_queued).

I think this is a leftover from when only conservative mode existed. The current STOP SLAVE mechanism can be seen in do_gco_wait(): it continues until the current GCO is completed (wait_count > entry->stop_count). But in optimistic mode the GCO can be very large, potentially spanning all queued events, so the stop is delayed longer than needed.

I think a much simpler solution is to fix this, so that stop_count is initialised to largest_started_sub_id, and compared against rgi->gtid_sub_id.

This will not roll back an existing long-running transaction, but I think that's actually good. Forcing an immediate stop will cause massive rollback when many threads are configured (Jean Francois Gagné tested using > 1000 threads), which seems undesirable. And forcing a stop does not guarantee a fast stop anyway; a long-running statement will not be aborted.
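
The difference between the two stop conditions discussed in this comment can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the server code: QueuedTxn, applied_after_stop_gco and applied_after_stop_sub_id are hypothetical names, and the real check in do_gco_wait() works on wait_count and entry->stop_count rather than on a list of transactions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedTxn:
    sub_id: int  # commit-order position (stand-in for rgi->gtid_sub_id)
    gco: int     # group commit ordering the transaction belongs to


def applied_after_stop_gco(queue, current_gco):
    """Simplified old rule: everything in the current GCO is still applied."""
    return [t.sub_id for t in queue if t.gco <= current_gco]


def applied_after_stop_sub_id(queue, largest_started_sub_id):
    """Simplified proposed rule: only transactions that had already started."""
    return [t.sub_id for t in queue if t.sub_id <= largest_started_sub_id]


# Optimistic mode: one large GCO holding 1000 queued transactions,
# of which only the first 12 had started when STOP SLAVE arrived.
queue = [QueuedTxn(sub_id=i, gco=1) for i in range(1, 1001)]

old = applied_after_stop_gco(queue, current_gco=1)
new = applied_after_stop_sub_id(queue, largest_started_sub_id=12)
print(len(old), len(new))  # prints "1000 12"
```

In this model the GCO-based rule drains the whole queue before stopping, while the sub_id-based rule only lets the already-started transactions finish.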

Kristian Nielsen added a comment - 2023-06-10 20:52

I implemented a much simpler fix for this in the branch knielsen_faster_stop_slave:

https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave

Kristian Nielsen added a comment - 2023-06-11 15:51

And a better fix for MDEV-31448 than the one included in Brandon's patch for this issue:

https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave
https://github.com/MariaDB/server/commit/6ce9c839997c1fc78c2989540dd21155a96fb419

Kristian Nielsen added a comment - 2023-06-21 21:28 - edited

I have now pushed another patch to the branch knielsen_faster_stop_slave:

https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave
https://github.com/MariaDB/server/commit/c20ce0dff404890df59f1fd305bd85be0ada86f8

This implements a STOP SLAVE FORCE option, which can be used to optionally force a quick slave stop by rolling back all active transactions on the next event or row operation, provided no non-transactional event groups are blocking it. I included the test cases from Brandon's pull request; they pass with this patch (after replacing STOP SLAVE with STOP SLAVE FORCE).

I think this is the way to go if we want this rollback functionality, as it leaves the option not to roll back a lot of work needlessly. I also think this is a new feature and should only go into a development version (11.1?). Comments welcome, of course.

The simple fix earlier on my branch (without forcing rollback) should be sufficient to solve the user's/customer's problem. I think that can go into 10.5 if we want (or even 10.4; it should be safe). Some of the other stop-related bug fixes on the branch may also be appropriate for an earlier branch; suggestions welcome.

Kristian Nielsen added a comment - 2023-07-12 08:15

I have pushed the fix for slow STOP SLAVE to 10.4 (the STOP SLAVE FORCE feature can go to a development branch later if we wish).
So I suggest we close this bug?

- Kristian.
