[MDEV-31448] Killing a replica thread awaiting its GCO can hang/crash a parallel replica - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL)
Fix Version/s: 10.4.31
Component/s: Replication
Labels:
- parallelslave

Description

Killing a replica thread awaiting its GCO can hang/crash a parallel replica

If any transactions have started their commit phase while a replica thread that is waiting on its GCO to start is killed, the parallel slave will hang (non-debug) or crash in an assertion error (debug) if the actively committing transactions error. This is because the killed replica thread will perform GCO cleanup on the previous GCO while it has not finished, leading to the same outcome as ~~MDEV-30780~~.

For example, on a replica with three worker threads, if we have three transactions, T1, T2, and T3, grouped into two GCOs as GCO1

{T1, T2}

and GCO2

{T3}

such that T1 is executing, T2 is ready and queued for group commit, and T3 is waiting for its GCO to start, if T3 is killed it will perform GCO cleanup on GCO1 even though T1 and T2 are still active. The following MTR test shows this:

--source include/master-slave.inc

--source include/have_innodb.inc

--source include/have_debug.inc

--source include/have_binlog_format_row.inc

--echo #

--echo # Initialize test data

--connection master

create table t1 (a int) engine=innodb;

create table t2 (a int) engine=innodb;

insert into t1 values (1);

--source include/save_master_gtid.inc

--connection slave

--source include/sync_with_master_gtid.inc

--source include/stop_slave.inc

--let $save_innodb_lock_wait_timeout= `SELECT @@global.innodb_lock_wait_timeout`

--let $save_transaction_retries= `SELECT @@global.slave_transaction_retries`

set @@global.slave_parallel_threads= 3;

set @@global.slave_parallel_mode= CONSERVATIVE;

set @@global.innodb_lock_wait_timeout= 2;

set @@global.slave_transaction_retries= 0;

BEGIN;

SELECT * FROM t1 WHERE a=1 FOR UPDATE;

--connection master

SET @old_dbug= @@SESSION.debug_dbug;

SET @@SESSION.debug_dbug="+d,binlog_force_commit_id";

# GCO 1

SET @commit_id= 10000;

# T1

update t1 set a=2 where a=1;

# T2

insert into t2 values (1);

# GCO 2

SET @commit_id= 10001;

# T3

insert into t1 values (3);

--connection slave

--source include/start_slave.inc

--let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Update_rows_log_event::find_row(-1)' and  command LIKE 'Slave_worker';

--source include/wait_condition.inc

--let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Waiting for prior transaction to commit%' and  command LIKE 'Slave_worker';

--source include/wait_condition.inc

--let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Waiting for prior transaction to start commit%' and  command LIKE 'Slave_worker';

--source include/wait_condition.inc

--let $t3_tid= `SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE STATE LIKE 'Waiting for prior transaction to start commit'`

--eval kill $t3_tid

--echo #

--echo # Cleanup

--connection master

DROP TABLE t1, t2;

--source include/save_master_gtid.inc

--connection slave

--source include/sync_with_master_gtid.inc

--source include/rpl_end.inc

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mdev-31448_kill_ooo_finish_optimistic_me.test
2 kB
2023-06-13 09:37

Issue Links

relates to

MDEV-30780 parallel slave hangs after hit an error

Closed

Activity

Ascending order - Click to sort in descending order

Kristian Nielsen added a comment - 2023-06-11 15:49

The fix for this (in the ~~MDEV-13915~~ patch) is not correct.
The real problem seems to be an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). This unmark can run after the following GCO has started, and then a later mark_start_commit() will access freed GCO.

I have implemented what I think is a correct fix here, in my knielsen_faster_stop_slave branch:

https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave
https://github.com/MariaDB/server/commit/6ce9c839997c1fc78c2989540dd21155a96fb419

Kristian Nielsen added a comment - 2023-06-11 15:49 The fix for this (in the MDEV-13915 patch) is not correct. The real problem seems to be an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). This unmark can run after the following GCO has started, and then a later mark_start_commit() will access freed GCO. I have implemented what I think is a correct fix here, in my knielsen_faster_stop_slave branch: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave https://github.com/MariaDB/server/commit/6ce9c839997c1fc78c2989540dd21155a96fb419

Andrei Elkin added a comment - 2023-06-13 09:54 - edited

knielsen, a test that catches a gco assert in finish_event_group is attached.
The test is to run on the bare 10.5. (The description test covers a similar issue in the conservative mode; I am finishing its reviewing shortly).
As a summary (let me place it here; I am repeating that on github pages as well) from the attached one,
an immediate cause of the killed worker behavior is its unlinking from the commit parent, very much similarly to ~~MDEV-30780~~.
~~MDEV-13915~~ fixes that to return from wait_for_prior_commit 's killed-but-no-commit-started branch with the parent link intact (as of now, June 15, this is improved to cover all kill target phases).
And the 2nd protection from OOO is a force_wait flag that the killed T2 (of the test) worker employs inside finish_event_group().

Andrei Elkin added a comment - 2023-06-13 09:54 - edited knielsen , a test that catches a gco assert in finish_event_group is attached. The test is to run on the bare 10.5. (The description test covers a similar issue in the conservative mode; I am finishing its reviewing shortly). As a summary (let me place it here; I am repeating that on github pages as well) from the attached one, an immediate cause of the killed worker behavior is its unlinking from the commit parent, very much similarly to MDEV-30780 . MDEV-13915 fixes that to return from wait_for_prior_commit 's killed-but-no-commit-started branch with the parent link intact ( as of now, June 15, this is improved to cover all kill target phases ). And the 2nd protection from OOO is a force_wait flag that the killed T2 (of the test) worker employs inside finish_event_group() .

Andrei Elkin added a comment - 2023-06-15 11:16

A fixes commit is pushed to bb-10.5-andrei.

Andrei Elkin added a comment - 2023-06-15 11:16 A fixes commit is pushed to bb-10.5-andrei.

Kristian Nielsen added a comment - 2023-06-15 20:41

As we discussed on Zulip, there are two different approaches here to select between.

1. Force each event group to wait for the prior event group to commit or fail, even in case of kill. This is the approach that Andrei's patch is taking.

2. Allow a killed wait_for_prior_commit() to abort the thread out-of-order. Fix finish_event_group() so that it works correctly in the error case where it may complete out-of-order with earlier transactions. I have pushed a patch for this to my branch knielsen_faster_stop_slave: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave which passes the various tests.

I think both approaches have merit, still thinking to decide which I prefer.

Kristian Nielsen added a comment - 2023-06-15 20:41 As we discussed on Zulip, there are two different approaches here to select between. 1. Force each event group to wait for the prior event group to commit or fail, even in case of kill. This is the approach that Andrei's patch is taking. 2. Allow a killed wait_for_prior_commit() to abort the thread out-of-order. Fix finish_event_group() so that it works correctly in the error case where it may complete out-of-order with earlier transactions. I have pushed a patch for this to my branch knielsen_faster_stop_slave: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave which passes the various tests. I think both approaches have merit, still thinking to decide which I prefer.

Kristian Nielsen added a comment - 2023-07-12 08:05

Pushed to 10.4

Kristian Nielsen added a comment - 2023-07-12 08:05 Pushed to 10.4

MariaDB Server

Killing a replica thread awaiting its GCO can hang/crash a parallel replica

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration