Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Versions: 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL)
Description
Killing a replica thread awaiting its GCO can hang/crash a parallel replica
If a replica worker thread that is waiting for its GCO to start is killed while other transactions have already started their commit phase, and those committing transactions then fail with an error, the parallel replica hangs (non-debug builds) or crashes on an assertion (debug builds). The killed thread performs GCO cleanup on the previous GCO before that GCO has finished, which leads to the same outcome as MDEV-30780.
For example, take a replica with three worker threads and three transactions, T1, T2, and T3, grouped into two GCOs: GCO1 {T1, T2} and GCO2 {T3}. Suppose T1 is executing, T2 is ready and queued for group commit, and T3 is waiting for its GCO to start. If T3 is killed, it performs GCO cleanup on GCO1 even though T1 and T2 are still active. The following MTR test shows this:
--source include/master-slave.inc
--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_binlog_format_row.inc

--echo #
--echo # Initialize test data
--connection master
create table t1 (a int) engine=innodb;
create table t2 (a int) engine=innodb;
insert into t1 values (1);
--source include/save_master_gtid.inc

--connection slave
--source include/sync_with_master_gtid.inc
--source include/stop_slave.inc
--let $save_innodb_lock_wait_timeout= `SELECT @@global.innodb_lock_wait_timeout`
--let $save_transaction_retries= `SELECT @@global.slave_transaction_retries`
set @@global.slave_parallel_threads= 3;
set @@global.slave_parallel_mode= CONSERVATIVE;
set @@global.innodb_lock_wait_timeout= 2;
set @@global.slave_transaction_retries= 0;
# Lock the row that T1 updates so that T1 blocks on the replica
BEGIN;
SELECT * FROM t1 WHERE a=1 FOR UPDATE;

--connection master
SET @old_dbug= @@SESSION.debug_dbug;
SET @@SESSION.debug_dbug="+d,binlog_force_commit_id";

# GCO 1
SET @commit_id= 10000;
# T1
update t1 set a=2 where a=1;
# T2
insert into t2 values (1);

# GCO 2
SET @commit_id= 10001;
# T3
insert into t1 values (3);

--connection slave
--source include/start_slave.inc

# Wait until T1 is blocked on the locked row
--let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Update_rows_log_event::find_row(-1)' and command LIKE 'Slave_worker';
--source include/wait_condition.inc
# Wait until T2 is queued behind T1
--let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Waiting for prior transaction to commit%' and command LIKE 'Slave_worker';
--source include/wait_condition.inc
# Wait until T3 is waiting for its GCO to start
--let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Waiting for prior transaction to start commit%' and command LIKE 'Slave_worker';
--source include/wait_condition.inc

# Kill T3 while T1 and T2 are still active
--let $t3_tid= `SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE STATE LIKE 'Waiting for prior transaction to start commit'`
--eval kill $t3_tid

--echo #
--echo # Cleanup
--connection master
DROP TABLE t1, t2;
--source include/save_master_gtid.inc

--connection slave
--source include/sync_with_master_gtid.inc

--source include/rpl_end.inc
Issue Links
- relates to MDEV-30780: parallel slave hangs after hit an error (Closed)