[MDEV-8196] Stopping slave IO thread between a row event and COMMIT causes data inconsistency between master and slave Created: 2015-05-21  Updated: 2022-09-08

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 5.5, 10.0, 10.1
Fix Version/s: 10.1

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Unresolved Votes: 1
Labels: None


 Description   

Note: the test below is extremely ugly, it was made up just to confirm the theory. The test is all about races, so it's quite likely it won't work on some machines.

The test generates a binary log with a bunch of tiny row events, BEGINs and COMMITs. It looks like this:

...
# at 74263
#150521  1:49:28 server id 1  end_log_pos 74301         GTID 0-1-407
/*!100001 SET @@session.gtid_seq_no=407*//*!*/;
BEGIN
/*!*/;
# at 74301
# at 74342
#150521  1:49:28 server id 1  end_log_pos 74342         Table_map: `test`.`t1` mapped to number 19
#150521  1:49:28 server id 1  end_log_pos 74376         Write_rows: table id 19 flags: STMT_END_F
### INSERT INTO `test`.`t1`
### SET
###   @1=405
# at 74376
#150521  1:49:28 server id 1  end_log_pos 74445         Query   thread_id=7     exec_time=0     error_code=0
SET TIMESTAMP=1432162168/*!*/;
COMMIT
/*!*/;
# at 74445
#150521  1:49:28 server id 1  end_log_pos 74483         GTID 0-1-408
/*!100001 SET @@session.gtid_seq_no=408*//*!*/;
BEGIN
/*!*/;
# at 74483
# at 74524
#150521  1:49:28 server id 1  end_log_pos 74524         Table_map: `test`.`t1` mapped to number 19
#150521  1:49:28 server id 1  end_log_pos 74558         Write_rows: table id 19 flags: STMT_END_F
### INSERT INTO `test`.`t1`
### SET
###   @1=406
# at 74558
#150521  1:49:28 server id 1  end_log_pos 74627         Query   thread_id=7     exec_time=0     error_code=0
SET TIMESTAMP=1432162168/*!*/;
COMMIT
/*!*/;
...

Also, the test stops IO thread at a random moment; waits till the SQL thread executes whatever was relayed to it; stops it also; and restarts both threads.

The IO thread can stop at different points in this binary log. When it happens before BEGIN, or after BEGIN, or after COMMIT, it's all fine. The problem starts when the IO thread stops after the row event, but before COMMIT, e.g. at position 74558 in the snippet above.

Then the SQL thread executes the event, but does not properly finish it because there is no COMMIT. Then everything stops, and when the threads are restarted, the event gets executed again, naturally causing data inconsistency and/or replication abort.

In non-parallel replication this situation is accompanied by a long list of complaints in the slave error log. It throws multiple warnings

Slave SQL: Request to stop slave SQL Thread received while applying a group that has non-transactional changes; waiting for completion of the group

.. then the fatal error

[ERROR] Slave SQL: Fatal error: ... Slave SQL Thread stopped with incomplete event group having non-transactional changes. If the group consists solely of row-based events, you can try to restart the slave with --slave-exec-mode=IDEMPOTENT, which ignores duplicate key, key not found, and similar errors (see documentation for details). Gtid 0-1-758, Internal MariaDB error code: 1593

... then yet another error

[ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
150521  1:46:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 138145

... and only after replication restart, it ends up with the mentioned replication abort and/or data inconsistency.

So, even though nothing good comes out of this, the server's reputation is at least protected by the disclaimers in the error log.

With the parallel replication, it's different. No warnings about incomplete groups or errors about relay log are raised; the slave jumps directly to the replication restart and further abort and/or data inconsistency.

So, I see two problems here:

  • legacy one: maybe the IO thread should not be allowed to stop in such places?
  • parallel replication one: if it cannot handle the situation any better than the traditional replication does, then it should produce the same warnings and errors.

Ugly test case

# To be run with 
# --repeat=<many> --force-restart --nocheck-testcases
 
--source include/master-slave.inc
--source include/have_binlog_format_row.inc
 
--connection slave
--source include/stop_slave_sql.inc
 
--connection master
create table t1 (i int primary key) engine=MyISAM;
--delimiter |
create procedure pr() 
begin
  declare i int default 1;
  while i <= 1000
  do
    insert into t1 values (i);
    set i = i+1;
  end while;
end |
--delimiter ;
 
send call pr();
 
--connection slave
start slave;
--sleep 0.5
stop slave io_thread;
 
#--cat_file $MYSQLTEST_VARDIR/mysqld.2/mysqld.log
 
--connection master
--reap
 
--connection slave
 
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
--source include/show_slave_status.inc
 
sleep 10;
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
--source include/show_slave_status.inc
select max(i) from t1;
stop slave;
sleep 1;
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
--source include/show_slave_status.inc
start slave sql_thread, io_thread;
sleep 5;
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
--source include/show_slave_status.inc
select count(*) from t1;
if (`select count(*) != 1000 from t1`)
{
	--die "Error!"
}
 



 Comments   
Comment by Phil Sweeney [ 2016-03-01 ]

Just experienced this in production (running MariaDB 10.1.12) on a controlled shutdown:

2016-03-01 4:00:03 139962264959744 [Note] Slave I/O thread exiting, read up to log 'log-bin.002820', position 6460292
2016-03-01 4:00:03 139962264656640 [ERROR] Slave SQL: Fatal error: ... Slave SQL Thread stopped with incomplete event group having non-transactional changes. If the group consists solely of row-based events, you can try to restart the slave with --slave-exec-mode=IDEMPOTENT, which ignores duplicate key, key not found, and similar errors (see documentation for details). Gtid 0-10-514024699, Internal MariaDB error code: 1593
2016-03-01 4:00:03 139962264656640 [Note] Slave SQL thread exiting, replication stopped in log 'log-bin.002745' at position 47634064

When the service finishing starting up again (as part of automated process), it immediately got this SQL error:

2016-03-01 4:00:11 140569318804224 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table <table name>; Can't find record in '<table name>', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log-bin.002745, end_log_pos 47635199, Gtid 0-10-514024699, Internal MariaDB error code: 1032

Skipped 1 statement and it ran OK after that.

Generated at Thu Feb 08 07:25:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.