[MDEV-4725] GTID strict mode does not allow slave to continue replicating after crash which happened during writing event group Created: 2013-06-27  Updated: 2013-11-21  Resolved: 2013-11-21

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.7

Type: Bug
Priority: Minor
Reporter: Elena Stepanova
Assignee: Kristian Nielsen
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Relates: relates to MDEV-26 Global transaction ID (Closed)

 Description   

If a slave crashes while writing an event group into its binary log, specifically after writing the GTID event but before finishing the Xid event, and gtid_strict_mode is enabled, the slave cannot resume replication after restart. It aborts with the error "An attempt was made to binlog GTID X-Y-Z which would create an out-of-order sequence number with existing GTID X-Y-Z, and gtid strict mode is enabled".
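To illustrate why the restarted slave rejects its own event group, here is a minimal Python sketch of a strict-mode ordering check. It is a simplified model, not MariaDB's source code: the names `check_and_record`, `OutOfOrderGtid` and the dictionary-based binlog state are hypothetical, and the state is tracked per domain only.

```python
from typing import Dict, Tuple

# A GTID modelled as (domain_id, server_id, seq_no); illustrative only.
Gtid = Tuple[int, int, int]


class OutOfOrderGtid(Exception):
    """Raised when strict mode refuses a GTID (hypothetical name)."""


def check_and_record(binlog_state: Dict[int, Gtid], gtid: Gtid, strict: bool) -> None:
    """Record `gtid` in the binlog state, enforcing ordering in strict mode.

    Assumption: strict mode refuses any seq_no that is not greater than
    the last seq_no already recorded for the same domain.
    """
    domain_id, _server_id, seq_no = gtid
    last = binlog_state.get(domain_id)
    if strict and last is not None and seq_no <= last[2]:
        raise OutOfOrderGtid(
            f"GTID {gtid} is out of order with existing GTID {last}"
        )
    binlog_state[domain_id] = gtid
```

In this model, if crash recovery has already put the GTID of the half-written group into the state, for example `{0: (0, 1, 3)}`, then re-applying the same group after restart calls `check_and_record(state, (0, 1, 3), strict=True)` and raises, which corresponds to the "X-Y-Z ... with existing GTID X-Y-Z" error quoted above.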

Note: This is a follow-up on the failure that we discussed earlier on IRC. I can now positively confirm that the failure I observed also happened upon slave crash. As you suggested, I was able to reproduce it with "crash_before_writing_xid" (see the test case below). In my case the picture was slightly different: the slave crashed one step earlier, right after writing the GTID event and before writing anything else. I tried adding a debug crash point there and it causes the same problem, so I suppose crash_before_writing_xid will do just as well.

Test case:

--source include/master-slave.inc
--source include/have_innodb.inc
 
create table t1 (i int) engine=InnoDB;
insert into t1 values (1),(2);
 
--sync_slave_with_master
 
--source include/stop_slave.inc
set sql_log_bin = 0;
alter table mysql.gtid_slave_pos engine=InnoDB;
change master to master_use_gtid=current_pos;
--source include/start_slave.inc
 
--let $_server_id= `SELECT @@server_id`
--let $_expect_file_name= $MYSQLTEST_VARDIR/tmp/mysqld.$_server_id.expect
--write_file $_expect_file_name
wait
EOF
 
SET GLOBAL debug_dbug="+d,crash_before_writing_xid";
 
connection master;
 
insert into t1 values (3),(4);
 
connection slave;
 
--source include/wait_until_disconnected.inc
 
--append_file $_expect_file_name
restart: --gtid_strict_mode=1
EOF
 
--enable_reconnect
--source include/wait_until_connected_again.inc
 
show variables like 'gtid%';
 
# I intentionally don't use the include file here,
# because start_slave.inc is non-deterministic when a problem occurs on startup.
# If something goes wrong, the next sync_slave_with_master will indicate that.
 
start slave;
 
connection master;
drop table t1;
 
--sync_slave_with_master
query_vertical show slave status;

bzr version-info

revision-id: sergii@pisem.net-20130624185655-3ysky07m0gvet6gl
revno: 3669
branch-nick: 10.0-base



 Comments   
Comment by Elena Stepanova [ 2013-06-28 ]

Forgot to quote your analysis from IRC:

"the bug seems to be the following: We crash in the middle of an event group. Then during crash recovery, we scan the binlog to collect all logged GTIDs in the crashed binlog, to put in GTID_LIST in the next log. But the code does not correctly handle that a partial event group should not be used"

Comment by Kristian Nielsen [ 2013-11-21 ]

Pushed to 10.0-base.

Make sure that we only recover binlog state from fully written event groups in the binlog, not from any partial group written at the end of the log just before crashing.
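A rough Python sketch of the idea behind the fix, under stated assumptions: events are modelled as `(kind, gtid)` tuples, and an event group counts as complete only once its `"XID"` event is seen. The function name `recover_binlog_state` and the event model are hypothetical; the real server code also has to handle groups that end differently (for example a COMMIT query event or standalone DDL).

```python
from typing import Dict, Iterable, Optional, Tuple

# A GTID modelled as (domain_id, server_id, seq_no); illustrative only.
Gtid = Tuple[int, int, int]
# An event is ("GTID", gtid), ("QUERY", None) or ("XID", None).
Event = Tuple[str, Optional[Gtid]]


def recover_binlog_state(events: Iterable[Event]) -> Dict[int, Gtid]:
    """Collect the last fully written GTID per domain from a crashed binlog.

    A GTID is only added to the state once the group it opens has been
    closed by an XID event, so a partial group at the end of the log
    is ignored.
    """
    state: Dict[int, Gtid] = {}
    pending: Optional[Gtid] = None
    for kind, gtid in events:
        if kind == "GTID":
            # Group opened, but not yet known to be complete.
            pending = gtid
        elif kind == "XID" and pending is not None:
            # Group finished: its GTID is now safe to record.
            state[pending[0]] = pending
            pending = None
    # Any GTID still pending here belongs to a partial group and is dropped.
    return state
```

With a log that ends in `("GTID", (0, 1, 2)), ("QUERY", None)` and no closing XID, this sketch returns only the GTID of the previous complete group, so the slave can binlog `0-1-2` again after restart without tripping the strict-mode check.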
