[MDEV-6608] Slave loses master binlog filename when master crashes in the middle of writing an event group - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Fix
Affects Version/s: 10.0.13
Fix Version/s: N/A
Component/s: OTHER
Labels:
- replication

Description

If the master crashes in the middle of writing an event group to the binlog,
the slave can receive this partial event group once the master is restarted.

The slave code is written to handle this by recognising the master restart on
the format description event logged at the restart (it has a special flag only
set for the first format description logged after restart). The slave will
rollback the partial transaction and drop all temporary tables.

But there is a bug in the update of master binlog position. The update of the
current master filename normally happens during processing of the Rotate
events. However, due to the crash, no Rotate event occurs at the end of the
master-bin.000001. And the fake Rotate event sent from the master at reconnect
after restart is not processed, because Rotate events in the middle of an
event group are not processed.

The result is that the slave ends up in a state where it has processed events
in master-bin.000002, but the filename part of the replication position
(eg. SHOW SLAVE STATUS) is still master-bin.000001.

I checked MySQL, it seems they avoid this problem during binlog recovery at
master restart after a crash. If they find a partial event group at the end of
the last master binlog, they truncate the file to before that event.

Here is an MTR test case. The test case fails by --sync-with-master not
completing because of the incorrect binlog position. This test case uses
parallel replication, but I tested that the same bug is present for
non-parallel replication.

--source include/have_innodb.inc

--source include/have_binlog_format_statement.inc

--let $rpl_topology=1->2

--source include/rpl_init.inc

--connection server_2

--source include/stop_slave.inc

CHANGE MASTER TO master_use_gtid= current_pos;

--source include/start_slave.inc

--connection server_1

CREATE TABLE t1 (a INT PRIMARY KEY, b VARCHAR(100) CHARACTER SET utf8);

--save_master_pos

--connection server_2

--sync_with_master

--source include/stop_slave.inc

--connection server_1

SET gtid_domain_id= 1;

DELETE FROM t1;

ALTER TABLE t1 ENGINE=InnoDB;

CREATE TEMPORARY TABLE t2 (a INT PRIMARY KEY);

INSERT INTO t2 VALUES (1);

INSERT INTO t2 VALUES (2);

--connection default

SET gtid_domain_id= 2;

CREATE TEMPORARY TABLE t3 (a INT PRIMARY KEY);

INSERT INTO t3 VALUES (10);

INSERT INTO t3 VALUES (20);

--connection server_1

INSERT INTO t1 SELECT a, 'server_1' FROM t2;

--connection default

INSERT INTO t1 SELECT a, 'default' FROM t3;

--connection server_1

INSERT INTO t1 SELECT a+2, '+server_1' FROM t2;

# Crash the master server in the middle of writing an event group.

--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect

wait

EOF

FLUSH TABLES;

SET SESSION debug_dbug="+d,crash_before_writing_xid";

--error 2006,2013

INSERT INTO t1 SELECT a+4, '++server_1' FROM t2;

--source include/wait_until_disconnected.inc

--connection default

--source include/wait_until_disconnected.inc

--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect

restart

EOF

--connection default

--enable_reconnect

--source include/wait_until_connected_again.inc

--connection server_1

--enable_reconnect

--source include/wait_until_connected_again.inc

INSERT INTO t1 VALUES (0, 1);

--save_master_pos

SHOW BINLOG EVENTS IN 'master-bin.000001';

SHOW BINLOG EVENTS IN 'master-bin.000002';

--connection server_2

# Start the slave replicating the events.

# The main thing to test here is that the slave will know that it

# needs to abort the partially received event group, so that the

# execution of format_description event will not wait infinitely

# for a commit of the incomplete group that never happens.

--source include/start_slave.inc

--sync_with_master

SELECT * FROM t1 ORDER BY a;

SHOW STATUS LIKE 'Slave_open_temp_tables';

--connection server_1

DROP TABLE t1;

--source include/rpl_end.inc

Attachments

Activity

People

Assignee:: Andrei Elkin

Reporter:: Kristian Nielsen

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2014-08-19 14:53

Updated:: 2022-09-12 10:14

Resolved:: 2022-09-12 10:14

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server