MariaDB Server / MDEV-4605

GTID slave_pos gets corrupted after slave crash recovery, replication starts from the beginning


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.0.3
    • Component/s: None
    • Labels: None

      Description

      This is the problem we discussed earlier on IRC; I also promised to describe how to reproduce it, so here it is.

      Scenario:

      • I have a master => slave setup;
      • the master executes INSERTs / DELETEs on InnoDB tables, non-concurrently (1 thread), and writes the binlog;
      • the slave does not write a binlog;
      • every 30 seconds the slave is killed (with SIGKILL) and restarted;
      • at some point, after a slave restart, it resumes replicating not from the previous position but from the "zero" point (0-1-1).

      The test executes additional checks between the main steps:

      • restarts the slave with skip-slave-start and checks gtid* values before configuring the slave;
      • executes CHANGE MASTER ... master_use_gtid=current_pos and checks gtid* values again;
      • executes START SLAVE and checks gtid* values again, and also verifies that slave threads reached 'running: Yes' status;
      • checks gtid* values every 10 seconds;
      • on every gtid* values check, it verifies that the current seq_no in mysql.gtid_slave_pos is not less than the last stored seq_no. If it's less, the checker exits with STATUS_REPLICATION_FAILURE (see output below).
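The monotonicity check in the last step can be sketched roughly like this. It is a minimal illustration only, not the actual GtidCrashSafety reporter (which is RQG Perl code); the function names are made up for the sketch, and the GTID string format assumed is MariaDB's domain-server_id-seq_no.

```python
# Illustrative sketch of the seq_no monotonicity check the reporter performs.
# Not the real GtidCrashSafety code; names here are hypothetical.

def seq_no(gtid):
    """Extract the seq_no component from a 'domain-server_id-seq_no' GTID."""
    return int(gtid.split('-')[2])

def check_position(current_gtid, last_stored_seq_no):
    """Return True if the slave position did not move backwards."""
    current = seq_no(current_gtid)
    if current < last_stored_seq_no:
        print("ERROR: current maximum seq_no %d is less than "
              "the last stored seq_no %d" % (current, last_stored_seq_no))
        return False
    return True

# Normal cycle: the position advances from 240 to 244.
assert check_position("0-1-244", 240)
# Erroneous startup from the log below: 0-1-1 after 244 was stored.
assert not check_position("0-1-1", 244)
```

In the failing run below, this is exactly the comparison that fires: seq_no 1 against the last stored seq_no 244.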

      The logging shows that the position reset usually happens after START SLAVE. After server startup and before START SLAVE, the gtid*_pos variables are empty, while mysql.gtid_slave_pos contains the correct value. After START SLAVE, gtid_current_pos, gtid_slave_pos and the table all get updated to contain the wrong position.
      Sometimes it takes two test runs to hit the problem. If it doesn't reproduce for you but you still want to run it, I can set it up for you on perro; I have checked that it is reproducible there.

      To reproduce,

      • branch lp:~elenst/randgen/rqg-mariadb
      • cd rqg-mariadb
      • copy the grammar below into test.yy

        query:
          insert | insert | insert | insert | insert | insert | insert | delete ;

        insert:
          INSERT INTO _table ( _field[invariant] ) SELECT _field[invariant] FROM _table ;

        delete:
          DELETE FROM _table ;


      • in the command line below, set the correct basedir and vardir (vardir and vardir_slave will be created) and the rpl_mode of your choice, and run

        perl ./runall-new.pl --basedir=<basedir> --vardir=<server logs> --rpl_mode=mixed --grammar=test.yy --threads=1 --queries=100M --duration=600 --reporters=GtidCrashSafety --mysqld=--skip-federated --mysqld=--skip-archive

      With the given parameters, the test won't shut down the servers after it finishes; you need to do that manually, otherwise the next test run won't start.
      skip-archive and skip-federated have nothing to do with the test flow; they are just a workaround for the MAX_HA problem in the current 10.0 trees (to be fixed by the next merge from 5.5).

      That's how the typical log output looks:

      Normal cycle between intentional crashes (nothing bad happens):

      # 2013-05-31T20:15:47 [5520] Sending SIGKILL to server with pid 5745 in order to force crash recovery
      # 2013-05-31T20:15:49 [5520] Trying to restart the server ...
      # 2013-05-31T20:15:49 [5520] MySQL Version:10.0.1
      # 2013-05-31T20:15:49 [5520] Binary: /home/elenst/bzr/10.0-base//sql/mysqld
      # 2013-05-31T20:15:49 [5520] Type: Debug
      # 2013-05-31T20:15:49 [5520] Datadir: /data/test_results/gtid-18/analyze_slave/data
      # 2013-05-31T20:15:49 [5520] Corefile: /data/test_results/gtid-18/analyze_slave/data/core.5745
      # 2013-05-31T20:15:49 [5520] Starting MySQL 10.0.1: "/home/elenst/bzr/10.0-base//sql/mysqld" "--no-defaults" .....
      # 2013-05-31T20:15:51 [5520] Stale connection. Reconnecting
      # 2013-05-31T20:15:52 [5520] Slave GTID status after server restart:
      # 2013-05-31T20:15:52 [5520]   table mysql.gtid_slave_pos: 0:240:1:240
      # 2013-05-31T20:15:52 [5520]   gtid_current_pos: 
      # 2013-05-31T20:15:52 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:15:52 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:15:52 [5520] Server restart was apparently successfull, running CHANGE MASTER...
      # 2013-05-31T20:15:52 [5520] Slave GTID status after CHANGE MASTER:
      # 2013-05-31T20:15:52 [5520]   table mysql.gtid_slave_pos: 0:240:1:240
      # 2013-05-31T20:15:52 [5520]   gtid_current_pos: 
      # 2013-05-31T20:15:52 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:15:52 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:15:52 [5520] Starting replication...
      # 2013-05-31T20:15:53 [5520] Current replication status: IO thread Yes, SQL thread Yes
      # 2013-05-31T20:15:53 [5520] Slave GTID status after START SLAVE:
      # 2013-05-31T20:15:53 [5520]   table mysql.gtid_slave_pos: 0:240:1:240
      # 2013-05-31T20:15:53 [5520]   gtid_current_pos: 0-1-240
      # 2013-05-31T20:15:53 [5520]   gtid_slave_pos:   0-1-240
      # 2013-05-31T20:15:53 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:04 [5520] Slave GTID status:
      # 2013-05-31T20:16:04 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:04 [5520]   gtid_current_pos: 0-1-244
      # 2013-05-31T20:16:04 [5520]   gtid_slave_pos:   0-1-244
      # 2013-05-31T20:16:04 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:14 [5520] Slave GTID status:
      # 2013-05-31T20:16:14 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:14 [5520]   gtid_current_pos: 0-1-244
      # 2013-05-31T20:16:14 [5520]   gtid_slave_pos:   0-1-244
      # 2013-05-31T20:16:14 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:24 [5520] Slave GTID status:
      # 2013-05-31T20:16:24 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:24 [5520]   gtid_current_pos: 0-1-244
      # 2013-05-31T20:16:24 [5520]   gtid_slave_pos:   0-1-244
      # 2013-05-31T20:16:24 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:24 [5520] Sending SIGKILL to server with pid 5780 in order to force crash recovery

      Erroneous startup:

      # 2013-05-31T20:16:24 [5520] Sending SIGKILL to server with pid 5780 in order to force crash recovery
      # 2013-05-31T20:16:26 [5520] Trying to restart the server ...
      # 2013-05-31T20:16:26 [5520] MySQL Version:10.0.1
      # 2013-05-31T20:16:26 [5520] Binary: /home/elenst/bzr/10.0-base//sql/mysqld
      # 2013-05-31T20:16:26 [5520] Type: Debug
      # 2013-05-31T20:16:26 [5520] Datadir: /data/test_results/gtid-18/analyze_slave/data
      # 2013-05-31T20:16:26 [5520] Corefile: /data/test_results/gtid-18/analyze_slave/data/core.5780
      # 2013-05-31T20:16:26 [5520] Starting MySQL 10.0.1: "/home/elenst/bzr/10.0-base//sql/mysqld" "--no-defaults" ....
      # 2013-05-31T20:16:29 [5520] Stale connection. Reconnecting
      # 2013-05-31T20:16:29 [5520] Slave GTID status after server restart:
      # 2013-05-31T20:16:29 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:29 [5520]   gtid_current_pos: 
      # 2013-05-31T20:16:29 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:16:29 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:29 [5520] Server restart was apparently successfull, running CHANGE MASTER...
      # 2013-05-31T20:16:30 [5520] Slave GTID status after CHANGE MASTER:
      # 2013-05-31T20:16:30 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:30 [5520]   gtid_current_pos: 
      # 2013-05-31T20:16:30 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:16:30 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:30 [5520] Starting replication...
      # 2013-05-31T20:16:31 [5520] Current replication status: IO thread Yes, SQL thread Yes
      # 2013-05-31T20:16:31 [5520] Slave GTID status after START SLAVE:
      # 2013-05-31T20:16:31 [5520]   table mysql.gtid_slave_pos: 0:245:1:1
      # 2013-05-31T20:16:31 [5520]   gtid_current_pos: 0-1-1
      # 2013-05-31T20:16:31 [5520]   gtid_slave_pos:   0-1-1
      # 2013-05-31T20:16:31 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:31 [5520] ERROR: current maximum seq_no 1 is less than the last stored seq_no 244. 
      # 2013-05-31T20:16:31 [5520] Stopping replication, the test will continue without this reporter
      # 2013-05-31T20:16:33 [5520] Slave restart has failed with status STATUS_REPLICATION_FAILURE

      At this point the test continues running the flow on the master until the end of the configured duration, but you can interrupt it here.

      bzr version-info

      revision-id: knielsen@knielsen-hq.org-20130529122340-kadsw91xbmnmsg37
      revno: 3639
      branch-nick: 10.0-base

              People

              Assignee:
              knielsen Kristian Nielsen
              Reporter:
              elenst Elena Stepanova