MariaDB Server / MDEV-4605

GTID slave_pos gets corrupted after slave crash recovery, replication starts from the beginning


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 10.0.3

    Description

      This is the problem we discussed earlier on IRC, but I also promised to describe how to reproduce it, so I'll do it here.

      Scenario:

      • I have master => slave setup;
      • master executes INSERTs / DELETEs on InnoDB tables, non-concurrently (1 thread) and writes to the binlog;
      • slave does not write the binlog;
      • every 30 seconds the slave gets killed (with SIGKILL) and restarted;
      • at some point, after slave restart it resumes replicating not from the previous position, but from the "zero" point (0-1-1).

      The test executes additional checks between the main steps (a rough SQL sketch of these checks follows the list):

      • restarts the slave with skip-slave-start and checks gtid* values before configuring the slave;
      • executes CHANGE MASTER ... master_use_gtid=current_pos and checks gtid* values again;
      • executes START SLAVE and checks gtid* values again, and also verifies that slave threads reached 'running: Yes' status;
      • checks gtid* values every 10 seconds;
      • on every gtid* values check, it verifies that the current maximum seq_no in mysql.gtid_slave_pos is not less than the last stored seq_no; if it is less, the checker exits with STATUS_REPLICATION_FAILURE (see the output below).
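
      For reference, this is roughly the SQL behind each of these checks (a sketch based on the description above, not the actual GtidCrashSafety reporter code):

        -- gtid* values, as seen through the system variables and the table
        SELECT @@GLOBAL.gtid_current_pos, @@GLOBAL.gtid_slave_pos, @@GLOBAL.gtid_binlog_pos;
        SELECT domain_id, sub_id, server_id, seq_no FROM mysql.gtid_slave_pos;

        -- configuring the slave and starting replication
        CHANGE MASTER TO master_use_gtid = current_pos;
        START SLAVE;
        SHOW SLAVE STATUS;  -- Slave_IO_Running / Slave_SQL_Running should both say Yes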

      The logging shows that the position reset usually happens after START SLAVE. After server startup and before START SLAVE the gtid*_pos variables are empty, while mysql.gtid_slave_pos contains the correct value. After START SLAVE, gtid_current_pos, gtid_slave_pos and the table get updated to contain the wrong position.
      Sometimes it takes two test runs to hit the problem. If it doesn't work for you but you still want to run it, I can set it up for you on perro; I checked that it is reproducible there.

      To reproduce,

      • branch lp:~elenst/randgen/rqg-mariadb
      • cd rqg-mariadb
      • copy the grammar below into test.yy (sample statements it generates are shown after these steps)

        query:
          insert | insert | insert | insert | insert | insert | insert | delete ;

        insert:
          INSERT INTO _table ( _field[invariant] ) SELECT _field[invariant] FROM _table ;

        delete:
          DELETE FROM _table ;
         

      • in the command line below set the correct basedir and vardir (vardir and vardir_slave will be created) and an rpl_mode of your choice, then run

        perl ./runall-new.pl --basedir=<basedir> --vardir=<server logs> --rpl_mode=mixed --grammar=test.yy --threads=1 --queries=100M --duration=600 --reporters=GtidCrashSafety --mysqld=--skip-federated --mysqld=--skip-archive

      With the given parameters, the test won't shut down the servers after it finishes; you need to do that manually, otherwise the next run won't start.
      skip-archive and skip-federated have nothing to do with the test flow; they are just a workaround for the MAX_HA problem in the current 10.0 trees (to be fixed by the next merge from 5.5).
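
      For illustration, the statements this grammar generates look like the following; the actual table and column names are chosen by RQG from its test schema, so t1 and col1 here are only placeholders:

        INSERT INTO t1 ( col1 ) SELECT col1 FROM t1;
        DELETE FROM t1;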

      This is what a typical log output looks like:

      Normal cycle between intentional crashes (nothing bad happens):

      # 2013-05-31T20:15:47 [5520] Sending SIGKILL to server with pid 5745 in order to force crash recovery
      # 2013-05-31T20:15:49 [5520] Trying to restart the server ...
      # 2013-05-31T20:15:49 [5520] MySQL Version:10.0.1
      # 2013-05-31T20:15:49 [5520] Binary: /home/elenst/bzr/10.0-base//sql/mysqld
      # 2013-05-31T20:15:49 [5520] Type: Debug
      # 2013-05-31T20:15:49 [5520] Datadir: /data/test_results/gtid-18/analyze_slave/data
      # 2013-05-31T20:15:49 [5520] Corefile: /data/test_results/gtid-18/analyze_slave/data/core.5745
      # 2013-05-31T20:15:49 [5520] Starting MySQL 10.0.1: "/home/elenst/bzr/10.0-base//sql/mysqld" "--no-defaults" .....
      # 2013-05-31T20:15:51 [5520] Stale connection. Reconnecting
      # 2013-05-31T20:15:52 [5520] Slave GTID status after server restart:
      # 2013-05-31T20:15:52 [5520]   table mysql.gtid_slave_pos: 0:240:1:240
      # 2013-05-31T20:15:52 [5520]   gtid_current_pos: 
      # 2013-05-31T20:15:52 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:15:52 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:15:52 [5520] Server restart was apparently successfull, running CHANGE MASTER...
      # 2013-05-31T20:15:52 [5520] Slave GTID status after CHANGE MASTER:
      # 2013-05-31T20:15:52 [5520]   table mysql.gtid_slave_pos: 0:240:1:240
      # 2013-05-31T20:15:52 [5520]   gtid_current_pos: 
      # 2013-05-31T20:15:52 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:15:52 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:15:52 [5520] Starting replication...
      # 2013-05-31T20:15:53 [5520] Current replication status: IO thread Yes, SQL thread Yes
      # 2013-05-31T20:15:53 [5520] Slave GTID status after START SLAVE:
      # 2013-05-31T20:15:53 [5520]   table mysql.gtid_slave_pos: 0:240:1:240
      # 2013-05-31T20:15:53 [5520]   gtid_current_pos: 0-1-240
      # 2013-05-31T20:15:53 [5520]   gtid_slave_pos:   0-1-240
      # 2013-05-31T20:15:53 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:04 [5520] Slave GTID status:
      # 2013-05-31T20:16:04 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:04 [5520]   gtid_current_pos: 0-1-244
      # 2013-05-31T20:16:04 [5520]   gtid_slave_pos:   0-1-244
      # 2013-05-31T20:16:04 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:14 [5520] Slave GTID status:
      # 2013-05-31T20:16:14 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:14 [5520]   gtid_current_pos: 0-1-244
      # 2013-05-31T20:16:14 [5520]   gtid_slave_pos:   0-1-244
      # 2013-05-31T20:16:14 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:24 [5520] Slave GTID status:
      # 2013-05-31T20:16:24 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:24 [5520]   gtid_current_pos: 0-1-244
      # 2013-05-31T20:16:24 [5520]   gtid_slave_pos:   0-1-244
      # 2013-05-31T20:16:24 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:24 [5520] Sending SIGKILL to server with pid 5780 in order to force crash recovery
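
      The "table mysql.gtid_slave_pos" lines in the log are the domain_id, sub_id, server_id and seq_no columns of the table joined with colons; a query along these lines produces the same format (an assumption about how the reporter renders the row, not taken from its code):

        SELECT CONCAT_WS(':', domain_id, sub_id, server_id, seq_no) FROM mysql.gtid_slave_pos;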

      Erroneous startup:

      # 2013-05-31T20:16:24 [5520] Sending SIGKILL to server with pid 5780 in order to force crash recovery
      # 2013-05-31T20:16:26 [5520] Trying to restart the server ...
      # 2013-05-31T20:16:26 [5520] MySQL Version:10.0.1
      # 2013-05-31T20:16:26 [5520] Binary: /home/elenst/bzr/10.0-base//sql/mysqld
      # 2013-05-31T20:16:26 [5520] Type: Debug
      # 2013-05-31T20:16:26 [5520] Datadir: /data/test_results/gtid-18/analyze_slave/data
      # 2013-05-31T20:16:26 [5520] Corefile: /data/test_results/gtid-18/analyze_slave/data/core.5780
      # 2013-05-31T20:16:26 [5520] Starting MySQL 10.0.1: "/home/elenst/bzr/10.0-base//sql/mysqld" "--no-defaults" ....
      # 2013-05-31T20:16:29 [5520] Stale connection. Reconnecting
      # 2013-05-31T20:16:29 [5520] Slave GTID status after server restart:
      # 2013-05-31T20:16:29 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:29 [5520]   gtid_current_pos: 
      # 2013-05-31T20:16:29 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:16:29 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:29 [5520] Server restart was apparently successfull, running CHANGE MASTER...
      # 2013-05-31T20:16:30 [5520] Slave GTID status after CHANGE MASTER:
      # 2013-05-31T20:16:30 [5520]   table mysql.gtid_slave_pos: 0:244:1:244
      # 2013-05-31T20:16:30 [5520]   gtid_current_pos: 
      # 2013-05-31T20:16:30 [5520]   gtid_slave_pos:   
      # 2013-05-31T20:16:30 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:30 [5520] Starting replication...
      # 2013-05-31T20:16:31 [5520] Current replication status: IO thread Yes, SQL thread Yes
      # 2013-05-31T20:16:31 [5520] Slave GTID status after START SLAVE:
      # 2013-05-31T20:16:31 [5520]   table mysql.gtid_slave_pos: 0:245:1:1
      # 2013-05-31T20:16:31 [5520]   gtid_current_pos: 0-1-1
      # 2013-05-31T20:16:31 [5520]   gtid_slave_pos:   0-1-1
      # 2013-05-31T20:16:31 [5520]   gtid_binlog_pos:  
      # 2013-05-31T20:16:31 [5520] ERROR: current maximum seq_no 1 is less than the last stored seq_no 244. 
      # 2013-05-31T20:16:31 [5520] Stopping replication, the test will continue without this reporter
      # 2013-05-31T20:16:33 [5520] Slave restart has failed with status STATUS_REPLICATION_FAILURE

      At this point the test continues running the flow on the master until the end of the configured duration, but you can interrupt it here.
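
      The failing check boils down to this comparison (a sketch of the GtidCrashSafety check described earlier, using the values from the log above, not the reporter's actual code):

        SET @last_stored_seq_no := 244;            -- remembered from the previous successful check
        SELECT MAX(seq_no) < @last_stored_seq_no   -- here 1 < 244, so the reporter reports failure
          FROM mysql.gtid_slave_pos;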

      bzr version-info

      revision-id: knielsen@knielsen-hq.org-20130529122340-kadsw91xbmnmsg37
      revno: 3639
      branch-nick: 10.0-base


            People

              Assignee: Kristian Nielsen (knielsen)
              Reporter: Elena Stepanova (elenst)
