Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8196

Stopping slave IO thread between a row event and COMMIT causes data inconsistency between master and slave

    XMLWordPrintable

Details

    • Can result in data loss

    Description

      Note: the test below is extremely ugly, it was made up just to confirm the theory. The test is all about races, so it's quite likely it won't work on some machines.

      The test generates a binary log with a bunch of tiny row events, BEGINs and COMMITs. It looks like this:

      ...
      # at 74263
      #150521  1:49:28 server id 1  end_log_pos 74301         GTID 0-1-407
      /*!100001 SET @@session.gtid_seq_no=407*//*!*/;
      BEGIN
      /*!*/;
      # at 74301
      # at 74342
      #150521  1:49:28 server id 1  end_log_pos 74342         Table_map: `test`.`t1` mapped to number 19
      #150521  1:49:28 server id 1  end_log_pos 74376         Write_rows: table id 19 flags: STMT_END_F
      ### INSERT INTO `test`.`t1`
      ### SET
      ###   @1=405
      # at 74376
      #150521  1:49:28 server id 1  end_log_pos 74445         Query   thread_id=7     exec_time=0     error_code=0
      SET TIMESTAMP=1432162168/*!*/;
      COMMIT
      /*!*/;
      # at 74445
      #150521  1:49:28 server id 1  end_log_pos 74483         GTID 0-1-408
      /*!100001 SET @@session.gtid_seq_no=408*//*!*/;
      BEGIN
      /*!*/;
      # at 74483
      # at 74524
      #150521  1:49:28 server id 1  end_log_pos 74524         Table_map: `test`.`t1` mapped to number 19
      #150521  1:49:28 server id 1  end_log_pos 74558         Write_rows: table id 19 flags: STMT_END_F
      ### INSERT INTO `test`.`t1`
      ### SET
      ###   @1=406
      # at 74558
      #150521  1:49:28 server id 1  end_log_pos 74627         Query   thread_id=7     exec_time=0     error_code=0
      SET TIMESTAMP=1432162168/*!*/;
      COMMIT
      /*!*/;
      ...

      Also, the test stops IO thread at a random moment; waits till the SQL thread executes whatever was relayed to it; stops it also; and restarts both threads.

      The IO thread can stop at different points in this binary log. When it happens before BEGIN, or after BEGIN, or after COMMIT, it's all fine. The problem starts when the IO thread stops after the row event, but before COMMIT, e.g. at position 74558 in the snippet above.

      Then the SQL thread executes the event, but does not properly finish it because there is no COMMIT. Then everything stops, and when the threads are restarted, the event gets executed again, naturally causing data inconsistency and/or replication abort.

      In non-parallel replication this situation is accompanied by a long list of complaints in the slave error log. It throws multiple warnings

      Slave SQL: Request to stop slave SQL Thread received while applying a group that has non-transactional changes; waiting for completion of the group

      .. then the fatal error

      [ERROR] Slave SQL: Fatal error: ... Slave SQL Thread stopped with incomplete event group having non-transactional changes. If the group consists solely of row-based events, you can try to restart the slave with --slave-exec-mode=IDEMPOTENT, which ignores duplicate key, key not found, and similar errors (see documentation for details). Gtid 0-1-758, Internal MariaDB error code: 1593

      ... then yet another error

      [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      150521  1:46:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 138145

      ... and only after replication restart, it ends up with the mentioned replication abort and/or data inconsistency.

      So, even though nothing good comes out of this, the server's reputation is at least protected by the disclaimers in the error log.

      With the parallel replication, it's different. No warnings about incomplete groups or errors about relay log are raised; the slave jumps directly to the replication restart and further abort and/or data inconsistency.

      So, I see two problems here:

      • legacy one: maybe the IO thread should not be allowed to stop in such places?
      • parallel replication one: if it cannot handle the situation any better than the traditional replication does, then it should produce the same warnings and errors.

      Ugly test case

      # To be run with 
      # --repeat=<many> --force-restart --nocheck-testcases
       
      --source include/master-slave.inc
      --source include/have_binlog_format_row.inc
       
      --connection slave
      --source include/stop_slave_sql.inc
       
      --connection master
      create table t1 (i int primary key) engine=MyISAM;
      --delimiter |
      create procedure pr() 
      begin
        declare i int default 1;
        while i <= 1000
        do
          insert into t1 values (i);
          set i = i+1;
        end while;
      end |
      --delimiter ;
       
      send call pr();
       
      --connection slave
      start slave;
      --sleep 0.5
      stop slave io_thread;
       
      #--cat_file $MYSQLTEST_VARDIR/mysqld.2/mysqld.log
       
      --connection master
      --reap
       
      --connection slave
       
      --let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
      --source include/show_slave_status.inc
       
      sleep 10;
      --let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
      --source include/show_slave_status.inc
      select max(i) from t1;
      stop slave;
      sleep 1;
      --let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
      --source include/show_slave_status.inc
      start slave sql_thread, io_thread;
      sleep 5;
      --let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
      --source include/show_slave_status.inc
      select count(*) from t1;
      if (`select count(*) != 1000 from t1`)
      {
      	--die "Error!"
      }
       

      Attachments

        Activity

          People

            knielsen Kristian Nielsen
            elenst Elena Stepanova
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.