Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7432

STOP SLAVE in the middle of event group can cause rollback of transaction mixing transactional and non-transactional DML, causing slave corruption

Details

    Description

      http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/6554/steps/test/logs/stdio

      rpl.rpl_parallel 'innodb_plugin,row'     w3 [ fail ]
              Test ended at 2015-01-06 23:50:40
       
      CURRENT_TEST: rpl.rpl_parallel
      mysqltest: In included file "./include/wait_for_slave_param.inc": 
      included from ./include/wait_for_slave_sql_to_start.inc at line 32:
      included from ./include/wait_for_slave_to_start.inc at line 27:
      included from ./include/start_slave.inc at line 35:
      included from /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/suite/rpl/t/rpl_parallel.test at line 344:
      At line 115: Timeout in include/wait_for_slave_param.inc
       
      The result from queries just before the failure was:
      < snip >
      'group_commit_waiting_for_prior SIGNAL slave_queued3',
      ''))
      master-bin.000002	1417	Xid	1	1444	COMMIT /* xid=358 */
      master-bin.000002	1444	Gtid	1	1482	BEGIN GTID 0-1-15
      master-bin.000002	1482	Query	1	1571	use `test`; INSERT INTO t2 VALUES (20)
      master-bin.000002	1571	Query	1	1660	use `test`; INSERT INTO t1 VALUES (20)
      master-bin.000002	1660	Query	1	1749	use `test`; INSERT INTO t2 VALUES (21)
      master-bin.000002	1749	Query	1	1842	use `test`; INSERT INTO t3 VALUES (20, 20)
      master-bin.000002	1842	Xid	1	1869	COMMIT /* xid=421 */
      master-bin.000002	1869	Gtid	1	1907	BEGIN GTID 0-1-16
      master-bin.000002	1907	Query	1	1999	use `test`; INSERT INTO t3 VALUES(21, 21)
      master-bin.000002	1999	Xid	1	2026	COMMIT /* xid=426 */
      master-bin.000002	2026	Gtid	1	2064	BEGIN GTID 0-1-17
      master-bin.000002	2064	Query	1	2156	use `test`; INSERT INTO t3 VALUES(22, 22)
      master-bin.000002	2156	Xid	1	2183	COMMIT /* xid=427 */
       
      **** SHOW RELAYLOG EVENTS on server_1 ****
      relaylog_name = 'No such row'
      SHOW RELAYLOG EVENTS IN 'No such row';
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
       
      More results from queries before failure can be found in /dev/shm/var_auto_lwsQ/3/log/rpl_parallel.log
       
       - saving '/dev/shm/var_auto_lwsQ/3/log/rpl.rpl_parallel-innodb_plugin,row/' to '/var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/var/log/rpl.rpl_parallel-innodb_plugin,row/'
       
      Retrying test rpl.rpl_parallel, attempt(2/3)...
       
      worker[3] > Restart  - not started
      worker[3] > Restart  - not started
      ***Warnings generated in error logs during shutdown after running tests: rpl.rpl_gtid_startpos rpl.rpl_commit_after_flush rpl.rpl_gtid_errorhandling rpl.rpl_foreign_key_innodb rpl.rpl_old_master rpl.rpl_bug26395 rpl.rpl_gtid_mdev4484 rpl.rpl_failed_optimize rpl.rpl_dual_pos_advance rpl.rpl_gtid_ignored rpl.rpl_insert_id_pk rpl.rpl_gtid_reconnect rpl.rpl_insert_ignore rpl.rpl_insert_id rpl.rpl_parallel rpl.rpl_multi_engine rpl.rpl_gtid_until rpl.rpl_mysql_upgrade rpl.rpl_checksum_cache
       
      150106 23:44:13 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it accesses a non-transactional table after accessing a transactional table within the same transaction. Statement: INSERT INTO t1 VALUES (20)
      150106 23:44:15 [Warning] Slave: Duplicate entry '20' for key 'PRIMARY' Error_code: 1062

      Attachments

        Issue Links

          Activity

            Failure can be reproduced by adding this sleep in the code:

            === modified file 'sql/rpl_parallel.cc'
            --- sql/rpl_parallel.cc	2015-01-07 13:45:39 +0000
            +++ sql/rpl_parallel.cc	2015-01-14 07:22:39 +0000
            @@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s
               unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread,
                                   &did_enter_cond, &old_stage);
               mysql_cond_signal(&cur_thread->COND_rpl_thread);
            +my_sleep(100000);
             
               return 0;
             }
             

            knielsen Kristian Nielsen added a comment - Failure can be reproduced by adding this sleep in the code: === modified file 'sql/rpl_parallel.cc' --- sql/rpl_parallel.cc 2015-01-07 13:45:39 +0000 +++ sql/rpl_parallel.cc 2015-01-14 07:22:39 +0000 @@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread, &did_enter_cond, &old_stage); mysql_cond_signal(&cur_thread->COND_rpl_thread); +my_sleep(100000); return 0; }  

            So this looks like a real bug in the code.

            When STOP SLAVE is run, a flag is set.

            When a worker thread sees the stop flag set, it checks if it has any more
            events queued up for it. If not, it force rolls back any current transaction.
            Since after the stop flag is set, no more events will be queued for worker
            threads.

            However, if the transaction (event group, really) contains a mix of
            transactional and non-transactional DML, then this will corrupt the slave
            state. Because the transactional part will be rolled back, but the
            non-transactional part will remain.

            So this means that STOP SLAVE is currently not safe for parallel replication
            if there are event groups containing multiple non-transactional DML events.

            To fix, I think a timeout mechanism needs to be implemented in the SQL driver
            thread, similar to the non-parallel case. At STOP SLAVE, continue queuing
            events into the relay log and into worker threads until either the current
            event group has been completely queued, or until the stop timeout has elapsed.

            This is however complicated by the fact that in the SQL driver thread, we
            cannot really know if an event group contains non-transactional events. In the
            non-parallel case, we know this, and can apply the timeout only if
            non-transactional events are present (making rollback in the middle of the
            event group unsafe).

            In the parallel case, seems we would need to apply the timeout always, or in
            10.1 we could look at the "transactional" flag in the GTID event (but then we
            will still be unsafe if replicating from InnoDB on the master to MyISAM on the
            slave, something however that in general is not really safe to do for parallel
            replication).

            knielsen Kristian Nielsen added a comment - So this looks like a real bug in the code. When STOP SLAVE is run, a flag is set. When a worker thread sees the stop flag set, it checks if it has any more events queued up for it. If not, it force rolls back any current transaction. Since after the stop flag is set, no more events will be queued for worker threads. However, if the transaction (event group, really) contains a mix of transactional and non-transactional DML, then this will corrupt the slave state. Because the transactional part will be rolled back, but the non-transactional part will remain. So this means that STOP SLAVE is currently not safe for parallel replication if there are event groups containing multiple non-transactional DML events. To fix, I think a timeout mechanism needs to be implemented in the SQL driver thread, similar to the non-parallel case. At STOP SLAVE, continue queuing events into the relay log and into worker threads until either the current event group has been completely queued, or until the stop timeout has elapsed. This is however complicated by the fact that in the SQL driver thread, we cannot really know if an event group contains non-transactional events. In the non-parallel case, we know this, and can apply the timeout only if non-transactional events are present (making rollback in the middle of the event group unsafe). In the parallel case, seems we would need to apply the timeout always, or in 10.1 we could look at the "transactional" flag in the GTID event (but then we will still be unsafe if replicating from InnoDB on the master to MyISAM on the slave, something however that in general is not really safe to do for parallel replication).

            I don't think this should be closed as "wontfix". The problem described is real and a fix that tries to avoid aborting an event group in STOP SLAVE when it contains updates that cannot be rolled back, would be valuable.

            Not saying that this is necessarily high prio to fix at the moment.

            So re-opening, unless there's an important reason to not keep it open?

            knielsen Kristian Nielsen added a comment - I don't think this should be closed as "wontfix". The problem described is real and a fix that tries to avoid aborting an event group in STOP SLAVE when it contains updates that cannot be rolled back, would be valuable. Not saying that this is necessarily high prio to fix at the moment. So re-opening, unless there's an important reason to not keep it open?

            It was closed automatically (or semi-automatically) because all listed affected/fix versions already went EOL.
            If it's still relevant, the best way to keep it open and on someone's radar would be by changing affected/fix versions to currently active versions.

            elenst Elena Stepanova added a comment - It was closed automatically (or semi-automatically) because all listed affected/fix versions already went EOL. If it's still relevant, the best way to keep it open and on someone's radar would be by changing affected/fix versions to currently active versions.

            Thanks for the information Elena, I've updated affects/fix versions.

            knielsen Kristian Nielsen added a comment - Thanks for the information Elena, I've updated affects/fix versions.

            People

              knielsen Kristian Nielsen
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.