[MDEV-7432] STOP SLAVE in the middle of event group can cause rollback of transaction mixing transactional and non-transactional DML, causing slave corruption Created: 2015-01-11  Updated: 2023-11-29  Resolved: 2023-11-29

Status: Closed
Project: MariaDB Server
Component/s: Tests
Affects Version/s: 10.0, 10.1
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Won't Fix Votes: 0
Labels: buildbot, tests

Issue Links:
Relates
relates to MDEV-7069 Fix buildbot failures in main server ... Stalled

 Description   

http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/6554/steps/test/logs/stdio

rpl.rpl_parallel 'innodb_plugin,row'     w3 [ fail ]
        Test ended at 2015-01-06 23:50:40
 
CURRENT_TEST: rpl.rpl_parallel
mysqltest: In included file "./include/wait_for_slave_param.inc": 
included from ./include/wait_for_slave_sql_to_start.inc at line 32:
included from ./include/wait_for_slave_to_start.inc at line 27:
included from ./include/start_slave.inc at line 35:
included from /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/suite/rpl/t/rpl_parallel.test at line 344:
At line 115: Timeout in include/wait_for_slave_param.inc
 
The result from queries just before the failure was:
< snip >
'group_commit_waiting_for_prior SIGNAL slave_queued3',
''))
master-bin.000002	1417	Xid	1	1444	COMMIT /* xid=358 */
master-bin.000002	1444	Gtid	1	1482	BEGIN GTID 0-1-15
master-bin.000002	1482	Query	1	1571	use `test`; INSERT INTO t2 VALUES (20)
master-bin.000002	1571	Query	1	1660	use `test`; INSERT INTO t1 VALUES (20)
master-bin.000002	1660	Query	1	1749	use `test`; INSERT INTO t2 VALUES (21)
master-bin.000002	1749	Query	1	1842	use `test`; INSERT INTO t3 VALUES (20, 20)
master-bin.000002	1842	Xid	1	1869	COMMIT /* xid=421 */
master-bin.000002	1869	Gtid	1	1907	BEGIN GTID 0-1-16
master-bin.000002	1907	Query	1	1999	use `test`; INSERT INTO t3 VALUES(21, 21)
master-bin.000002	1999	Xid	1	2026	COMMIT /* xid=426 */
master-bin.000002	2026	Gtid	1	2064	BEGIN GTID 0-1-17
master-bin.000002	2064	Query	1	2156	use `test`; INSERT INTO t3 VALUES(22, 22)
master-bin.000002	2156	Xid	1	2183	COMMIT /* xid=427 */
 
**** SHOW RELAYLOG EVENTS on server_1 ****
relaylog_name = 'No such row'
SHOW RELAYLOG EVENTS IN 'No such row';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
 
More results from queries before failure can be found in /dev/shm/var_auto_lwsQ/3/log/rpl_parallel.log
 
 - saving '/dev/shm/var_auto_lwsQ/3/log/rpl.rpl_parallel-innodb_plugin,row/' to '/var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/var/log/rpl.rpl_parallel-innodb_plugin,row/'
 
Retrying test rpl.rpl_parallel, attempt(2/3)...
 
worker[3] > Restart  - not started
worker[3] > Restart  - not started
***Warnings generated in error logs during shutdown after running tests: rpl.rpl_gtid_startpos rpl.rpl_commit_after_flush rpl.rpl_gtid_errorhandling rpl.rpl_foreign_key_innodb rpl.rpl_old_master rpl.rpl_bug26395 rpl.rpl_gtid_mdev4484 rpl.rpl_failed_optimize rpl.rpl_dual_pos_advance rpl.rpl_gtid_ignored rpl.rpl_insert_id_pk rpl.rpl_gtid_reconnect rpl.rpl_insert_ignore rpl.rpl_insert_id rpl.rpl_parallel rpl.rpl_multi_engine rpl.rpl_gtid_until rpl.rpl_mysql_upgrade rpl.rpl_checksum_cache
 
150106 23:44:13 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it accesses a non-transactional table after accessing a transactional table within the same transaction. Statement: INSERT INTO t1 VALUES (20)
150106 23:44:15 [Warning] Slave: Duplicate entry '20' for key 'PRIMARY' Error_code: 1062



 Comments   
Comment by Kristian Nielsen [ 2015-01-14 ]

Failure can be reproduced by adding this sleep in the code:

=== modified file 'sql/rpl_parallel.cc'
--- sql/rpl_parallel.cc	2015-01-07 13:45:39 +0000
+++ sql/rpl_parallel.cc	2015-01-14 07:22:39 +0000
@@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s
   unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread,
                       &did_enter_cond, &old_stage);
   mysql_cond_signal(&cur_thread->COND_rpl_thread);
+my_sleep(100000);
 
   return 0;
 }
 

Comment by Kristian Nielsen [ 2015-01-14 ]

So this looks like a real bug in the code.

When STOP SLAVE is run, a flag is set.

When a worker thread sees the stop flag set, it checks if it has any more
events queued up for it. If not, it force rolls back any current transaction.
Since after the stop flag is set, no more events will be queued for worker
threads.

However, if the transaction (event group, really) contains a mix of
transactional and non-transactional DML, then this will corrupt the slave
state. Because the transactional part will be rolled back, but the
non-transactional part will remain.

So this means that STOP SLAVE is currently not safe for parallel replication
if there are event groups containing multiple non-transactional DML events.

To fix, I think a timeout mechanism needs to be implemented in the SQL driver
thread, similar to the non-parallel case. At STOP SLAVE, continue queuing
events into the relay log and into worker threads until either the current
event group has been completely queued, or until the stop timeout has elapsed.

This is however complicated by the fact that in the SQL driver thread, we
cannot really know if an event group contains non-transactional events. In the
non-parallel case, we know this, and can apply the timeout only if
non-transactional events are present (making rollback in the middle of the
event group unsafe).

In the parallel case, seems we would need to apply the timeout always, or in
10.1 we could look at the "transactional" flag in the GTID event (but then we
will still be unsafe if replicating from InnoDB on the master to MyISAM on the
slave, something however that in general is not really safe to do for parallel
replication).

Generated at Thu Feb 08 07:19:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.