[MDEV-7432] STOP SLAVE in the middle of event group can cause rollback of transaction mixing transactional and non-transactional DML, causing slave corruption - Jira

Details

Type: Bug
Status: Confirmed (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.0(EOL), 10.4(EOL), 10.6, 10.11, 11.4
Fix Version/s: 11.8
Component/s: Tests
Labels:

Description

http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/6554/steps/test/logs/stdio

rpl.rpl_parallel 'innodb_plugin,row'     w3 [ fail ]

        Test ended at 2015-01-06 23:50:40

CURRENT_TEST: rpl.rpl_parallel

mysqltest: In included file "./include/wait_for_slave_param.inc":

included from ./include/wait_for_slave_sql_to_start.inc at line 32:

included from ./include/wait_for_slave_to_start.inc at line 27:

included from ./include/start_slave.inc at line 35:

included from /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/suite/rpl/t/rpl_parallel.test at line 344:

At line 115: Timeout in include/wait_for_slave_param.inc

The result from queries just before the failure was:

< snip >

'group_commit_waiting_for_prior SIGNAL slave_queued3',

''))

master-bin.000002	1417	Xid	1	1444	COMMIT /* xid=358 */

master-bin.000002	1444	Gtid	1	1482	BEGIN GTID 0-1-15

master-bin.000002	1482	Query	1	1571	use `test`; INSERT INTO t2 VALUES (20)

master-bin.000002	1571	Query	1	1660	use `test`; INSERT INTO t1 VALUES (20)

master-bin.000002	1660	Query	1	1749	use `test`; INSERT INTO t2 VALUES (21)

master-bin.000002	1749	Query	1	1842	use `test`; INSERT INTO t3 VALUES (20, 20)

master-bin.000002	1842	Xid	1	1869	COMMIT /* xid=421 */

master-bin.000002	1869	Gtid	1	1907	BEGIN GTID 0-1-16

master-bin.000002	1907	Query	1	1999	use `test`; INSERT INTO t3 VALUES(21, 21)

master-bin.000002	1999	Xid	1	2026	COMMIT /* xid=426 */

master-bin.000002	2026	Gtid	1	2064	BEGIN GTID 0-1-17

master-bin.000002	2064	Query	1	2156	use `test`; INSERT INTO t3 VALUES(22, 22)

master-bin.000002	2156	Xid	1	2183	COMMIT /* xid=427 */

**** SHOW RELAYLOG EVENTS on server_1 ****

relaylog_name = 'No such row'

SHOW RELAYLOG EVENTS IN 'No such row';

Log_name	Pos	Event_type	Server_id	End_log_pos	Info

More results from queries before failure can be found in /dev/shm/var_auto_lwsQ/3/log/rpl_parallel.log

 - saving '/dev/shm/var_auto_lwsQ/3/log/rpl.rpl_parallel-innodb_plugin,row/' to '/var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/var/log/rpl.rpl_parallel-innodb_plugin,row/'

Retrying test rpl.rpl_parallel, attempt(2/3)...

worker[3] > Restart  - not started

worker[3] > Restart  - not started

***Warnings generated in error logs during shutdown after running tests: rpl.rpl_gtid_startpos rpl.rpl_commit_after_flush rpl.rpl_gtid_errorhandling rpl.rpl_foreign_key_innodb rpl.rpl_old_master rpl.rpl_bug26395 rpl.rpl_gtid_mdev4484 rpl.rpl_failed_optimize rpl.rpl_dual_pos_advance rpl.rpl_gtid_ignored rpl.rpl_insert_id_pk rpl.rpl_gtid_reconnect rpl.rpl_insert_ignore rpl.rpl_insert_id rpl.rpl_parallel rpl.rpl_multi_engine rpl.rpl_gtid_until rpl.rpl_mysql_upgrade rpl.rpl_checksum_cache

150106 23:44:13 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it accesses a non-transactional table after accessing a transactional table within the same transaction. Statement: INSERT INTO t1 VALUES (20)

150106 23:44:15 [Warning] Slave: Duplicate entry '20' for key 'PRIMARY' Error_code: 1062

Attachments

Issue Links

relates to

MDEV-7069 Fix buildbot failures in main server trees

Stalled

Activity

Ascending order - Click to sort in descending order

Kristian Nielsen added a comment - 2015-01-14 09:25

Failure can be reproduced by adding this sleep in the code:

=== modified file 'sql/rpl_parallel.cc'

--- sql/rpl_parallel.cc	2015-01-07 13:45:39 +0000

+++ sql/rpl_parallel.cc	2015-01-14 07:22:39 +0000

@@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s

   unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread,

                       &did_enter_cond, &old_stage);

   mysql_cond_signal(&cur_thread->COND_rpl_thread);

+my_sleep(100000);

   return 0;

Kristian Nielsen added a comment - 2015-01-14 09:25 Failure can be reproduced by adding this sleep in the code: === modified file 'sql/rpl_parallel.cc' --- sql/rpl_parallel.cc 2015-01-07 13:45:39 +0000 +++ sql/rpl_parallel.cc 2015-01-14 07:22:39 +0000 @@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread, &did_enter_cond, &old_stage); mysql_cond_signal(&cur_thread->COND_rpl_thread); +my_sleep(100000); return 0; }

Kristian Nielsen added a comment - 2015-01-14 09:56

So this looks like a real bug in the code.

When STOP SLAVE is run, a flag is set.

When a worker thread sees the stop flag set, it checks if it has any more
events queued up for it. If not, it force rolls back any current transaction.
Since after the stop flag is set, no more events will be queued for worker
threads.

However, if the transaction (event group, really) contains a mix of
transactional and non-transactional DML, then this will corrupt the slave
state. Because the transactional part will be rolled back, but the
non-transactional part will remain.

So this means that STOP SLAVE is currently not safe for parallel replication
if there are event groups containing multiple non-transactional DML events.

To fix, I think a timeout mechanism needs to be implemented in the SQL driver
thread, similar to the non-parallel case. At STOP SLAVE, continue queuing
events into the relay log and into worker threads until either the current
event group has been completely queued, or until the stop timeout has elapsed.

This is however complicated by the fact that in the SQL driver thread, we
cannot really know if an event group contains non-transactional events. In the
non-parallel case, we know this, and can apply the timeout only if
non-transactional events are present (making rollback in the middle of the
event group unsafe).

In the parallel case, seems we would need to apply the timeout always, or in
10.1 we could look at the "transactional" flag in the GTID event (but then we
will still be unsafe if replicating from InnoDB on the master to MyISAM on the
slave, something however that in general is not really safe to do for parallel
replication).

Kristian Nielsen added a comment - 2015-01-14 09:56 So this looks like a real bug in the code. When STOP SLAVE is run, a flag is set. When a worker thread sees the stop flag set, it checks if it has any more events queued up for it. If not, it force rolls back any current transaction. Since after the stop flag is set, no more events will be queued for worker threads. However, if the transaction (event group, really) contains a mix of transactional and non-transactional DML, then this will corrupt the slave state. Because the transactional part will be rolled back, but the non-transactional part will remain. So this means that STOP SLAVE is currently not safe for parallel replication if there are event groups containing multiple non-transactional DML events. To fix, I think a timeout mechanism needs to be implemented in the SQL driver thread, similar to the non-parallel case. At STOP SLAVE, continue queuing events into the relay log and into worker threads until either the current event group has been completely queued, or until the stop timeout has elapsed. This is however complicated by the fact that in the SQL driver thread, we cannot really know if an event group contains non-transactional events. In the non-parallel case, we know this, and can apply the timeout only if non-transactional events are present (making rollback in the middle of the event group unsafe). In the parallel case, seems we would need to apply the timeout always, or in 10.1 we could look at the "transactional" flag in the GTID event (but then we will still be unsafe if replicating from InnoDB on the master to MyISAM on the slave, something however that in general is not really safe to do for parallel replication).

Kristian Nielsen added a comment - 2024-03-15 13:21

I don't think this should be closed as "wontfix". The problem described is real and a fix that tries to avoid aborting an event group in STOP SLAVE when it contains updates that cannot be rolled back, would be valuable.

Not saying that this is necessarily high prio to fix at the moment.

So re-opening, unless there's an important reason to not keep it open?

Kristian Nielsen added a comment - 2024-03-15 13:21 I don't think this should be closed as "wontfix". The problem described is real and a fix that tries to avoid aborting an event group in STOP SLAVE when it contains updates that cannot be rolled back, would be valuable. Not saying that this is necessarily high prio to fix at the moment. So re-opening, unless there's an important reason to not keep it open?

Elena Stepanova added a comment - 2024-03-15 13:47

It was closed automatically (or semi-automatically) because all listed affected/fix versions already went EOL.
If it's still relevant, the best way to keep it open and on someone's radar would be by changing affected/fix versions to currently active versions.

Elena Stepanova added a comment - 2024-03-15 13:47 It was closed automatically (or semi-automatically) because all listed affected/fix versions already went EOL. If it's still relevant, the best way to keep it open and on someone's radar would be by changing affected/fix versions to currently active versions.

Kristian Nielsen added a comment - 2024-03-15 15:26

Thanks for the information Elena, I've updated affects/fix versions.

Kristian Nielsen added a comment - 2024-03-15 15:26 Thanks for the information Elena, I've updated affects/fix versions.

MariaDB Server

STOP SLAVE in the middle of event group can cause rollback of transaction mixing transactional and non-transactional DML, causing slave corruption

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration