Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
5.5(EOL), 10.0(EOL), 10.1(EOL)
-
Can result in data loss
Description
Note: the test below is extremely ugly, it was made up just to confirm the theory. The test is all about races, so it's quite likely it won't work on some machines.
The test generates a binary log with a bunch of tiny row events, BEGINs and COMMITs. It looks like this:
...
|
# at 74263
|
#150521 1:49:28 server id 1 end_log_pos 74301 GTID 0-1-407
|
/*!100001 SET @@session.gtid_seq_no=407*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 74301
|
# at 74342
|
#150521 1:49:28 server id 1 end_log_pos 74342 Table_map: `test`.`t1` mapped to number 19
|
#150521 1:49:28 server id 1 end_log_pos 74376 Write_rows: table id 19 flags: STMT_END_F
|
### INSERT INTO `test`.`t1`
|
### SET
|
### @1=405
|
# at 74376
|
#150521 1:49:28 server id 1 end_log_pos 74445 Query thread_id=7 exec_time=0 error_code=0
|
SET TIMESTAMP=1432162168/*!*/;
|
COMMIT
|
/*!*/;
|
# at 74445
|
#150521 1:49:28 server id 1 end_log_pos 74483 GTID 0-1-408
|
/*!100001 SET @@session.gtid_seq_no=408*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 74483
|
# at 74524
|
#150521 1:49:28 server id 1 end_log_pos 74524 Table_map: `test`.`t1` mapped to number 19
|
#150521 1:49:28 server id 1 end_log_pos 74558 Write_rows: table id 19 flags: STMT_END_F
|
### INSERT INTO `test`.`t1`
|
### SET
|
### @1=406
|
# at 74558
|
#150521 1:49:28 server id 1 end_log_pos 74627 Query thread_id=7 exec_time=0 error_code=0
|
SET TIMESTAMP=1432162168/*!*/;
|
COMMIT
|
/*!*/;
|
...
|
Also, the test stops IO thread at a random moment; waits till the SQL thread executes whatever was relayed to it; stops it also; and restarts both threads.
The IO thread can stop at different points in this binary log. When it happens before BEGIN, or after BEGIN, or after COMMIT, it's all fine. The problem starts when the IO thread stops after the row event, but before COMMIT, e.g. at position 74558 in the snippet above.
Then the SQL thread executes the event, but does not properly finish it because there is no COMMIT. Then everything stops, and when the threads are restarted, the event gets executed again, naturally causing data inconsistency and/or replication abort.
In non-parallel replication this situation is accompanied by a long list of complaints in the slave error log. It throws multiple warnings
Slave SQL: Request to stop slave SQL Thread received while applying a group that has non-transactional changes; waiting for completion of the group
|
.. then the fatal error
[ERROR] Slave SQL: Fatal error: ... Slave SQL Thread stopped with incomplete event group having non-transactional changes. If the group consists solely of row-based events, you can try to restart the slave with --slave-exec-mode=IDEMPOTENT, which ignores duplicate key, key not found, and similar errors (see documentation for details). Gtid 0-1-758, Internal MariaDB error code: 1593
|
... then yet another error
[ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
|
150521 1:46:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 138145
|
... and only after replication restart, it ends up with the mentioned replication abort and/or data inconsistency.
So, even though nothing good comes out of this, the server's reputation is at least protected by the disclaimers in the error log.
With the parallel replication, it's different. No warnings about incomplete groups or errors about relay log are raised; the slave jumps directly to the replication restart and further abort and/or data inconsistency.
So, I see two problems here:
- legacy one: maybe the IO thread should not be allowed to stop in such places?
- parallel replication one: if it cannot handle the situation any better than the traditional replication does, then it should produce the same warnings and errors.
|
Ugly test case |
# To be run with |
# --repeat=<many> --force-restart --nocheck-testcases |
|
|
--source include/master-slave.inc
|
--source include/have_binlog_format_row.inc
|
|
|
--connection slave
|
--source include/stop_slave_sql.inc
|
|
|
--connection master
|
create table t1 (i int primary key) engine=MyISAM; |
--delimiter |
|
create procedure pr() |
begin
|
declare i int default 1; |
while i <= 1000
|
do
|
insert into t1 values (i); |
set i = i+1; |
end while; |
end | |
--delimiter ;
|
|
|
send call pr();
|
|
|
--connection slave
|
start slave;
|
--sleep 0.5
|
stop slave io_thread;
|
|
|
#--cat_file $MYSQLTEST_VARDIR/mysqld.2/mysqld.log |
|
|
--connection master
|
--reap
|
|
|
--connection slave
|
|
|
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
|
--source include/show_slave_status.inc
|
|
|
sleep 10;
|
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
|
--source include/show_slave_status.inc
|
select max(i) from t1; |
stop slave;
|
sleep 1;
|
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
|
--source include/show_slave_status.inc
|
start slave sql_thread, io_thread;
|
sleep 5;
|
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Last_IO_Errno,Last_SQL_Errno
|
--source include/show_slave_status.inc
|
select count(*) from t1; |
if (`select count(*) != 1000 from t1`) |
{
|
--die "Error!" |
}
|
|
|