[MDEV-6551] Some replication errors are ignored if slave_parallel_threads > 0 Created: 2014-08-08  Updated: 2014-08-15  Resolved: 2014-08-15

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.12
Fix Version/s: 10.0.14

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: parallelslave

Issue Links:
Relates
relates to MDEV-6589 Incorrect relay log start position wh... Closed

 Description   

The problem is observed on binary logs attached to MDEV-6549.

  • put the attached set of binary logs into the master datadir;
  • start master server with log-bin=mysql-bin, other options can stay default;
  • start slave with default options;
  • start replication with MASTER_USE_GTID = current_pos (maybe GTID is not necessary in the scenario, I didn't try without it);
  • after a while, replication fails here:

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: root
                  Master_Port: 9306
                Connect_Retry: 1
              Master_Log_File: mysql-bin.002277
          Read_Master_Log_Pos: 326
               Relay_Log_File: wheezy-64-relay-bin.002025
                Relay_Log_Pos: 613
        Relay_Master_Log_File: mysql-bin.001011
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 326
              Relay_Log_Space: 8561745
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 0-1-11874

The failure might be caused by an old replication bug or an erroneous test, it is not the point here.
Trying to run start slave again does not help, it still aborts at the same position (naturally).
Then,

  • run set global slave_parallel_threads=1;
  • run start slave
  • it fails again, but note that the position has advanced:

    MariaDB [test]> show slave status \G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
                      Master_Host: 127.0.0.1
                      Master_User: root
                      Master_Port: 9306
                    Connect_Retry: 1
                  Master_Log_File: mysql-bin.002277
              Read_Master_Log_Pos: 326
                   Relay_Log_File: wheezy-64-relay-bin.002027
                    Relay_Log_Pos: 613
            Relay_Master_Log_File: mysql-bin.001012
                 Slave_IO_Running: Yes
                Slave_SQL_Running: No
                  Replicate_Do_DB: 
              Replicate_Ignore_DB: 
               Replicate_Do_Table: 
           Replicate_Ignore_Table: 
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 1062
                       Last_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 326
                  Relay_Log_Space: 8562815
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 1062
                   Last_SQL_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 1
                   Master_SSL_Crl: 
               Master_SSL_Crlpath: 
                       Using_Gtid: Current_Pos
                      Gtid_IO_Pos: 0-1-11874
    1 row in set (0.00 sec)

  • run start slave yet another time. Now, the replication goes on:

    MariaDB [test]> show slave status \G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
                      Master_Host: 127.0.0.1
                      Master_User: root
                      Master_Port: 9306
                    Connect_Retry: 1
                  Master_Log_File: mysql-bin.002277
              Read_Master_Log_Pos: 326
                   Relay_Log_File: wheezy-64-relay-bin.003535
                    Relay_Log_Pos: 2332
            Relay_Master_Log_File: mysql-bin.001728
                 Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                  Replicate_Do_DB: 
              Replicate_Ignore_DB: 
               Replicate_Do_Table: 
           Replicate_Ignore_Table: 
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 0
                       Last_Error: 
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 2045
                  Relay_Log_Space: 3094464
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: 397789
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 0
                   Last_SQL_Error: 
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 1
                   Master_SSL_Crl: 
               Master_SSL_Crlpath: 
                       Using_Gtid: Current_Pos
                      Gtid_IO_Pos: 0-1-11874

I don't think it's supposed to happen.

The binary logs were produced using 10.0 tree on Power 8 machine, and weren't anyhow tampered with.

The test is executed on a usual Debian machine, with this server:

revision-id: sergii@pisem.net-20140805163520-tz0w14h8uju0ti9n
revno: 4330
branch-nick: 10.0



 Comments   
Comment by Kristian Nielsen [ 2014-08-13 ]

I was able to reproduce as described. Looks like a real issue.

The difference in SHOW SLAVE STATUS after the first restart with parallel replication is the Relay_Master_Log_File:

< Relay_Master_Log_File: mysql-bin.001011

> Relay_Master_Log_File: mysql-bin.001012

I suspect this requires that the IO thread not be restarted to reproduce, which may explain why it wasn't caught before. I will investigate.

Comment by Kristian Nielsen [ 2014-08-14 ]

Here is an MTR test case to reproduce.

The bug seems to occur if:

  • The slave aborts with an error in parallel replication mode
  • The failing event group is the last in a master's binlog, and another later
    binlog exists
  • The slave SQL thread is restarted without restarting the IO thread
    in-between the failure and the restart.

I suspect the problem is that the problem is related to the Rotate_event after
the transaction that is executed out-of-band with the other transactions. I
guess it incorrectly updates the relay log position to after the error,
causing the problem.

But I will investigate more to know for sure.

--source include/have_innodb.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc
 
--connection server_2
--source include/stop_slave.inc
CHANGE MASTER TO master_use_gtid=slave_pos;
--source include/start_slave.inc
 
 
--echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***
 
--connection server_1
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
--source include/save_master_gtid.inc
 
--connection server_2
--source include/sync_with_master_gtid.inc
 
# Force a duplicate key error on the slave.
SET sql_log_bin= 0;
INSERT INTO t1 VALUES (2);
SET sql_log_bin= 1;
 
--connection server_1
INSERT INTO t1 VALUES (2);
# Rotate the binlog; the bug is triggered when the master binlog file changes
# after the event group that causes the duplicate key error.
FLUSH LOGS;
INSERT INTO t1 VALUES (3);
INSERT INTO t1 VALUES (4);
SELECT * FROM t1 ORDER BY a;
--source include/save_master_gtid.inc
 
--connection server_2
--let $slave_sql_errno= 1062
--let $slave_timeout= 10
--source include/wait_for_slave_sql_error.inc
 
--connection server_2
SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
--source include/stop_slave_io.inc
SET GLOBAL slave_parallel_threads=1;
START SLAVE;
 
--let $slave_sql_errno= 1062
--let $slave_timeout= 10
--source include/wait_for_slave_sql_error.inc
 
# Note: IO thread is still running at this point.
# The bug seems to have been that restarting the SQL thread after an error with
# the IO thread still running, somehow picks up a later relay log position and
# thus ends up skipping the failing event, rather than re-executing.
 
START SLAVE SQL_THREAD;
--let $slave_sql_errno= 1062
--let $slave_timeout= 10
--source include/wait_for_slave_sql_error.inc
 
SELECT * FROM t1 ORDER BY a;
 
# Skip the duplicate error, so we can proceed.
SET sql_slave_skip_counter= 1;
--source include/start_slave.inc
--source include/sync_with_master_gtid.inc
 
SELECT * FROM t1 ORDER BY a;
 
 
# Clean up.
--connection server_2
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=@old_parallel_threads;
--source include/start_slave.inc
 
--connection server_1
DROP TABLE t1;
 
--source include/rpl_end.inc

Comment by Kristian Nielsen [ 2014-08-15 ]

Pushed to 10.0.14.

The problem was that in parallel replication a Rotate_event could be executed
from the relay log even after an earlier event had failed and caused
replication to stop. This could cause the replication position to be advanced
too far ahead, so that events would be lost and replication become incorrect.

Now the Rotate_event is correctly skipped after an error occurs that cause the
SQL thread to stop.

While investigating this, I found a related bug MDEV-6589. That other bug
concerns a similar problem with the replication position advancing too far
ahead when restarting after error, but only concerns using parallel
replication with different replication domains. Since the cause, as well as
necessary fix, for that other bug is rather different, I filed it separately.

The bug described here, where there is only one replication domain used,
should be fixed now in current 10.0.14 code.

Generated at Thu Feb 08 07:12:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.