Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6551

Some replication errors are ignored if slave_parallel_threads > 0

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.12
    • 10.0.14
    • None

    Description

      The problem is observed on binary logs attached to MDEV-6549.

      • put the attached set of binary logs into the master datadir;
      • start master server with log-bin=mysql-bin, other options can stay default;
      • start slave with default options;
      • start replication with MASTER_USE_GTID = current_pos (maybe GTID is not necessary in the scenario, I didn't try without it);
      • after a while, replication fails here:

      *************************** 1. row ***************************
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 127.0.0.1
                        Master_User: root
                        Master_Port: 9306
                      Connect_Retry: 1
                    Master_Log_File: mysql-bin.002277
                Read_Master_Log_Pos: 326
                     Relay_Log_File: wheezy-64-relay-bin.002025
                      Relay_Log_Pos: 613
              Relay_Master_Log_File: mysql-bin.001011
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB: 
                Replicate_Ignore_DB: 
                 Replicate_Do_Table: 
             Replicate_Ignore_Table: 
            Replicate_Wild_Do_Table: 
        Replicate_Wild_Ignore_Table: 
                         Last_Errno: 1062
                         Last_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 326
                    Relay_Log_Space: 8561745
                    Until_Condition: None
                     Until_Log_File: 
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File: 
                 Master_SSL_CA_Path: 
                    Master_SSL_Cert: 
                  Master_SSL_Cipher: 
                     Master_SSL_Key: 
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error: 
                     Last_SQL_Errno: 1062
                     Last_SQL_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
        Replicate_Ignore_Server_Ids: 
                   Master_Server_Id: 1
                     Master_SSL_Crl: 
                 Master_SSL_Crlpath: 
                         Using_Gtid: Current_Pos
                        Gtid_IO_Pos: 0-1-11874

      The failure might be caused by an old replication bug or an erroneous test, it is not the point here.
      Trying to run start slave again does not help, it still aborts at the same position (naturally).
      Then,

      • run set global slave_parallel_threads=1;
      • run start slave
      • it fails again, but note that the position has advanced:

        MariaDB [test]> show slave status \G
        *************************** 1. row ***************************
                       Slave_IO_State: Waiting for master to send event
                          Master_Host: 127.0.0.1
                          Master_User: root
                          Master_Port: 9306
                        Connect_Retry: 1
                      Master_Log_File: mysql-bin.002277
                  Read_Master_Log_Pos: 326
                       Relay_Log_File: wheezy-64-relay-bin.002027
                        Relay_Log_Pos: 613
                Relay_Master_Log_File: mysql-bin.001012
                     Slave_IO_Running: Yes
                    Slave_SQL_Running: No
                      Replicate_Do_DB: 
                  Replicate_Ignore_DB: 
                   Replicate_Do_Table: 
               Replicate_Ignore_Table: 
              Replicate_Wild_Do_Table: 
          Replicate_Wild_Ignore_Table: 
                           Last_Errno: 1062
                           Last_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
                         Skip_Counter: 0
                  Exec_Master_Log_Pos: 326
                      Relay_Log_Space: 8562815
                      Until_Condition: None
                       Until_Log_File: 
                        Until_Log_Pos: 0
                   Master_SSL_Allowed: No
                   Master_SSL_CA_File: 
                   Master_SSL_CA_Path: 
                      Master_SSL_Cert: 
                    Master_SSL_Cipher: 
                       Master_SSL_Key: 
                Seconds_Behind_Master: NULL
        Master_SSL_Verify_Server_Cert: No
                        Last_IO_Errno: 0
                        Last_IO_Error: 
                       Last_SQL_Errno: 1062
                       Last_SQL_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c
          Replicate_Ignore_Server_Ids: 
                     Master_Server_Id: 1
                       Master_SSL_Crl: 
                   Master_SSL_Crlpath: 
                           Using_Gtid: Current_Pos
                          Gtid_IO_Pos: 0-1-11874
        1 row in set (0.00 sec)

      • run start slave yet another time. Now, the replication goes on:

        MariaDB [test]> show slave status \G
        *************************** 1. row ***************************
                       Slave_IO_State: Waiting for master to send event
                          Master_Host: 127.0.0.1
                          Master_User: root
                          Master_Port: 9306
                        Connect_Retry: 1
                      Master_Log_File: mysql-bin.002277
                  Read_Master_Log_Pos: 326
                       Relay_Log_File: wheezy-64-relay-bin.003535
                        Relay_Log_Pos: 2332
                Relay_Master_Log_File: mysql-bin.001728
                     Slave_IO_Running: Yes
                    Slave_SQL_Running: Yes
                      Replicate_Do_DB: 
                  Replicate_Ignore_DB: 
                   Replicate_Do_Table: 
               Replicate_Ignore_Table: 
              Replicate_Wild_Do_Table: 
          Replicate_Wild_Ignore_Table: 
                           Last_Errno: 0
                           Last_Error: 
                         Skip_Counter: 0
                  Exec_Master_Log_Pos: 2045
                      Relay_Log_Space: 3094464
                      Until_Condition: None
                       Until_Log_File: 
                        Until_Log_Pos: 0
                   Master_SSL_Allowed: No
                   Master_SSL_CA_File: 
                   Master_SSL_CA_Path: 
                      Master_SSL_Cert: 
                    Master_SSL_Cipher: 
                       Master_SSL_Key: 
                Seconds_Behind_Master: 397789
        Master_SSL_Verify_Server_Cert: No
                        Last_IO_Errno: 0
                        Last_IO_Error: 
                       Last_SQL_Errno: 0
                       Last_SQL_Error: 
          Replicate_Ignore_Server_Ids: 
                     Master_Server_Id: 1
                       Master_SSL_Crl: 
                   Master_SSL_Crlpath: 
                           Using_Gtid: Current_Pos
                          Gtid_IO_Pos: 0-1-11874

      I don't think it's supposed to happen.

      The binary logs were produced using 10.0 tree on Power 8 machine, and weren't anyhow tampered with.

      The test is executed on a usual Debian machine, with this server:

      revision-id: sergii@pisem.net-20140805163520-tz0w14h8uju0ti9n
      revno: 4330
      branch-nick: 10.0

      Attachments

        Issue Links

          Activity

            I was able to reproduce as described. Looks like a real issue.

            The difference in SHOW SLAVE STATUS after the first restart with parallel replication is the Relay_Master_Log_File:

            < Relay_Master_Log_File: mysql-bin.001011
            —
            > Relay_Master_Log_File: mysql-bin.001012

            I suspect this requires that the IO thread not be restarted to reproduce, which may explain why it wasn't caught before. I will investigate.

            knielsen Kristian Nielsen added a comment - I was able to reproduce as described. Looks like a real issue. The difference in SHOW SLAVE STATUS after the first restart with parallel replication is the Relay_Master_Log_File: < Relay_Master_Log_File: mysql-bin.001011 — > Relay_Master_Log_File: mysql-bin.001012 I suspect this requires that the IO thread not be restarted to reproduce, which may explain why it wasn't caught before. I will investigate.
            knielsen Kristian Nielsen added a comment - - edited

            Here is an MTR test case to reproduce.

            The bug seems to occur if:

            • The slave aborts with an error in parallel replication mode
            • The failing event group is the last in a master's binlog, and another later
              binlog exists
            • The slave SQL thread is restarted without restarting the IO thread
              in-between the failure and the restart.

            I suspect the problem is that the problem is related to the Rotate_event after
            the transaction that is executed out-of-band with the other transactions. I
            guess it incorrectly updates the relay log position to after the error,
            causing the problem.

            But I will investigate more to know for sure.

            --source include/have_innodb.inc
            --let $rpl_topology=1->2
            --source include/rpl_init.inc
             
            --connection server_2
            --source include/stop_slave.inc
            CHANGE MASTER TO master_use_gtid=slave_pos;
            --source include/start_slave.inc
             
             
            --echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***
             
            --connection server_1
            ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
            CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (1);
            --source include/save_master_gtid.inc
             
            --connection server_2
            --source include/sync_with_master_gtid.inc
             
            # Force a duplicate key error on the slave.
            SET sql_log_bin= 0;
            INSERT INTO t1 VALUES (2);
            SET sql_log_bin= 1;
             
            --connection server_1
            INSERT INTO t1 VALUES (2);
            # Rotate the binlog; the bug is triggered when the master binlog file changes
            # after the event group that causes the duplicate key error.
            FLUSH LOGS;
            INSERT INTO t1 VALUES (3);
            INSERT INTO t1 VALUES (4);
            SELECT * FROM t1 ORDER BY a;
            --source include/save_master_gtid.inc
             
            --connection server_2
            --let $slave_sql_errno= 1062
            --let $slave_timeout= 10
            --source include/wait_for_slave_sql_error.inc
             
            --connection server_2
            SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
            --source include/stop_slave_io.inc
            SET GLOBAL slave_parallel_threads=1;
            START SLAVE;
             
            --let $slave_sql_errno= 1062
            --let $slave_timeout= 10
            --source include/wait_for_slave_sql_error.inc
             
            # Note: IO thread is still running at this point.
            # The bug seems to have been that restarting the SQL thread after an error with
            # the IO thread still running, somehow picks up a later relay log position and
            # thus ends up skipping the failing event, rather than re-executing.
             
            START SLAVE SQL_THREAD;
            --let $slave_sql_errno= 1062
            --let $slave_timeout= 10
            --source include/wait_for_slave_sql_error.inc
             
            SELECT * FROM t1 ORDER BY a;
             
            # Skip the duplicate error, so we can proceed.
            SET sql_slave_skip_counter= 1;
            --source include/start_slave.inc
            --source include/sync_with_master_gtid.inc
             
            SELECT * FROM t1 ORDER BY a;
             
             
            # Clean up.
            --connection server_2
            --source include/stop_slave.inc
            SET GLOBAL slave_parallel_threads=@old_parallel_threads;
            --source include/start_slave.inc
             
            --connection server_1
            DROP TABLE t1;
             
            --source include/rpl_end.inc

            knielsen Kristian Nielsen added a comment - - edited Here is an MTR test case to reproduce. The bug seems to occur if: The slave aborts with an error in parallel replication mode The failing event group is the last in a master's binlog, and another later binlog exists The slave SQL thread is restarted without restarting the IO thread in-between the failure and the restart. I suspect the problem is that the problem is related to the Rotate_event after the transaction that is executed out-of-band with the other transactions. I guess it incorrectly updates the relay log position to after the error, causing the problem. But I will investigate more to know for sure. --source include/have_innodb.inc --let $rpl_topology=1->2 --source include/rpl_init.inc   --connection server_2 --source include/stop_slave.inc CHANGE MASTER TO master_use_gtid=slave_pos; --source include/start_slave.inc     --echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***   --connection server_1 ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES (1); --source include/save_master_gtid.inc   --connection server_2 --source include/sync_with_master_gtid.inc   # Force a duplicate key error on the slave. SET sql_log_bin= 0; INSERT INTO t1 VALUES (2); SET sql_log_bin= 1;   --connection server_1 INSERT INTO t1 VALUES (2); # Rotate the binlog; the bug is triggered when the master binlog file changes # after the event group that causes the duplicate key error. FLUSH LOGS; INSERT INTO t1 VALUES (3); INSERT INTO t1 VALUES (4); SELECT * FROM t1 ORDER BY a; --source include/save_master_gtid.inc   --connection server_2 --let $slave_sql_errno= 1062 --let $slave_timeout= 10 --source include/wait_for_slave_sql_error.inc   --connection server_2 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads; --source include/stop_slave_io.inc SET GLOBAL slave_parallel_threads=1; START SLAVE;   --let $slave_sql_errno= 1062 --let $slave_timeout= 10 --source include/wait_for_slave_sql_error.inc   # Note: IO thread is still running at this point. # The bug seems to have been that restarting the SQL thread after an error with # the IO thread still running, somehow picks up a later relay log position and # thus ends up skipping the failing event, rather than re-executing.   START SLAVE SQL_THREAD; --let $slave_sql_errno= 1062 --let $slave_timeout= 10 --source include/wait_for_slave_sql_error.inc   SELECT * FROM t1 ORDER BY a;   # Skip the duplicate error, so we can proceed. SET sql_slave_skip_counter= 1; --source include/start_slave.inc --source include/sync_with_master_gtid.inc   SELECT * FROM t1 ORDER BY a;     # Clean up. --connection server_2 --source include/stop_slave.inc SET GLOBAL slave_parallel_threads=@old_parallel_threads; --source include/start_slave.inc   --connection server_1 DROP TABLE t1;   --source include/rpl_end.inc

            Pushed to 10.0.14.

            The problem was that in parallel replication a Rotate_event could be executed
            from the relay log even after an earlier event had failed and caused
            replication to stop. This could cause the replication position to be advanced
            too far ahead, so that events would be lost and replication become incorrect.

            Now the Rotate_event is correctly skipped after an error occurs that cause the
            SQL thread to stop.

            While investigating this, I found a related bug MDEV-6589. That other bug
            concerns a similar problem with the replication position advancing too far
            ahead when restarting after error, but only concerns using parallel
            replication with different replication domains. Since the cause, as well as
            necessary fix, for that other bug is rather different, I filed it separately.

            The bug described here, where there is only one replication domain used,
            should be fixed now in current 10.0.14 code.

            knielsen Kristian Nielsen added a comment - Pushed to 10.0.14. The problem was that in parallel replication a Rotate_event could be executed from the relay log even after an earlier event had failed and caused replication to stop. This could cause the replication position to be advanced too far ahead, so that events would be lost and replication become incorrect. Now the Rotate_event is correctly skipped after an error occurs that cause the SQL thread to stop. While investigating this, I found a related bug MDEV-6589 . That other bug concerns a similar problem with the replication position advancing too far ahead when restarting after error, but only concerns using parallel replication with different replication domains. Since the cause, as well as necessary fix, for that other bug is rather different, I filed it separately. The bug described here, where there is only one replication domain used, should be fixed now in current 10.0.14 code.

            People

              knielsen Kristian Nielsen
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.