[MDEV-6551] Some replication errors are ignored if slave_parallel_threads > 0 - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.12
Fix Version/s: 10.0.14
Component/s: None
Labels:
- parallelslave

Description

The problem is observed on binary logs attached to ~~MDEV-6549~~.

put the attached set of binary logs into the master datadir;
start master server with log-bin=mysql-bin, other options can stay default;
start slave with default options;
start replication with MASTER_USE_GTID = current_pos (maybe GTID is not necessary in the scenario, I didn't try without it);
after a while, replication fails here:

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 127.0.0.1

                  Master_User: root

                  Master_Port: 9306

                Connect_Retry: 1

              Master_Log_File: mysql-bin.002277

          Read_Master_Log_Pos: 326

               Relay_Log_File: wheezy-64-relay-bin.002025

                Relay_Log_Pos: 613

        Relay_Master_Log_File: mysql-bin.001011

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1062

                   Last_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 326

              Relay_Log_Space: 8561745

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1062

               Last_SQL_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Current_Pos

                  Gtid_IO_Pos: 0-1-11874

The failure might be caused by an old replication bug or an erroneous test, it is not the point here.
Trying to run start slave again does not help, it still aborts at the same position (naturally).
Then,

run set global slave_parallel_threads=1;
run start slave

it fails again, but note that the position has advanced:

MariaDB [test]> show slave status \G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 127.0.0.1

                  Master_User: root

                  Master_Port: 9306

                Connect_Retry: 1

              Master_Log_File: mysql-bin.002277

          Read_Master_Log_Pos: 326

               Relay_Log_File: wheezy-64-relay-bin.002027

                Relay_Log_Pos: 613

        Relay_Master_Log_File: mysql-bin.001012

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1062

                   Last_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 326

              Relay_Log_Space: 8562815

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1062

               Last_SQL_Error: Error 'Duplicate entry '15' for key 'PRIMARY'' on query. Default database: 'test1'. Query: 'INSERT   INTO test1 .   /* table10_int  table0_int  t2_temp_myisam_25855 */ table10_innodb_int  ( `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` ) SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM test1 .   /* t2_temp_innodb_25855  table0_int_autoinc  table1_myisam_int_autoinc */ table10_innodb_int  AS A WHERE col_tinyint BETWEEN -82 AND -82 + 2 /**/ AND ( `col_varchar_257_binary` IS NULL OR `col_varchar_257_binary` <> ROW_COUNT() )  UNION SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`col_tinyint`,`col_tinyint_unsigned`,`col_varchar_257_binary`,`col_varchar_257_latin1`,`col_varchar_257_utf8`,`pk` FROM ( SELECT `col_bigint`,`col_bigint_unsigned`,`col_timestamp`,`c

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Current_Pos

                  Gtid_IO_Pos: 0-1-11874

1 row in set (0.00 sec)

run start slave yet another time. Now, the replication goes on:

MariaDB [test]> show slave status \G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 127.0.0.1

                  Master_User: root

                  Master_Port: 9306

                Connect_Retry: 1

              Master_Log_File: mysql-bin.002277

          Read_Master_Log_Pos: 326

               Relay_Log_File: wheezy-64-relay-bin.003535

                Relay_Log_Pos: 2332

        Relay_Master_Log_File: mysql-bin.001728

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 2045

              Relay_Log_Space: 3094464

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 397789

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Current_Pos

                  Gtid_IO_Pos: 0-1-11874

I don't think it's supposed to happen.

The binary logs were produced using 10.0 tree on Power 8 machine, and weren't anyhow tampered with.

The test is executed on a usual Debian machine, with this server:

revision-id: sergii@pisem.net-20140805163520-tz0w14h8uju0ti9n

revno: 4330

branch-nick: 10.0

Attachments

Issue Links

relates to

MDEV-6589 Incorrect relay log start position when restarting SQL thread after error in parallel replication

Closed

Activity

Ascending order - Click to sort in descending order

Kristian Nielsen added a comment - 2014-08-13 16:25

I was able to reproduce as described. Looks like a real issue.

The difference in SHOW SLAVE STATUS after the first restart with parallel replication is the Relay_Master_Log_File:

< Relay_Master_Log_File: mysql-bin.001011
—
> Relay_Master_Log_File: mysql-bin.001012

I suspect this requires that the IO thread not be restarted to reproduce, which may explain why it wasn't caught before. I will investigate.

Kristian Nielsen added a comment - 2014-08-13 16:25 I was able to reproduce as described. Looks like a real issue. The difference in SHOW SLAVE STATUS after the first restart with parallel replication is the Relay_Master_Log_File: < Relay_Master_Log_File: mysql-bin.001011 — > Relay_Master_Log_File: mysql-bin.001012 I suspect this requires that the IO thread not be restarted to reproduce, which may explain why it wasn't caught before. I will investigate.

Kristian Nielsen added a comment - 2014-08-14 10:48 - edited

Here is an MTR test case to reproduce.

The bug seems to occur if:

The slave aborts with an error in parallel replication mode

The failing event group is the last in a master's binlog, and another later
binlog exists

The slave SQL thread is restarted without restarting the IO thread
in-between the failure and the restart.

I suspect the problem is that the problem is related to the Rotate_event after
the transaction that is executed out-of-band with the other transactions. I
guess it incorrectly updates the relay log position to after the error,
causing the problem.

But I will investigate more to know for sure.

--source include/have_innodb.inc

--let $rpl_topology=1->2

--source include/rpl_init.inc

--connection server_2

--source include/stop_slave.inc

CHANGE MASTER TO master_use_gtid=slave_pos;

--source include/start_slave.inc

--echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***

--connection server_1

ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;

CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1);

--source include/save_master_gtid.inc

--connection server_2

--source include/sync_with_master_gtid.inc

# Force a duplicate key error on the slave.

SET sql_log_bin= 0;

INSERT INTO t1 VALUES (2);

SET sql_log_bin= 1;

--connection server_1

INSERT INTO t1 VALUES (2);

# Rotate the binlog; the bug is triggered when the master binlog file changes

# after the event group that causes the duplicate key error.

FLUSH LOGS;

INSERT INTO t1 VALUES (3);

INSERT INTO t1 VALUES (4);

SELECT * FROM t1 ORDER BY a;

--source include/save_master_gtid.inc

--connection server_2

--let $slave_sql_errno= 1062

--let $slave_timeout= 10

--source include/wait_for_slave_sql_error.inc

--connection server_2

SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;

--source include/stop_slave_io.inc

SET GLOBAL slave_parallel_threads=1;

START SLAVE;

--let $slave_sql_errno= 1062

--let $slave_timeout= 10

--source include/wait_for_slave_sql_error.inc

# Note: IO thread is still running at this point.

# The bug seems to have been that restarting the SQL thread after an error with

# the IO thread still running, somehow picks up a later relay log position and

# thus ends up skipping the failing event, rather than re-executing.

START SLAVE SQL_THREAD;

--let $slave_sql_errno= 1062

--let $slave_timeout= 10

--source include/wait_for_slave_sql_error.inc

SELECT * FROM t1 ORDER BY a;

# Skip the duplicate error, so we can proceed.

SET sql_slave_skip_counter= 1;

--source include/start_slave.inc

--source include/sync_with_master_gtid.inc

SELECT * FROM t1 ORDER BY a;

# Clean up.

--connection server_2

--source include/stop_slave.inc

SET GLOBAL slave_parallel_threads=@old_parallel_threads;

--source include/start_slave.inc

--connection server_1

DROP TABLE t1;

--source include/rpl_end.inc

Kristian Nielsen added a comment - 2014-08-14 10:48 - edited Here is an MTR test case to reproduce. The bug seems to occur if: The slave aborts with an error in parallel replication mode The failing event group is the last in a master's binlog, and another later binlog exists The slave SQL thread is restarted without restarting the IO thread in-between the failure and the restart. I suspect the problem is that the problem is related to the Rotate_event after the transaction that is executed out-of-band with the other transactions. I guess it incorrectly updates the relay log position to after the error, causing the problem. But I will investigate more to know for sure. --source include/have_innodb.inc --let $rpl_topology=1->2 --source include/rpl_init.inc --connection server_2 --source include/stop_slave.inc CHANGE MASTER TO master_use_gtid=slave_pos; --source include/start_slave.inc --echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 *** --connection server_1 ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB; CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES (1); --source include/save_master_gtid.inc --connection server_2 --source include/sync_with_master_gtid.inc # Force a duplicate key error on the slave. SET sql_log_bin= 0; INSERT INTO t1 VALUES (2); SET sql_log_bin= 1; --connection server_1 INSERT INTO t1 VALUES (2); # Rotate the binlog; the bug is triggered when the master binlog file changes # after the event group that causes the duplicate key error. FLUSH LOGS; INSERT INTO t1 VALUES (3); INSERT INTO t1 VALUES (4); SELECT * FROM t1 ORDER BY a; --source include/save_master_gtid.inc --connection server_2 --let $slave_sql_errno= 1062 --let $slave_timeout= 10 --source include/wait_for_slave_sql_error.inc --connection server_2 SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads; --source include/stop_slave_io.inc SET GLOBAL slave_parallel_threads=1; START SLAVE; --let $slave_sql_errno= 1062 --let $slave_timeout= 10 --source include/wait_for_slave_sql_error.inc # Note: IO thread is still running at this point. # The bug seems to have been that restarting the SQL thread after an error with # the IO thread still running, somehow picks up a later relay log position and # thus ends up skipping the failing event, rather than re-executing. START SLAVE SQL_THREAD; --let $slave_sql_errno= 1062 --let $slave_timeout= 10 --source include/wait_for_slave_sql_error.inc SELECT * FROM t1 ORDER BY a; # Skip the duplicate error, so we can proceed. SET sql_slave_skip_counter= 1; --source include/start_slave.inc --source include/sync_with_master_gtid.inc SELECT * FROM t1 ORDER BY a; # Clean up. --connection server_2 --source include/stop_slave.inc SET GLOBAL slave_parallel_threads=@old_parallel_threads; --source include/start_slave.inc --connection server_1 DROP TABLE t1; --source include/rpl_end.inc

Kristian Nielsen added a comment - 2014-08-15 13:39

Pushed to 10.0.14.

The problem was that in parallel replication a Rotate_event could be executed
from the relay log even after an earlier event had failed and caused
replication to stop. This could cause the replication position to be advanced
too far ahead, so that events would be lost and replication become incorrect.

Now the Rotate_event is correctly skipped after an error occurs that cause the
SQL thread to stop.

While investigating this, I found a related bug ~~MDEV-6589~~. That other bug
concerns a similar problem with the replication position advancing too far
ahead when restarting after error, but only concerns using parallel
replication with different replication domains. Since the cause, as well as
necessary fix, for that other bug is rather different, I filed it separately.

The bug described here, where there is only one replication domain used,
should be fixed now in current 10.0.14 code.

Kristian Nielsen added a comment - 2014-08-15 13:39 Pushed to 10.0.14. The problem was that in parallel replication a Rotate_event could be executed from the relay log even after an earlier event had failed and caused replication to stop. This could cause the replication position to be advanced too far ahead, so that events would be lost and replication become incorrect. Now the Rotate_event is correctly skipped after an error occurs that cause the SQL thread to stop. While investigating this, I found a related bug MDEV-6589 . That other bug concerns a similar problem with the replication position advancing too far ahead when restarting after error, but only concerns using parallel replication with different replication domains. Since the cause, as well as necessary fix, for that other bug is rather different, I filed it separately. The bug described here, where there is only one replication domain used, should be fixed now in current 10.0.14 code.

MariaDB Server

Some replication errors are ignored if slave_parallel_threads > 0

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration