[MDEV-9138] Relay log position corrupted with parallel replication after interrupting LOAD DATA on master - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.19, 10.0.23
Fix Version/s: 10.0.29, 10.1.20
Component/s: Replication
Labels:
- parallelslave
- replication

Description

A master was running a LOAD DATA statement, and the statement was interrupted by pressing CTRL-C on the client. This partially completed query caused errors on the slaves. For example, SHOW SLAVE STATUS showed:

root@server2 [(none)]> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: server2

                  Master_User: replicator

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: binlog.140014

          Read_Master_Log_Pos: 88410906

               Relay_Log_File: relaylog.033127

                Relay_Log_Pos: 4

        Relay_Master_Log_File: binlog.140008

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1158

                   Last_Error: Query partially completed on the master (error on master: 1158) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-187227023-187218030-143925.data' IGNORE INTO  TABLE `sometab` CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' IGNORE 1 LINES  (`field1`, `field2`)'

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 20798609

              Relay_Log_Space: 3019418333

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1158

               Last_SQL_Error: Query partially completed on the master (error on master: 1158) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-187227023-187218030-143925.data' IGNORE INTO  TABLE `sometab` CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' IGNORE 1 LINES  (`field1`, `field2`)'

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 187218030

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 0-187226073-1732279097,999-186069024-2

1 row in set (0.00 sec)

The error log also showed the error:

151109 13:10:41 [ERROR] Slave SQL: Query partially completed on the master (error on master: 1158) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-187227023-187218030-143925.data' IGNORE INTO  TABLE `sometab` CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' IGNORE 1 LINES  (`field1`, `field2`)', Gtid 0-187218030-1732217639, Internal MariaDB error code: 1158

151109 13:10:41 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.140008' position 20798609; GTID position '0-187226073-1732217638,999-186069024-2'

151109 13:10:41 [Note] Error reading relay log event: slave SQL thread was killed

An error on the slave about a partially completed query is expected, but note that the master's error in this case was #1158 (ER_NET_READ_ERROR), and not error #1317 (ER_QUERY_INTERRUPTED), which would be the expected error when a query is interrupted.

When the partially completed query was skipped with "SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;" as the error message recommended, this caused something strange to happen on slaves that had slave_parallel_threads > 0:

151109 13:19:41 [Note] Slave SQL thread initialized, starting replication in log 'binlog.140008' at position 20798609, relay log '../log/relaylog.031235' position: 4; GTID position '0-187226073-1732217638,999-186069024-2'

151109 13:19:41 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='../log/relaylog.031235', relay_log_pos='4', master_log_name='binlog.140008', master_log_pos='20798609', GTID '0-187226073-1732217638,999-186069024-2'; and new position at relay_log_file='../log/relaylog.031235', relay_log_pos='1866', master_log_name='binlog.140007', master_log_pos='1582', GTID '0-187226073-1732162247,999-186069024-2';

The relay log position jumps ahead from (relaylog.031235, 4) to (relaylog.031235, 1866), but the master binlog position actually goes backwards from (binlog.140008, 20798609) to (binlog.140007, 1582) and the GTID position goes backwards from '0-187226073-1732217638,999-186069024-2' to 0-187226073-1732162247,999-186069024-2'. When replication is restarted, duplicate key errors are seen as old insert statements are re-run.

This did not happen on slaves that were not using parallel replication.

After closer inspection, it appeared that the values of (Relay_Log_File, Relay_Log_Pos) and (Relay_Master_Log_File, Exec_Master_Log_Pos) reported by SHOW SLAVE STATUS did not match up to the same logical position. It seems that the slave's relay log position is corrupted when parallel replication is enabled and LOAD DATA is cancelled on the master with error #1158.

When the error occurred on slaves using parallel replication, relay log position corruption could be avoided by using this process:

stop slave;
start slave io_thread; # This wipes the relay logs and download back the binlog from the failed LOAD DATA INFILE.
SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE SQL_THREAD;

Attachments

Issue Links

relates to

MDEV-10863 parallel replication tries to continue from wrong position after stopping and restarting slave threads

Closed

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2015-11-18 14:10

I've spawned the problem with Ctrl-C (wrong behavior and wrong error code) into ~~MDEV-9149~~.

Elena Stepanova added a comment - 2015-11-18 14:10 I've spawned the problem with Ctrl-C (wrong behavior and wrong error code) into MDEV-9149 .

Elena Stepanova added a comment - 2015-11-18 18:57

For the behavior of SQL_SLAVE_SKIP_COUNTER on a parallel slave, I am assigning it to Kristian who as I expect can expect when and why such transformations can happen.

Elena Stepanova added a comment - 2015-11-18 18:57 For the behavior of SQL_SLAVE_SKIP_COUNTER on a parallel slave, I am assigning it to Kristian who as I expect can expect when and why such transformations can happen.

Geoff Montee (Inactive) added a comment - 2015-12-23 19:19 - edited

Here's a way to reproduce the issue:

1.) Set up replication between a MariaDB 10.0 master and slave.

I used the following master configuration file:

[mariadb-10.0]

server_id=1

log_bin=mariadb-bin

binlog_format=STATEMENT

And the following slave configuration file:

[mariadb-10.0]

server_id=2

log_bin=mariadb-bin

binlog_format=STATEMENT

And then I created a user with sufficient privileges on the master:

CREATE USER 'repl'@'%' IDENTIFIED BY 'password';

GRANT ALL PRIVILEGES ON *.* TO 'repl'@'%';

And then set up replication on the slave:

SET GLOBAL gtid_slave_pos='0-1-2';

CHANGE MASTER TO Master_host='mymasterip', Master_user='repl', Master_password='password', master_use_gtid=slave_pos;

START SLAVE;

SHOW SLAVE STATUS\G

2.) Execute the following on the master:

mysql -u root << EOF

CREATE DATABASE test_10453;

SET gtid_domain_id=1; DROP DATABASE IF EXISTS test_10453a;

SET GLOBAL binlog_commit_wait_count=30;

SET GLOBAL binlog_commit_wait_usec=100000;

EOF

3.) Execute the following on the slave:

mysql -u root <<< "STOP SLAVE; SET GLOBAL slave_parallel_threads = 20; START SLAVE; SHOW SLAVE STATUS\G"

After this step, I had output like the following:

               Slave_IO_State: Checking master version

                  Master_Host: mymasterip

                  Master_User: repl

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.000016

          Read_Master_Log_Pos: 779

               Relay_Log_File: ip-172-31-5-186-relay-bin.000001

                Relay_Log_Pos: 4

        Relay_Master_Log_File: mariadb-bin.000016

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 779

              Relay_Log_Space: 248

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 1-1-7,0-1-768964,2-1-3

4.) Generate some fake data on the master by executing the following:

seq -f "%.0f,\\N" 1 1000000 > fake_data.csv

5.) To reproduce the issue, we need to have some concurrent queries running in the background. Execute this on the master to spawn some mysql processes that will insert a bunch of data:

for n in $(seq 1 20); do

    mysql -u root <<< "CREATE TABLE test_10453.t$n (id BIGINT PRIMARY KEY, id2 BIGINT DEFAULT NULL);"

    seq 1 10000 | ( while read a; do echo "INSERT into test_10453.t$n VALUES ($a, NULL);"; done; ) | mysql -u root &

done

6.) While the processes from step 5 are running, execute the following on the master to start executing LOAD DATA:

mysql -u root << EOF

CREATE TABLE test_10453.m (id BIGINT PRIMARY KEY, id2 BIGINT DEFAULT NULL) ENGINE=MyISAM;

LOAD DATA LOCAL INFILE 'fake_data.csv' INTO TABLE test_10453.m FIELDS TERMINATED BY ',';

EOF

Then cancel the LOAD DATA by pressing CTRL-C.

7.) After cancelling the LOAD DATA, the slave will fail with error #1317:

MariaDB [(none)]> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: mymasterip

                  Master_User: repl

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.000016

          Read_Master_Log_Pos: 123296888

               Relay_Log_File: ip-172-31-5-186-relay-bin.000001

                Relay_Log_Pos: 4

        Relay_Master_Log_File: mariadb-bin.000016

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1317

                   Last_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO  TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)'

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 12891357

              Relay_Log_Space: 123297175

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1317

               Last_SQL_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO  TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)'

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 1-1-7,0-1-794457,2-1-3

1 row in set (0.00 sec)

We can try to skip this event in the binlog, but then we will see the same error:

MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;

Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> START SLAVE SQL_THREAD;

Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: mymasterip

                  Master_User: repl

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.000016

          Read_Master_Log_Pos: 129767754

               Relay_Log_File: ip-172-31-5-186-relay-bin.000001

                Relay_Log_Pos: 4

        Relay_Master_Log_File: mariadb-bin.000016

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1317

                   Last_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO  TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)'

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 951

              Relay_Log_Space: 129768041

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1317

               Last_SQL_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO  TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)'

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 1-1-7,0-1-832177,2-1-3

1 row in set (0.00 sec)

If we try to skip another event, now we see a duplicate key error:

MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;

Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> START SLAVE SQL_THREAD;

Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: mymasterip

                  Master_User: repl

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.000016

          Read_Master_Log_Pos: 131496978

               Relay_Log_File: ip-172-31-5-186-relay-bin.000001

                Relay_Log_Pos: 4

        Relay_Master_Log_File: mariadb-bin.000016

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1062

                   Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT into test_10453.t1 VALUES (1, NULL)'

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 1123

              Relay_Log_Space: 131497265

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1062

               Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT into test_10453.t1 VALUES (1, NULL)'

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 1

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 1-1-7,0-1-842257,2-1-3

1 row in set (0.00 sec)

Geoff Montee (Inactive) added a comment - 2015-12-23 19:19 - edited Here's a way to reproduce the issue: 1.) Set up replication between a MariaDB 10.0 master and slave. I used the following master configuration file: [mariadb-10.0] server_id=1 log_bin=mariadb-bin binlog_format=STATEMENT And the following slave configuration file: [mariadb-10.0] server_id=2 log_bin=mariadb-bin binlog_format=STATEMENT And then I created a user with sufficient privileges on the master: CREATE USER 'repl'@'%' IDENTIFIED BY 'password'; GRANT ALL PRIVILEGES ON *.* TO 'repl'@'%'; And then set up replication on the slave: SET GLOBAL gtid_slave_pos='0-1-2'; CHANGE MASTER TO Master_host='mymasterip', Master_user='repl', Master_password='password', master_use_gtid=slave_pos; START SLAVE; SHOW SLAVE STATUS\G 2.) Execute the following on the master: mysql -u root << EOF CREATE DATABASE test_10453; SET gtid_domain_id=1; DROP DATABASE IF EXISTS test_10453a; SET GLOBAL binlog_commit_wait_count=30; SET GLOBAL binlog_commit_wait_usec=100000; EOF 3.) Execute the following on the slave: mysql -u root <<< "STOP SLAVE; SET GLOBAL slave_parallel_threads = 20; START SLAVE; SHOW SLAVE STATUS\G" After this step, I had output like the following: Slave_IO_State: Checking master version Master_Host: mymasterip Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mariadb-bin.000016 Read_Master_Log_Pos: 779 Relay_Log_File: ip-172-31-5-186-relay-bin.000001 Relay_Log_Pos: 4 Relay_Master_Log_File: mariadb-bin.000016 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 779 Relay_Log_Space: 248 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 1-1-7,0-1-768964,2-1-3 4.) Generate some fake data on the master by executing the following: seq -f "%.0f,\\N" 1 1000000 > fake_data.csv 5.) To reproduce the issue, we need to have some concurrent queries running in the background. Execute this on the master to spawn some mysql processes that will insert a bunch of data: for n in $(seq 1 20); do mysql -u root <<< "CREATE TABLE test_10453.t$n (id BIGINT PRIMARY KEY, id2 BIGINT DEFAULT NULL);" seq 1 10000 | ( while read a; do echo "INSERT into test_10453.t$n VALUES ($a, NULL);"; done; ) | mysql -u root & done 6.) While the processes from step 5 are running, execute the following on the master to start executing LOAD DATA: mysql -u root << EOF CREATE TABLE test_10453.m (id BIGINT PRIMARY KEY, id2 BIGINT DEFAULT NULL) ENGINE=MyISAM; LOAD DATA LOCAL INFILE 'fake_data.csv' INTO TABLE test_10453.m FIELDS TERMINATED BY ','; EOF Then cancel the LOAD DATA by pressing CTRL-C. 7.) After cancelling the LOAD DATA, the slave will fail with error #1317: MariaDB [(none)]> SHOW SLAVE STATUS\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: mymasterip Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mariadb-bin.000016 Read_Master_Log_Pos: 123296888 Relay_Log_File: ip-172-31-5-186-relay-bin.000001 Relay_Log_Pos: 4 Relay_Master_Log_File: mariadb-bin.000016 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1317 Last_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)' Skip_Counter: 0 Exec_Master_Log_Pos: 12891357 Relay_Log_Space: 123297175 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1317 Last_SQL_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 1-1-7,0-1-794457,2-1-3 1 row in set (0.00 sec) We can try to skip this event in the binlog, but then we will see the same error: MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; Query OK, 0 rows affected (0.00 sec) MariaDB [(none)]> START SLAVE SQL_THREAD; Query OK, 0 rows affected (0.00 sec) MariaDB [(none)]> SHOW SLAVE STATUS\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: mymasterip Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mariadb-bin.000016 Read_Master_Log_Pos: 129767754 Relay_Log_File: ip-172-31-5-186-relay-bin.000001 Relay_Log_Pos: 4 Relay_Master_Log_File: mariadb-bin.000016 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1317 Last_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)' Skip_Counter: 0 Exec_Master_Log_Pos: 951 Relay_Log_Space: 129768041 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1317 Last_SQL_Error: Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '/tmp/SQL_LOAD-2-1-2.data' IGNORE INTO TABLE `test_10453`.`m` FIELDS TERMINATED BY ',' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`id`, `id2`)' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 1-1-7,0-1-832177,2-1-3 1 row in set (0.00 sec) If we try to skip another event, now we see a duplicate key error: MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; Query OK, 0 rows affected (0.00 sec) MariaDB [(none)]> START SLAVE SQL_THREAD; Query OK, 0 rows affected (0.00 sec) MariaDB [(none)]> SHOW SLAVE STATUS\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: mymasterip Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mariadb-bin.000016 Read_Master_Log_Pos: 131496978 Relay_Log_File: ip-172-31-5-186-relay-bin.000001 Relay_Log_Pos: 4 Relay_Master_Log_File: mariadb-bin.000016 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1062 Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT into test_10453.t1 VALUES (1, NULL)' Skip_Counter: 0 Exec_Master_Log_Pos: 1123 Relay_Log_Space: 131497265 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1062 Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: ''. Query: 'INSERT into test_10453.t1 VALUES (1, NULL)' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 1-1-7,0-1-842257,2-1-3 1 row in set (0.00 sec)

Kristian Nielsen added a comment - 2016-01-15 15:51

This might be fixed by this commit, at least it looks related (multi-dimensional GTID position with parallel replication, and slave position being incorrect at second restart attempt):

http://lists.askmonty.org/pipermail/commits/2016-January/008828.html

sql_slave_skip_counter is not really useable in these kinds of scenarios, because it does not allow to specify which domain to skip an event in. So when the position is multi-dimentional, sql_slave_skip_counter does not really make sense. The above commit makes setting sql_slave_skip_counter an error in this case. The alternative way is to stop the IO thread and explicitly set @@gtid_slave_pos to point to the event to be skipped.

When parallel replication stops with an error and multiple domains are in use, the relay log position is modified to point back at the start of the earliest still active relay log file. This is so that all domains will be re-executed correctly; extra events before the relevant positions in each domain will be skipped (the above commit fixes one problem where this skipping was not done correctly, causing re-apply of earlier events).

Kristian Nielsen added a comment - 2016-01-15 15:51 This might be fixed by this commit, at least it looks related (multi-dimensional GTID position with parallel replication, and slave position being incorrect at second restart attempt): http://lists.askmonty.org/pipermail/commits/2016-January/008828.html sql_slave_skip_counter is not really useable in these kinds of scenarios, because it does not allow to specify which domain to skip an event in. So when the position is multi-dimentional, sql_slave_skip_counter does not really make sense. The above commit makes setting sql_slave_skip_counter an error in this case. The alternative way is to stop the IO thread and explicitly set @@gtid_slave_pos to point to the event to be skipped. When parallel replication stops with an error and multiple domains are in use, the relay log position is modified to point back at the start of the earliest still active relay log file. This is so that all domains will be re-executed correctly; extra events before the relevant positions in each domain will be skipped (the above commit fixes one problem where this skipping was not done correctly, causing re-apply of earlier events).

Jean-François Gagné added a comment - 2016-01-15 18:33

Hi Kristian, thanks for the fix.

About sql_slave_skip_counter with out-or-order // replication, instead of playing with @@gtid_slave_pos, wouldn't it be more simple to:
1. restart replication with slave_parallel_threads = 0 until it breaks again
2. skip the transaction with the usual way ?

For that to work, skipping should be disable if:
-last stop slave saved a GTID context,
-or more than one domain id and slave_parallel_thread > 0.

JFG

Jean-François Gagné added a comment - 2016-01-15 18:33 Hi Kristian, thanks for the fix. About sql_slave_skip_counter with out-or-order // replication, instead of playing with @@gtid_slave_pos, wouldn't it be more simple to: 1. restart replication with slave_parallel_threads = 0 until it breaks again 2. skip the transaction with the usual way ? For that to work, skipping should be disable if: -last stop slave saved a GTID context, -or more than one domain id and slave_parallel_thread > 0. JFG

Kristian Nielsen added a comment - 2016-01-15 19:00

> About sql_slave_skip_counter with out-or-order // replication, instead of
> playing with @@gtid_slave_pos, wouldn't it be more simple to:
> 1. restart replication with slave_parallel_threads = 0 until it breaks again
> 2. skip the transaction with the usual way ?

I think that should work, yes.

> For that to work, skipping should be disable if:
> -last stop slave saved a GTID context,
> -or more than one domain id and slave_parallel_thread > 0.

This is what the patch does, if I understand you correctly.

Kristian Nielsen added a comment - 2016-01-15 19:00 > About sql_slave_skip_counter with out-or-order // replication, instead of > playing with @@gtid_slave_pos, wouldn't it be more simple to: > 1. restart replication with slave_parallel_threads = 0 until it breaks again > 2. skip the transaction with the usual way ? I think that should work, yes. > For that to work, skipping should be disable if: > -last stop slave saved a GTID context, > -or more than one domain id and slave_parallel_thread > 0. This is what the patch does, if I understand you correctly.

Jean-François Gagné added a comment - 2016-01-15 19:15

Does the patch cover the following case ?
1. an out-or-order // replicating slave break --> SQL thread stopped, but io thread running.
2. set global slave_parallel_workers = 0
3. set global sql_slave_skip_counter = 1
4. start slave

It is clear to me that is we omit #2, #3 will cause an error. It is also clear to me that if we omit #3, things should go as planned (break again with a reliable position and skip). But if we have both #2 and #3 in this order, we still have a problem (if #3 does not cause an error, which is my understanding of the code).

I might be reading the code wrong though...

Jean-François Gagné added a comment - 2016-01-15 19:15 Does the patch cover the following case ? 1. an out-or-order // replicating slave break --> SQL thread stopped, but io thread running. 2. set global slave_parallel_workers = 0 3. set global sql_slave_skip_counter = 1 4. start slave It is clear to me that is we omit #2, #3 will cause an error. It is also clear to me that if we omit #3, things should go as planned (break again with a reliable position and skip). But if we have both #2 and #3 in this order, we still have a problem (if #3 does not cause an error, which is my understanding of the code). I might be reading the code wrong though...

Kristian Nielsen added a comment - 2016-01-15 20:17

Hm, I thought that changing slave_parallel_workers required that IO thread was stopped. Not sure why that is allowed, that seems dangerous... like in the scenario you describe.

Kristian Nielsen added a comment - 2016-01-15 20:17 Hm, I thought that changing slave_parallel_workers required that IO thread was stopped. Not sure why that is allowed, that seems dangerous... like in the scenario you describe.

Kristian Nielsen added a comment - 2016-01-18 12:38

I think the problem goes back to this commit:

https://github.com/MariaDB/server/commit/572560f38c248d5020f0e63aeb3f8905cd568208

That patch replaces checks for both slave threads being stopped with a
function that checks only if the SQL thread has been stopped.
This function was subsequently incorrectly used for similar checks in new
code.

Kristian Nielsen added a comment - 2016-01-18 12:38 I think the problem goes back to this commit: https://github.com/MariaDB/server/commit/572560f38c248d5020f0e63aeb3f8905cd568208 That patch replaces checks for both slave threads being stopped with a function that checks only if the SQL thread has been stopped. This function was subsequently incorrectly used for similar checks in new code.

Jean-François Gagné added a comment - 2016-01-18 12:56

I am not sure I understand: do you think slave_parallel_threads should only be changed when both IO and SQL threads are stopped or only the SQL thread.

I think only stopping the SQL thread should be needed for details explained in:
http://jfg-mysql.blogspot.nl/2015/10/bad-commands-with-mariadb-gtids-2.html

Jean-François Gagné added a comment - 2016-01-18 12:56 I am not sure I understand: do you think slave_parallel_threads should only be changed when both IO and SQL threads are stopped or only the SQL thread. I think only stopping the SQL thread should be needed for details explained in: http://jfg-mysql.blogspot.nl/2015/10/bad-commands-with-mariadb-gtids-2.html

Kristian Nielsen added a comment - 2016-01-18 14:21

Yes, slave_parallel_threads must only be changed when all slave threads are stopped. To solve the issue you refer to, it should be implemented that the IO and SQL threads can restart on an existing relay log file without deleting and re-fetching it (in GTID mode). This would be desirable for a number of reasons, but is not implemented currently.

Kristian Nielsen added a comment - 2016-01-18 14:21 Yes, slave_parallel_threads must only be changed when all slave threads are stopped. To solve the issue you refer to, it should be implemented that the IO and SQL threads can restart on an existing relay log file without deleting and re-fetching it (in GTID mode). This would be desirable for a number of reasons, but is not implemented currently.

Geoff Montee (Inactive) added a comment - 2017-02-01 17:44

I've been told that this issue is not reproducible anymore in MariaDB 10.0.29 and 10.1.21.

Geoff Montee (Inactive) added a comment - 2017-02-01 17:44 I've been told that this issue is not reproducible anymore in MariaDB 10.0.29 and 10.1.21.

Jean-François Gagné added a comment - 2017-04-22 01:19

I think this can be closed as solved in ~~MDEV-10863~~.

Jean-François Gagné added a comment - 2017-04-22 01:19 I think this can be closed as solved in MDEV-10863 .

Lixun Peng added a comment - 2017-05-25 05:21

The same as ~~MDEV-10863~~.

Lixun Peng added a comment - 2017-05-25 05:21 The same as MDEV-10863 .

People

Assignee:: Lixun Peng

Reporter:: Geoff Montee (Inactive)

Votes:: 4 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2015-11-17 06:32

Updated:: 2024-07-08 01:11

Resolved:: 2017-05-25 05:21

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server