[MDEV-9138] Relay log position corrupted with parallel replication after interrupting LOAD DATA on master - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.19, 10.0.23
Fix Version/s: 10.0.29, 10.1.20
Component/s: Replication
Labels:
- parallelslave
- replication

Description

A master was running a LOAD DATA statement, and the statement was interrupted by pressing CTRL-C on the client. This partially completed query caused errors on the slaves. For example, SHOW SLAVE STATUS showed:

root@server2 [(none)]> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Waiting for master to send event

                  Master_Host: server2

                  Master_User: replicator

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: binlog.140014

          Read_Master_Log_Pos: 88410906

               Relay_Log_File: relaylog.033127

                Relay_Log_Pos: 4

        Relay_Master_Log_File: binlog.140008

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1158

                   Last_Error: Query partially completed on the master (error on master: 1158) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-187227023-187218030-143925.data' IGNORE INTO  TABLE `sometab` CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' IGNORE 1 LINES  (`field1`, `field2`)'

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 20798609

              Relay_Log_Space: 3019418333

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1158

               Last_SQL_Error: Query partially completed on the master (error on master: 1158) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-187227023-187218030-143925.data' IGNORE INTO  TABLE `sometab` CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' IGNORE 1 LINES  (`field1`, `field2`)'

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 187218030

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 0-187226073-1732279097,999-186069024-2

1 row in set (0.00 sec)

The error log also showed the error:

151109 13:10:41 [ERROR] Slave SQL: Query partially completed on the master (error on master: 1158) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'LOAD DATA INFILE '../tmp/SQL_LOAD-187227023-187218030-143925.data' IGNORE INTO  TABLE `sometab` CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' IGNORE 1 LINES  (`field1`, `field2`)', Gtid 0-187218030-1732217639, Internal MariaDB error code: 1158

151109 13:10:41 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.140008' position 20798609; GTID position '0-187226073-1732217638,999-186069024-2'

151109 13:10:41 [Note] Error reading relay log event: slave SQL thread was killed

An error on the slave about a partially completed query is expected, but note that the master's error in this case was #1158 (ER_NET_READ_ERROR), and not error #1317 (ER_QUERY_INTERRUPTED), which would be the expected error when a query is interrupted.

When the partially completed query was skipped with "SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;" as the error message recommended, this caused something strange to happen on slaves that had slave_parallel_threads > 0:

151109 13:19:41 [Note] Slave SQL thread initialized, starting replication in log 'binlog.140008' at position 20798609, relay log '../log/relaylog.031235' position: 4; GTID position '0-187226073-1732217638,999-186069024-2'

151109 13:19:41 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='../log/relaylog.031235', relay_log_pos='4', master_log_name='binlog.140008', master_log_pos='20798609', GTID '0-187226073-1732217638,999-186069024-2'; and new position at relay_log_file='../log/relaylog.031235', relay_log_pos='1866', master_log_name='binlog.140007', master_log_pos='1582', GTID '0-187226073-1732162247,999-186069024-2';

The relay log position jumps ahead from (relaylog.031235, 4) to (relaylog.031235, 1866), but the master binlog position actually goes backwards from (binlog.140008, 20798609) to (binlog.140007, 1582) and the GTID position goes backwards from '0-187226073-1732217638,999-186069024-2' to 0-187226073-1732162247,999-186069024-2'. When replication is restarted, duplicate key errors are seen as old insert statements are re-run.

This did not happen on slaves that were not using parallel replication.

After closer inspection, it appeared that the values of (Relay_Log_File, Relay_Log_Pos) and (Relay_Master_Log_File, Exec_Master_Log_Pos) reported by SHOW SLAVE STATUS did not match up to the same logical position. It seems that the slave's relay log position is corrupted when parallel replication is enabled and LOAD DATA is cancelled on the master with error #1158.

When the error occurred on slaves using parallel replication, relay log position corruption could be avoided by using this process:

stop slave;
start slave io_thread; # This wipes the relay logs and download back the binlog from the failed LOAD DATA INFILE.
SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE SQL_THREAD;

Attachments

Issue Links

relates to

MDEV-10863 parallel replication tries to continue from wrong position after stopping and restarting slave threads

Closed

Activity

People

Assignee:: Lixun Peng

Reporter:: Geoff Montee (Inactive)

Votes:: 4 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2015-11-17 06:32

Updated:: 2020-08-25 16:33

Resolved:: 2017-05-25 05:21

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.