MariaDB Server / MDEV-7237

Parallel replication: incorrect relaylog position after stop/start the slave


    Description

      141201  8:59:25 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000116' at position 1146, relay log './frigg-relay-bin.000256' position: 1780
      141201  8:59:25 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1734964078, event_type: 105
      141201  8:59:25 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
      141201  8:59:25 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      141201  8:59:25 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000116' position 1146
      141201  8:59:25 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3310',replication started in log 'mysql-bin.000323' at position 4

      I reproduced this by first preparing a binlog using RQG replication-dml. I
      then set up a master-slave pair using this binlog. While the slave was
      running, I executed STOP SLAVE and START SLAVE a couple of times, and
      eventually got this error.

      This was reproduced in non-GTID mode.

      I have had a couple of reports of a similar problem from two users. Given
      how easy this problem is to reproduce, it seems likely that it is the same
      as at least some of the cases those users experienced.


        Activity

          knielsen Kristian Nielsen created issue -
          knielsen Kristian Nielsen made changes -
          Status: Open → In Progress

          knielsen Kristian Nielsen added a comment -

          Ok, so the problem is when a transaction has committed and goes to
          update the group relay log position. The relay log file name to set is
          taken from Relay_log_info, so it is whatever position the SQL driver
          thread has reached. Thus, if the SQL driver thread has already
          scheduled a transaction from a following relay log file, position
          updates for transactions in the previous relay log file will be
          written with the wrong relay log file name.
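
          A schematic C++ sketch may make the race concrete. This is
          illustrative pseudocode under my reading of the above, not the actual
          server source; the field names are only modeled on the real
          Relay_log_info members:

            #include <string>

            // Illustrative sketch of the race, not the actual MariaDB source.
            struct Relay_log_info_sketch {
              std::string event_relay_log_name;       // file the SQL driver thread
                                                      // is currently reading from
              std::string group_relay_log_name;       // file recorded for the last
                                                      // committed transaction
              unsigned long long group_relay_log_pos; // offset within that file
            };

            // SQL driver thread: reads ahead and schedules transactions to
            // worker threads; it can move on to the next relay log file before
            // earlier transactions have committed.
            void driver_switches_file(Relay_log_info_sketch *rli) {
              rli->event_relay_log_name = "./frigg-relay-bin.000257";
            }

            // Worker thread: commits a transaction that was read from the
            // previous relay log file. The bug: the file name comes from the
            // driver thread's current state, while the offset belongs to the
            // file the transaction was actually read from.
            void worker_updates_position(Relay_log_info_sketch *rli,
                                         unsigned long long pos_in_old_file) {
              rli->group_relay_log_name = rli->event_relay_log_name; // wrong file
              rli->group_relay_log_pos  = pos_in_old_file;           // old offset
            }

          On restart the SQL thread seeks to that stale offset inside the newer
          file, which typically lands in the middle of an event, producing a
          bogus length like the data_len in the "Event too big" error above.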

          The result is that if the SQL thread is stopped (by STOP SLAVE or due to an
          error) near the end of a relay log, then the slave position can have the wrong
          relay log file. Then when the SQL thread is later started again, it will start
          in the wrong relay log file, probably getting a relay log read error. If we
          are unlucky, the position in the previous relay log file might happen to work
          in the following relay log file, and then the SQL thread will skip a number of
          events, causing serious replication corruption.

          This problem does not happen as easily in GTID mode (stopping both
          slave threads resets the relay log position), which is probably why
          this serious problem was not fixed until now.

          knielsen Kristian Nielsen added a comment -

          Patch: http://lists.askmonty.org/pipermail/commits/2014-December/007104.html
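
          Judging from the analysis above, the fix must make each transaction
          carry the relay log coordinates it was actually read from, so the
          commit-time update no longer consults the driver thread's current
          state. Continuing the earlier sketch (hypothetical names, not the
          actual patch):

            // Fix idea (hypothetical, simplified): capture the coordinates at
            // scheduling time and use exactly those after commit.
            struct trx_relay_coords {
              std::string relay_log_name;        // file this transaction came from
              unsigned long long relay_log_pos;  // end position within that file
            };

            void worker_updates_position_fixed(Relay_log_info_sketch *rli,
                                               const trx_relay_coords &coords) {
              rli->group_relay_log_name = coords.relay_log_name; // matching file
              rli->group_relay_log_pos  = coords.relay_log_pos;  // and offset
            }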
          knielsen Kristian Nielsen made changes -
          Fix Version/s: 10.0.16
          knielsen Kristian Nielsen made changes -
          Resolution: Fixed
          Status: In Progress → Closed
          arjen Arjen Lentz added a comment -

          Sounds like our problem. Good catch Kristian!
          And indeed we already had the suspicion that when it did work, it was actually skipping (lots of) events.


          People

            Assignee: knielsen Kristian Nielsen
            Reporter: knielsen Kristian Nielsen