Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.0.15
Description
141201 8:59:25 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000116' at position 1146, relay log './frigg-relay-bin.000256' position: 1780
141201 8:59:25 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1734964078, event_type: 105
141201 8:59:25 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
141201 8:59:25 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
141201 8:59:25 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000116' position 1146
141201 8:59:25 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3310',replication started in log 'mysql-bin.000323' at position 4
I reproduced this by first preparing a binlog using RQG replication-dml. I
then set up a master-slave pair using this binlog. While the slave was running,
I executed STOP SLAVE and START SLAVE a couple of times, and eventually got
this error.
This was reproduced in non-GTID mode.
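For illustration, a minimal sketch of such a STOP SLAVE / START SLAVE cycling loop, written against the MySQL C API (libmysqlclient). The host, port, credentials, iteration count and sleep intervals below are placeholder assumptions for a local test slave, not values taken from the actual test run.

```cpp
// Repeatedly stop and restart the slave SQL/IO threads while the slave is
// applying its relay log.  Connection parameters are placeholders.
#include <mysql.h>
#include <cstdio>
#include <unistd.h>

int main() {
  MYSQL *conn = mysql_init(nullptr);
  if (!mysql_real_connect(conn, "127.0.0.1", "root", "", nullptr, 3311,
                          nullptr, 0)) {
    std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
    return 1;
  }
  for (int i = 0; i < 20; i++) {
    if (mysql_query(conn, "STOP SLAVE"))
      std::fprintf(stderr, "STOP SLAVE failed: %s\n", mysql_error(conn));
    sleep(1);   // leave the slave threads stopped briefly
    if (mysql_query(conn, "START SLAVE"))
      std::fprintf(stderr, "START SLAVE failed: %s\n", mysql_error(conn));
    sleep(2);   // let the SQL thread apply more of the relay log
  }
  mysql_close(conn);
  return 0;
}
```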
I have had reports of a similar problem from two users. Given how easy this
problem is to reproduce, it seems likely that it is the same as at least some
of the cases those users experienced.
Attachments
Activity
Field | Original Value | New Value
---|---|---
Status | Open [ 1 ] | In Progress [ 3 ]
Fix Version/s | | 10.0.16 [ 17900 ]
Resolution | | Fixed [ 1 ]
Status | In Progress [ 3 ] | Closed [ 6 ]
Workflow | MariaDB v2 [ 58800 ] | MariaDB v3 [ 62362 ]
Workflow | MariaDB v3 [ 62362 ] | MariaDB v4 [ 148568 ]
Ok, so the problem occurs when a transaction has committed and goes to update
the group relay log position. The relay log file name to set is taken from
Relay_log_info, so it is whatever position the SQL driver thread has
reached. Thus, if the SQL driver thread has already scheduled a transaction
from a following relay log file, then position updates for transactions in the
previous relay log file will be recorded with the wrong relay log file name.
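To make the mechanism concrete, here is a simplified sketch with hypothetical structure and function names; this is not the actual server code, just the shape of the bug. The offset written for the committed transaction is correct, but the file name is read from the shared driver-thread state, which may already point at the next relay log file. The "fixed" variant sketches the obvious repair (pairing the offset with the file the event actually came from); the real fix in the server may differ in detail.

```cpp
#include <string>
#include <cstdint>

// Hypothetical, simplified stand-ins for the real server structures.
struct Relay_log_info {
  // Position of the SQL *driver* thread, i.e. how far it has read ahead
  // while scheduling events to parallel worker threads.
  std::string driver_relay_log_name;
  uint64_t    driver_relay_log_pos;

  // Persisted "group" position: where the SQL thread restarts from.
  std::string group_relay_log_name;
  uint64_t    group_relay_log_pos;
};

struct Executed_event {
  std::string relay_log_name;   // file the event was actually read from
  uint64_t    end_pos;          // end position of the event in that file
};

// Buggy pattern: the offset comes from the committed event, but the file
// name comes from wherever the driver thread happens to be.  If the driver
// has already moved on to the next relay log file, the stored position
// pairs an old offset with the new file name.
void update_group_position_buggy(Relay_log_info *rli, const Executed_event &ev) {
  rli->group_relay_log_name = rli->driver_relay_log_name;  // wrong file name
  rli->group_relay_log_pos  = ev.end_pos;                  // old file's offset
}

// Corrected pattern: both the file name and the offset are taken from the
// event that was actually committed.
void update_group_position_fixed(Relay_log_info *rli, const Executed_event &ev) {
  rli->group_relay_log_name = ev.relay_log_name;
  rli->group_relay_log_pos  = ev.end_pos;
}
```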
The result is that if the SQL thread is stopped (by STOP SLAVE or due to an
error) near the end of a relay log, the recorded slave position can name the
wrong relay log file. When the SQL thread is later started again, it will start
in the wrong relay log file, most likely hitting a relay log read error. If we
are unlucky, the position from the previous relay log file might happen to be
valid in the following relay log file, and then the SQL thread will silently
skip a number of events, causing serious replication corruption.
This problem does not happen as easily in GTID mode (stopping both slave
threads resets the relay log position), which is probably why this serious
problem was not fixed until now.