[MDEV-5217] Assorted failures with parallel replication on concurrent DML/DDL flow and periodic FLUSH LOGS - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 10.0.6
Component/s: None
Labels:
None

Description

I've got several types of failures by running the provided test with the same parameters over and over again. Here is the statistics, more details on each type of the failure below:

Out of 10 consequent runs,

Error in Log_event::read_log_event(): 'Event too big'	1
Error in Log_event::read_log_event(): 'read error'	4
Error in Log_event::read_log_event(): 'Event too small'	2
Error reading relay log event: Aborting slave SQL thread because of partial event read	1
Could not execute .... event (Can't find record)	2

In addition to "Could not execute .... event", I also at least once had a subsequent crash with "safe_mutex: Trying to destroy a mutex LOCK_wait_commit that was locked".

Examples of error messages:

131031 11:25:27 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 2818048, event_type: 0

131031 11:25:27 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error

131031 11:25:27 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594

131031 11:25:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000019' position 287

131031 11:19:17 [ERROR] Slave SQL thread: I/O error reading event(errno: 175  cur_log->error: 12)

131031 11:19:17 [ERROR] Error reading relay log event: Aborting slave SQL thread because of partial event read

131031 11:19:17 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594

131031 11:19:17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000048' position 287

131031 10:54:58 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.DD; Can't find record in 'DD', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000033, end_log_pos 688, Internal MariaDB error code: 1032

131031 10:54:58 [ERROR] Slave SQL: Can't find record in 'DD', Internal MariaDB error code: 1032

131031 10:54:58 [Warning] Slave: Can't find record in 'DD' Error_code: 1032

131031 10:54:58 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000033' position 326

131031 10:51:49 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0

131031 10:51:49 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error

131031 10:51:49 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594

131031 10:51:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000052' position 326

Interestingly, in this case slave didn't stop but later also issued another error:

131031 10:31:22 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1987208563, event_type: 100

131031 10:31:22 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error

131031 10:31:22 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594

131031 10:31:22 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000022' position 4

131031 10:35:55 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.CC; Can't find record in 'CC', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000081, end_log_pos 1184, Internal MariaDB error code: 1032

131031 10:35:55 [Warning] Slave: Can't find record in 'CC' Error_code: 1032

131031 10:35:55 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000081' position 287

131031 10:35:55 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.000081' at position 326

safe_mutex: Trying to destroy a mutex LOCK_wait_commit that was locked at /home/elenst/bzr/10.0-knielsen/sql/sql_class.cc, line 5805 at /home/elenst/bzr/10.0-knielsen/sql/sql_class.cc, line 5631

131031  3:45:35 [ERROR] mysqld got signal 6 ;

RQG grammar (parallel-replication-6.yy):

query:

	dml_ddl | dml_ddl | dml_ddl | dml_ddl | dml_ddl |

	FLUSH LOGS

create_drop_table:

	CREATE TABLE IF NOT EXISTS { 't_'.abs($$) } LIKE _table |

	DROP TABLE IF EXISTS { 't_'.abs($$) } ;

dml_ddl:

	delete | insert | create_drop_table ;

delete:

	DELETE FROM _table WHERE `col_int_nokey` BETWEEN _tinyint[invariant] AND _tinyint[invariant] + 2 ;

insert:

	INSERT INTO _table ( `col_int_key` , `col_int_nokey` )   VALUES ( _tinyint, _tinyint ), ( _tinyint, _tinyint ) ;

RQG command line (works on the standard lp:randgen tree):

perl ./runall-new.pl --grammar=parallel-replication-6.yy  --threads=12 --queries=100M --duration=300  --mysqld=--slave-parallel-threads=8 --rpl_mode=row --basedir=<your basedir> --vardir=<your vardir>

revision-id: knielsen@knielsen-hq.org-20131030065230-kp8dykgyeth6ma55

revno: 3690

branch-nick: 10.0-knielsen

Attachments

Issue Links

relates to

MDEV-4506 MWL#184: Parallel replication of group-committed transactions

Closed

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2013-10-31 15:18 - edited

I'll run the test again with the bugfix for ~~MDEV-5206~~, maybe it will help here as well.

Upd: it didn't help, I'm still getting the errors on revno 3691.

Upd2: Same happens if I remove FLUSH LOGS from the flow at all, and instead reduce the value of max-binlog-size that that it rotates frequently enough (e.g. set it to 4096).

Elena Stepanova added a comment - 2013-10-31 15:18 - edited I'll run the test again with the bugfix for MDEV-5206 , maybe it will help here as well. Upd: it didn't help, I'm still getting the errors on revno 3691. Upd2: Same happens if I remove FLUSH LOGS from the flow at all, and instead reduce the value of max-binlog-size that that it rotates frequently enough (e.g. set it to 4096).

Kristian Nielsen added a comment - 2013-11-06 16:23

Ok, this was a great testcase, there were a lot of bugs uncovered. Everytime I
fixed one and re-ran the test, another turned up

I seem to have fixed all issues seen with this particular test now, at least I
was able to run the runall command 10 times or so in a row without any
failures. I also think I have fixes for all the problems you mentioned (or at
least similar-looking ones):

I fixed several things in error handling, where an error stop would be
handled incorrectly and cause the server to hang or otherwise misbehave.

I fixed that an error set by one thread could be cleared by another thread,
leaving the SQL thread stopped with no apparent error in SHOW SLAVE STATUS
(this is probably ~~MDEV-5219~~, but I will check that to be sure).

I fixed that the current relay log position for the driver SQL thread was
not correctly updated, which could cause the SQL thread to start reading
events at the wrong position around relay log rotates, seen as various
kinds of corrupt events during event read.

I fixed a race in wait_for_commit that could cause unlock of uninitialised
mutex.

I fixed a missing wait between threads that could cause incorrect parallel
execution of conflicting MyISAM queries (when --log-slave-updates=0), seen
as "Could not execute .... event (Can't find record)" or similar.

Kristian Nielsen added a comment - 2013-11-06 16:23 Ok, this was a great testcase, there were a lot of bugs uncovered. Everytime I fixed one and re-ran the test, another turned up I seem to have fixed all issues seen with this particular test now, at least I was able to run the runall command 10 times or so in a row without any failures. I also think I have fixes for all the problems you mentioned (or at least similar-looking ones): I fixed several things in error handling, where an error stop would be handled incorrectly and cause the server to hang or otherwise misbehave. I fixed that an error set by one thread could be cleared by another thread, leaving the SQL thread stopped with no apparent error in SHOW SLAVE STATUS (this is probably MDEV-5219 , but I will check that to be sure). I fixed that the current relay log position for the driver SQL thread was not correctly updated, which could cause the SQL thread to start reading events at the wrong position around relay log rotates, seen as various kinds of corrupt events during event read. I fixed a race in wait_for_commit that could cause unlock of uninitialised mutex. I fixed a missing wait between threads that could cause incorrect parallel execution of conflicting MyISAM queries (when --log-slave-updates=0), seen as "Could not execute .... event (Can't find record)" or similar.

j added a comment - 2014-11-19 22:52

Was this applied to MariaDB 5.5.x? If not, see here: https://mariadb.atlassian.net/browse/MDEV-7124

j added a comment - 2014-11-19 22:52 Was this applied to MariaDB 5.5.x? If not, see here: https://mariadb.atlassian.net/browse/MDEV-7124

People

Assignee:: Kristian Nielsen

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2013-10-31 09:50

Updated:: 2014-11-19 22:52

Resolved:: 2013-11-06 16:23

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server