[MDEV-5217] Assorted failures with parallel replication on concurrent DML/DDL flow and periodic FLUSH LOGS Created: 2013-10-31  Updated: 2014-11-19  Resolved: 2013-11-06

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.6

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-4506 MWL#184: Parallel replication of grou... Closed

 Description   

I've got several types of failures by running the provided test with the same parameters over and over again. Here is the statistics, more details on each type of the failure below:

Out of 10 consequent runs,

Error in Log_event::read_log_event(): 'Event too big' 1
Error in Log_event::read_log_event(): 'read error' 4
Error in Log_event::read_log_event(): 'Event too small' 2
Error reading relay log event: Aborting slave SQL thread because of partial event read 1
Could not execute .... event (Can't find record) 2

In addition to "Could not execute .... event", I also at least once had a subsequent crash with "safe_mutex: Trying to destroy a mutex LOCK_wait_commit that was locked".

Examples of error messages:

131031 11:25:27 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 2818048, event_type: 0
131031 11:25:27 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
131031 11:25:27 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
131031 11:25:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000019' position 287

131031 11:19:17 [ERROR] Slave SQL thread: I/O error reading event(errno: 175  cur_log->error: 12)
131031 11:19:17 [ERROR] Error reading relay log event: Aborting slave SQL thread because of partial event read
131031 11:19:17 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
131031 11:19:17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000048' position 287

131031 10:54:58 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.DD; Can't find record in 'DD', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000033, end_log_pos 688, Internal MariaDB error code: 1032
131031 10:54:58 [ERROR] Slave SQL: Can't find record in 'DD', Internal MariaDB error code: 1032
131031 10:54:58 [Warning] Slave: Can't find record in 'DD' Error_code: 1032
131031 10:54:58 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000033' position 326

131031 10:51:49 [ERROR] Error in Log_event::read_log_event(): 'Event too small', data_len: 0, event_type: 0
131031 10:51:49 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
131031 10:51:49 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
131031 10:51:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000052' position 326

Interestingly, in this case slave didn't stop but later also issued another error:

131031 10:31:22 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1987208563, event_type: 100
131031 10:31:22 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
131031 10:31:22 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
131031 10:31:22 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000022' position 4
131031 10:35:55 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.CC; Can't find record in 'CC', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000081, end_log_pos 1184, Internal MariaDB error code: 1032
131031 10:35:55 [Warning] Slave: Can't find record in 'CC' Error_code: 1032
131031 10:35:55 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000081' position 287
131031 10:35:55 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.000081' at position 326

safe_mutex: Trying to destroy a mutex LOCK_wait_commit that was locked at /home/elenst/bzr/10.0-knielsen/sql/sql_class.cc, line 5805 at /home/elenst/bzr/10.0-knielsen/sql/sql_class.cc, line 5631
131031  3:45:35 [ERROR] mysqld got signal 6 ;

RQG grammar (parallel-replication-6.yy):

query:
	dml_ddl | dml_ddl | dml_ddl | dml_ddl | dml_ddl |
	FLUSH LOGS
;
 
create_drop_table:
	CREATE TABLE IF NOT EXISTS { 't_'.abs($$) } LIKE _table | 
	DROP TABLE IF EXISTS { 't_'.abs($$) } ;
 
dml_ddl:
	delete | insert | create_drop_table ;
 
delete:
	DELETE FROM _table WHERE `col_int_nokey` BETWEEN _tinyint[invariant] AND _tinyint[invariant] + 2 ;
 
insert:
	INSERT INTO _table ( `col_int_key` , `col_int_nokey` )   VALUES ( _tinyint, _tinyint ), ( _tinyint, _tinyint ) ;

RQG command line (works on the standard lp:randgen tree):

perl ./runall-new.pl --grammar=parallel-replication-6.yy  --threads=12 --queries=100M --duration=300  --mysqld=--slave-parallel-threads=8 --rpl_mode=row --basedir=<your basedir> --vardir=<your vardir>

revision-id: knielsen@knielsen-hq.org-20131030065230-kp8dykgyeth6ma55
revno: 3690
branch-nick: 10.0-knielsen



 Comments   
Comment by Elena Stepanova [ 2013-10-31 ]

I'll run the test again with the bugfix for MDEV-5206, maybe it will help here as well.

Upd: it didn't help, I'm still getting the errors on revno 3691.

Upd2: Same happens if I remove FLUSH LOGS from the flow at all, and instead reduce the value of max-binlog-size that that it rotates frequently enough (e.g. set it to 4096).

Comment by Kristian Nielsen [ 2013-11-06 ]

Ok, this was a great testcase, there were a lot of bugs uncovered. Everytime I
fixed one and re-ran the test, another turned up

I seem to have fixed all issues seen with this particular test now, at least I
was able to run the runall command 10 times or so in a row without any
failures. I also think I have fixes for all the problems you mentioned (or at
least similar-looking ones):

  • I fixed several things in error handling, where an error stop would be
    handled incorrectly and cause the server to hang or otherwise misbehave.
  • I fixed that an error set by one thread could be cleared by another thread,
    leaving the SQL thread stopped with no apparent error in SHOW SLAVE STATUS
    (this is probably MDEV-5219, but I will check that to be sure).
  • I fixed that the current relay log position for the driver SQL thread was
    not correctly updated, which could cause the SQL thread to start reading
    events at the wrong position around relay log rotates, seen as various
    kinds of corrupt events during event read.
  • I fixed a race in wait_for_commit that could cause unlock of uninitialised
    mutex.
  • I fixed a missing wait between threads that could cause incorrect parallel
    execution of conflicting MyISAM queries (when --log-slave-updates=0), seen
    as "Could not execute .... event (Can't find record)" or similar.
Comment by j [ 2014-11-19 ]

Was this applied to MariaDB 5.5.x? If not, see here: https://mariadb.atlassian.net/browse/MDEV-7124

Generated at Thu Feb 08 07:02:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.