[MDEV-5189] Replication failure and subsequent assertion failure on concurrent DML flow with slave-parallel-threads > 1 with gtid_domain_id per thread
Created: 2013-10-25  Updated: 2013-10-28  Resolved: 2013-10-28

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.5

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-4506 MWL#184: Parallel replication of grou... Closed

 Description   

The DML flow consists of simple INSERT / UPDATE / DELETE statements plus BEGIN/COMMIT, executed in several threads, each of which sets its session value of gtid_domain_id to CONNECTION_ID() (hence it is unique for every thread). Replication (row-based) promptly fails, and an assertion failure follows. I assume that the assertion failure is one of the known issues with error handling, but the replication failure itself shouldn't be happening in the first place.

The result is the same whether the slave uses GTID or not. The standard RQG setup does not use GTID; if you want to try with GTID, you can either start the servers separately or apply the RQG patch provided below.

RQG grammar (parallel-replication-2.yy):

query_init:
	SET gtid_domain_id = CONNECTION_ID() ;
 
query:
	transaction |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete |
	insert_replace | update | delete ;
 
set_domain_id:
	SET gtid_domain_id = _digit ;
 
transaction:
	START TRANSACTION |
	COMMIT ;
 
insert_replace:
	INSERT INTO _table (`pk`) VALUES (NULL) ;
 
update:
	UPDATE _table SET _field_no_pk = value where ORDER BY _field_list LIMIT large_digit ;
 
delete:
	DELETE FROM _table where_delete ORDER BY _field_list LIMIT small_digit ;
 
where:
	|
	WHERE _field_key < value | 	
	WHERE _field_key IN ( value , value , value , value , value ) |
	WHERE _field_key BETWEEN small_digit AND large_digit |
	WHERE _field_key BETWEEN _tinyint_unsigned AND _int_unsigned ;
 
where_delete:
	|
	WHERE _field_key = value |
	WHERE _field_key IN ( value , value , value , value , value ) |
	WHERE _field_key BETWEEN small_digit AND large_digit ;
 
large_digit:
	5 | 6 | 7 | 8 ;
 
small_digit:
	1 | 2 | 3 | 4 ;
 
value:
	_digit | _tinyint_unsigned | _varchar(1) | _int_unsigned ;

RQG command line:

perl ./runall-new.pl --grammar=parallel-replication-2.yy --threads=10 --duration=600 --queries=100M --basedir=<your basedir> --engine=InnoDB --vardir=<your location for logs> --rpl_mode=row --mysqld=--slave-parallel-threads=5

RQG patch to use GTID:

=== modified file 'lib/DBServer/MySQL/ReplMySQLd.pm'
--- lib/DBServer/MySQL/ReplMySQLd.pm	2012-06-11 08:23:46 +0000
+++ lib/DBServer/MySQL/ReplMySQLd.pm	2013-10-25 10:24:51 +0000
@@ -192,6 +192,7 @@
                    " MASTER_PORT = ".$self->master->port.",".
                    " MASTER_HOST = '127.0.0.1',".
                    " MASTER_USER = 'root',".
+                   " MASTER_USE_GTID = current_pos,".
                    " MASTER_CONNECT_RETRY = 1");
     
 	$slave_dbh->do("START SLAVE");
 

131025 14:26:32 [ERROR] Slave SQL: Could not execute Update_rows event on table test.AA; Can't find record in 'AA', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000001, end_log_pos 105548, Internal MariaDB error code: 1032
mysqld: /sql/sql_base.cc:5731: bool lock_tables(THD*, TABLE_LIST*, uint, uint): Assertion `thd->lock == 0' failed.
131025 14:26:32 [ERROR] mysqld got signal 6 ;

#4  0x00007fddba5e7425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007fddba5eab8b in __GI_abort () at abort.c:91
#6  0x00007fddba5e00ee in __assert_fail_base (fmt=<optimized out>, assertion=0xd5ac18 "thd->lock == 0", file=0xd59b00 "/sql/sql_base.cc", line=<optimized out>, function=<optimized out>) at assert.c:94
#7  0x00007fddba5e0192 in __GI___assert_fail (assertion=0xd5ac18 "thd->lock == 0", file=0xd59b00 "/sql/sql_base.cc", line=5731, function=0xd5c2a0 "bool lock_tables(THD*, TABLE_LIST*, uint, uint)") at assert.c:103
#8  0x00000000005c16a1 in lock_tables (thd=0x7fdd5c000b00, tables=0x7fdd98ff8680, count=1, flags=0) at /sql/sql_base.cc:5731
#9  0x00000000005c1229 in open_and_lock_tables (thd=0x7fdd5c000b00, tables=0x7fdd98ff8680, derived=false, flags=0, prelocking_strategy=0x7fdd98ff8510) at /sql/sql_base.cc:5572
#10 0x00000000005b46b1 in open_and_lock_tables (thd=0x7fdd5c000b00, tables=0x7fdd98ff8680, derived=false, flags=0) at /sql/sql_base.h:562
#11 0x0000000000783944 in rpl_slave_state::record_gtid (this=0x15155a0, thd=0x7fdd5c000b00, gtid=0x7fdd98ff8cb0, sub_id=10, in_transaction=true, in_statement=false) at /sql/rpl_gtid.cc:342
#12 0x00000000008e0c3a in Xid_log_event::do_apply_event (this=0x7fdd5801c280, rgi=0x7fdd5801abb0) at /sql/log_event.cc:6932
#13 0x0000000000597096 in Log_event::apply_event (this=0x7fdd5801c280, rgi=0x7fdd5801abb0) at /sql/log_event.h:1322
#14 0x000000000058dfff in apply_event_and_update_pos (ev=0x7fdd5801c280, thd=0x7fdd5c000b00, rgi=0x7fdd5801abb0, rpt=0x3d60cc8) at /sql/slave.cc:3102
#15 0x0000000000786888 in rpt_handle_event (qev=0x7fdd5801c370, rpt=0x3d60cc8) at /sql/rpl_parallel.cc:62
#16 0x0000000000786e77 in handle_rpl_parallel_thread (arg=0x3d60cc8) at /sql/rpl_parallel.cc:223
#17 0x00007fddbb3b0e9a in start_thread (arg=0x7fdd98ff9700) at pthread_create.c:308
#18 0x00007fddba6a4cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

revision-id: knielsen@knielsen-hq.org-20131024065348-t37zcjiw9mdta4kd
date: 2013-10-24 08:53:48 +0200
build-date: 2013-10-25 14:25:12 +0400
revno: 3683
branch-nick: 10.0-knielsen



 Comments   
Comment by Kristian Nielsen [ 2013-10-25 ]

I pushed a fix of two separate issues. Now the RQG command line works for me
in the non-GTID case (no errors occur during replication).

The problems were: 1) In non-GTID mode, we should not attempt to do different
domains in parallel; 2) when we do group-committed transactions in parallel,
we did not correctly wait for the previous group of transactions to complete
before starting on the next one.
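
To illustrate the second problem, here is a minimal Python sketch of the intended ordering. It is not the server code: `apply_groups`, `run_demo` and the use of plain threads are assumptions made purely for illustration. Transactions inside one group commit run in parallel, but the next group is not started until every transaction of the previous group has completed.

```python
import threading


def apply_groups(groups):
    """Apply transactions group by group (hypothetical helper).

    `groups` is a list of groups; each group is a list of callables that
    stand in for transactions which were group-committed together.
    """
    for group in groups:
        threads = [threading.Thread(target=txn) for txn in group]
        for t in threads:
            t.start()
        # Wait for the whole group to finish before starting the next one.
        for t in threads:
            t.join()


def run_demo():
    """Record the order in which toy transactions are applied."""
    log = []
    lock = threading.Lock()

    def make_txn(group_no, txn_no):
        def txn():
            with lock:
                log.append((group_no, txn_no))
        return txn

    groups = [[make_txn(g, t) for t in range(3)] for g in range(4)]
    apply_groups(groups)
    return log
```

In the log returned by `run_demo()`, the order of entries within a group may vary, but no entry of a later group appears before an entry of an earlier group.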

This fix does not fix the assertion in case of replication error. I will look
into that next.

Note that I think that in GTID mode, this test is supposed to cause
replication to fail (but not to assert, of course). Because if I understand
correctly, the queries in different domains are not guaranteed to be
independent, so it is not necessarily valid to put them in different domains
and replicate them in parallel (correct me if I am wrong). This is not a
critique of the test case, which clearly was efficient in finding numerous
bugs, just a note of something to be aware of.
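
The point about non-independent domains can be sketched with a toy model. This is an assumption-laden illustration, not MariaDB internals: the table is a plain dict, and `KeyNotFound`, `apply_event` and `replay` are hypothetical names. If an INSERT and a later UPDATE of the same row are placed in different domains, the slave is free to apply them in either order, and applying the UPDATE first fails in the same way as the "Can't find record" error in the log above.

```python
class KeyNotFound(Exception):
    """Stands in for HA_ERR_KEY_NOT_FOUND (error 1032) in this toy model."""


def apply_event(table, event):
    """Apply one row event (kind, pk, value) to a dict-based table."""
    kind, pk, value = event
    if kind == "insert":
        table[pk] = value
    elif kind in ("update", "delete"):
        # Row events on an existing row need to find that row first.
        if pk not in table:
            raise KeyNotFound(f"Can't find record with pk={pk}")
        if kind == "update":
            table[pk] = value
        else:
            del table[pk]
    else:
        raise ValueError(f"unknown event kind: {kind}")


def replay(events):
    """Apply events in the given order to an empty table."""
    table = {}
    for event in events:
        apply_event(table, event)
    return table


# Order on the master: one thread inserts row 1, another then updates it.
master_order = [("insert", 1, "a"), ("update", 1, "b")]
# One order the slave may choose if the two events sit in different domains.
reordered = list(reversed(master_order))
```

Calling `replay(master_order)` succeeds, while `replay(reordered)` raises `KeyNotFound`, because the row does not exist yet when the update is applied.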

Comment by Elena Stepanova [ 2013-10-25 ]

>> the queries in different domains are not guaranteed to be independent, so it is not necessarily valid to put them in different domains and replicate them in parallel (correct me if I am wrong)

You are totally right, my bad. I'll take it into account while creating and running further tests; it should be easy enough to fix, e.g. by making different threads work with different default databases.
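
As a rough way to check that a test keeps its domains independent, one could look for databases touched by more than one domain. The sketch below assumes the workload can be summarized as (domain id, database) pairs; `shared_databases` is a hypothetical helper, not part of RQG or MariaDB.

```python
from collections import defaultdict


def shared_databases(events):
    """Return the databases used by more than one replication domain.

    `events` is an iterable of (domain_id, database) pairs. The result maps
    each shared database name to the set of domain ids that touched it.
    """
    users = defaultdict(set)
    for domain_id, database in events:
        users[database].add(domain_id)
    return {db: doms for db, doms in users.items() if len(doms) > 1}
```

With every thread writing to the same database the result is non-empty, e.g. `shared_databases([(11, "test"), (12, "test")])` reports `test` as shared; with one default database per thread it is empty.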

Comment by Kristian Nielsen [ 2013-10-28 ]

I've now fixed error handling, so that the replication should stop as expected
and no assertion happens.

As mentioned, the fact that an error happens (in GTID mode) is expected due to
conflicts between events in different replication domains.

When run in non-GTID mode, the test succeeds for me.

Thanks again for a great testcase that was very useful to find a number of
important issues.
