[MDEV-21107] Assertion `!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id' failed in finish_event_group Created: 2019-11-21  Updated: 2023-09-28

Status: Confirmed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.4, 10.5, 10.6

Type: Bug Priority: Major
Reporter: Alice Sherepa Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: affects-tests

Attachments: File dt_slave.7z     File mysql-bin.7z    
Issue Links:
Relates
relates to MDEV-8125 Assertion `!tmp_gco->next_gco || tmp_... Closed

 Description   

10.4 f9ceb0a67ffb20631c936a7e

#3  <signal handler called>
#4  0x00007f4c6bdb6428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#5  0x00007f4c6bdb802a in __GI_abort () at abort.c:89
#6  0x00007f4c6bdaebd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55f4a2f009d8 "!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id", file=file@entry=0x55f4a2f008d8 "/10.4/sql/rpl_parallel.cc", line=line@entry=228, function=function@entry=0x55f4a2f014e0 <finish_event_group(rpl_parallel_thread*, unsigned long long, rpl_parallel_entry*, rpl_group_info*)::__PRETTY_FUNCTION__> "void finish_event_group(rpl_parallel_thread*, uint64, rpl_parallel_entry*, rpl_group_info*)") at assert.c:92
#7  0x00007f4c6bdaec82 in __GI___assert_fail (assertion=0x55f4a2f009d8 "!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id", file=0x55f4a2f008d8 "/10.4/sql/rpl_parallel.cc", line=228, function=0x55f4a2f014e0 <finish_event_group(rpl_parallel_thread*, unsigned long long, rpl_parallel_entry*, rpl_group_info*)::__PRETTY_FUNCTION__> "void finish_event_group(rpl_parallel_thread*, uint64, rpl_parallel_entry*, rpl_group_info*)") at assert.c:101
#8  0x000055f4a2399aa1 in finish_event_group (rpt=0x7f4bf80098d8, sub_id=26, entry=0x7f4bf801cbe0, rgi=0x7f4bf8035500) at /10.4/sql/rpl_parallel.cc:228
#9  0x000055f4a239cbf7 in handle_rpl_parallel_thread (arg=0x7f4bf80098d8) at /10.4/sql/rpl_parallel.cc:1379
#10 0x00007f4c6d2a56ba in start_thread (arg=0x7f4c555f8700) at pthread_create.c:333
#11 0x00007f4c6be8841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

reproducible with optimistic/aggressive slave_parallel_mode and >=3 slave_parallel_threads
with following rqg grammar:

query_init:
	CREATE TABLE t (i1 int, i2 int, pk int) ;  
query:
	a | b;
a:
CREATE TABLE t3 LIKE t ; ALTER TABLE t3 ENGINE = MERGE UNION (t1,t2);  COMMIT ;
b:
DELETE FROM mysql.tables_priv WHERE user = 'aaa' ; FLUSH PRIVILEGES ;

 perl ./runall-new.pl  --no-mask --seed=time --duration=300 --queries=100M  --mysqld=--log_bin=mysql-bin  --mysqld=--slave_parallel_mode=\"optimistic\"  --rpl_mode=mixed --grammar=conf/mariadb/1.yy --skip-gendata --threads=1 --mysqld=--slave_parallel_threads=3 --vardir= --basedir=/git/10.4

there is a bug MDEV-21149 about Engine=MERGE (maybe it is a reason of this one)



 Comments   
Comment by Alice Sherepa [ 2022-01-12 ]

to reproduce with mtr - please replay binlog with:

./mtr 1_my.test --mysqld=--log-bin=mysql-bin  --mysqld=--default-storage-engine=innodb --mysqld=--innodb  --mysqld=--partition --mysqld=--slave-skip-errors=all --mysqld=--slave_parallel_threads=5

Comment by Roel Van de Paar [ 2023-07-19 ]

I ran into this bug also in bb-10.4-MDEV-30619:

bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4

2023-07-19 17:20:19 11 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './relaylog.000001' position: 4; GTID position ''
mysqld: /test/bb-10.4-MDEV-30619_190723_0825_dbg/sql/rpl_parallel.cc:228: void finish_event_group(rpl_parallel_thread*, uint64, rpl_parallel_entry*, rpl_group_info*): Assertion `!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id' failed.

bb-10.4-MDEV-30619 10.4.31 07745e9e6a00ff6089c68f48d453dfa368f064f4 (Debug)

Core was generated by `/test/MDEV-30619_postfix_MD190723-mariadb-10.4.31-linux-x86_64-dbg/bin/mysqld -'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22540725028416)
    at ./nptl/pthread_kill.c:44
[Current thread is 1 (Thread 0x14802be88640 (LWP 3811263))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22540725028416) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=22540725028416) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=22540725028416, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00001480f3442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00001480f34287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00001480f342871b in __assert_fail_base (fmt=0x1480f35dd150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55d78ba47f18 "!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id", file=0x55d78ba47738 "/test/bb-10.4-MDEV-30619_190723_0825_dbg/sql/rpl_parallel.cc", line=228, function=<optimized out>) at ./assert/assert.c:92
#6  0x00001480f3439e96 in __GI___assert_fail (assertion=0x55d78ba47f18 "!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id", file=0x55d78ba47738 "/test/bb-10.4-MDEV-30619_190723_0825_dbg/sql/rpl_parallel.cc", line=228, function=0x55d78ba47eb8 "void finish_event_group(rpl_parallel_thread*, uint64, rpl_parallel_entry*, rpl_group_info*)") at ./assert/assert.c:101
#7  0x000055d78b0475b4 in finish_event_group (rpt=rpt@entry=0x14807c010ba0, sub_id=sub_id@entry=129, entry=entry@entry=0x14807c024c98, rgi=rgi@entry=0x14807c0606f0) at /test/bb-10.4-MDEV-30619_190723_0825_dbg/sql/rpl_parallel.cc:228
#8  0x000055d78b049711 in handle_rpl_parallel_thread (arg=0x14807c010ba0) at /test/bb-10.4-MDEV-30619_190723_0825_dbg/sql/rpl_parallel.cc:1420
#9  0x00001480f3494b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#10 0x00001480f3526a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Comment by Roel Van de Paar [ 2023-09-28 ]

I am seeing this bug regularly during replication testing runs. Would be great to have a fix if possible. Thank you

10.6.16 d13a57ae8181f2a8fbee86838d5476740e050d50 (Debug)

Core was generated by `/test/BASE_MD260923-mariadb-10.6.16-linux-x86_64-dbg/bin/mariadbd --no-defaults'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22894726837824)
    at ./nptl/pthread_kill.c:44
[Current thread is 1 (Thread 0x14d2980f7640 (LWP 2397370))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22894726837824) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=22894726837824) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=22894726837824, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000014d2ce442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000014d2ce4287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000014d2ce42871b in __assert_fail_base (fmt=0x14d2ce5dd150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55bc28d9a6d8 "!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id", file=0x55bc28d99db8 "/test/10.6_dbg/sql/rpl_parallel.cc", line=262, function=<optimized out>) at ./assert/assert.c:92
#6  0x000014d2ce439e96 in __GI___assert_fail (assertion=0x55bc28d9a6d8 "!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id", file=0x55bc28d99db8 "/test/10.6_dbg/sql/rpl_parallel.cc", line=262, function=0x55bc28d9a678 "void finish_event_group(rpl_parallel_thread*, uint64, rpl_parallel_entry*, rpl_group_info*)") at ./assert/assert.c:101
#7  0x000055bc2834dba2 in finish_event_group (rpt=rpt@entry=0x14d24c026008, sub_id=sub_id@entry=640, entry=entry@entry=0x14d24c063938, rgi=rgi@entry=0x14d24c0f47f0) at /test/10.6_dbg/sql/rpl_parallel.cc:262
#8  0x000055bc28350899 in handle_rpl_parallel_thread (arg=arg@entry=0x14d24c026008) at /test/10.6_dbg/sql/rpl_parallel.cc:1556
#9  0x000055bc28716c3e in pfs_spawn_thread (arg=0x14d24c037f88) at /test/10.6_dbg/storage/perfschema/pfs.cc:2201
#10 0x000014d2ce494b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#11 0x000014d2ce526a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The issue is highly sporadic, i.e. lot of iterations required to reproduce it.
I saw this for various reduced testcases also. The issue seems less related to the SQL used and more to the handling of parallel threads.

Note to self: kept 4 occurrences in /data/KEEP/MDEV-21107

Generated at Thu Feb 08 09:04:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.