[MDEV-15785] Test case galera_as_slave_replication_bundle caused debug assertion Created: 2018-04-05  Updated: 2022-10-28  Resolved: 2022-10-27

Status: Closed
Project: MariaDB Server
Component/s: Galera, Replication, Tests
Affects Version/s: 10.1
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Jan Lindström (Inactive) Assignee: Seppo Jaakola
Resolution: Won't Fix Votes: 0
Labels: Test_disabled, test_disabled_ES

Attachments: File bundle.diff    

 Description   

2018-04-05 10:48:04 140629028526848 [Note] WSREP: ready state reached
2018-04-05 10:48:04 140629028526848 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './mysqld-relay-bin.000001' position: 4
2018-04-05 10:48:04 140629028834048 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
mysqld: /home/jan/mysql/10.1-galera/sql/log_event.cc:6784: virtual int Gtid_log_event::do_apply_event(rpl_group_info*): Assertion `(bits & (1ULL << 25)) == 0' failed.
180405 10:48:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.33-MariaDB-debug
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_threads=153
thread_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63044 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fe6ab06a070
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fe6bf4a48f0 thread_stack 0x48400
mysys/stacktrace.c:267(my_print_stacktrace)[0x559f11653ab0]
sql/signal_handler.cc:168(handle_fatal_signal)[0x559f10fed82b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11f50)[0x7fe6c2a45f50]
linux/raise.c:51(__GI_raise)[0x7fe6bfcb0e7b]
stdlib/abort.c:81(__GI_abort)[0x7fe6bfcb2231]
assert/assert.c:89(__assert_fail_base)[0x7fe6bfca99da]
/lib/x86_64-linux-gnu/libc.so.6(+0x2da52)[0x7fe6bfca9a52]
sql/log_event.cc:6785(Gtid_log_event::do_apply_event(rpl_group_info*))[0x559f110eb487]
sql/log_event.h:1343(Log_event::apply_event(rpl_group_info*))[0x559f10d1afcb]
sql/slave.cc:3479(apply_event_and_update_pos_apply(Log_event*, THD*, rpl_group_info*, int))[0x559f10d10729]
sql/slave.cc:3601(apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*))[0x559f10d10bb2]
sql/slave.cc:3885(exec_relay_log_event(THD*, Relay_log_info*, rpl_group_info*))[0x559f10d113b6]
sql/slave.cc:4981(handle_slave_sql)[0x559f10d145ff]
perfschema/pfs.cc:1863(pfs_spawn_thread)[0x559f11314c3b]
nptl/pthread_create.c:463(start_thread)[0x7fe6c2a3b5aa]
x86_64/clone.S:97(clone)[0x7fe6bfd72cbf]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): 
Connection ID (thread ID): 9
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
----------SERVER LOG END-------------
 
 
 - saving '/home/jan/mysql/10.1-galera/mysql-test/var/log/galera.galera_as_slave_replication_bundle-innodb_plugin/' to '/home/jan/mysql/10.1-galera/mysql-test/var/log/galera.galera_as_slave_replication_bundle-innodb_plugin/'
 
Only  1  of 2 completed.
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 0.000 of 12 seconds executing testcases
 
Failure: Failed 1/1 tests, 0.00% were successful.
 
Failing test(s): galera.galera_as_slave_replication_bundle
 
The log files in var/log may give you some hint of what went wrong.
 
If you want to report this error, please read first the documentation
at http://dev.mysql.com/doc/mysql/en/mysql-test-suite.html
 
mysql-test-run: *** ERROR: there were failing test cases



 Comments   
Comment by Jan Lindström (Inactive) [ 2018-09-13 ]

Elkin, can you have a look? This assertion happens in the middle of replication code that is not familiar to me.

Comment by Andrei Elkin [ 2018-09-13 ]

This looks like it is caused by a preceding transaction skipping the reset of the flag. You can find that transaction's end position in the master binlog
in file rli->group_master_log_name at position rli->group_master_log_pos.

The flag should have been cleared during execution
of Xid_log_event::do_apply_event (or the corresponding Query_log_event method for a COMMIT query). As you have a test that reproduces the issue,
it should be straightforward to locate that miss.
I would do this:

set a gdb breakpoint for this SQL thread before it takes on a preceding Gtid group and watch
how the flag is set, and then how the group of events terminates (if it does). E.g. the absence of an Xid_log_event at its end
would trigger the assert.

If you need any help working through this in gdb, I am always available.
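The suggested gdb session could look roughly like the fragment below; the breakpoint locations follow the symbols visible in the stack trace, but the exact member holding the flag (here assumed to be thd->variables.option_bits, with bit 25 matching the failing assertion) is an assumption about the 10.1 tree:

```gdb
# Attach to the slave mysqld, then stop in the SQL thread where a
# Gtid group begins applying (this is where the flag gets set):
break Gtid_log_event::do_apply_event
# Stop where the flag should be cleared at the end of the group:
break Xid_log_event::do_apply_event
break Query_log_event::do_apply_event
run
# Once inside Gtid_log_event::do_apply_event, watch the option bits
# (bit 25 is the one checked by the failing assertion):
watch thd->variables.option_bits
continue
```

A group that ends without hitting the Xid_log_event/Query_log_event breakpoints is the one leaving the flag set.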

Comment by Jan Lindström (Inactive) [ 2018-10-02 ]

After fixing the original assertion, a new one surfaced from the wsrep InnoDB checkpoint code. The first checkpoint is written as follows (using instrumented code to get the assertion where we want it):

2018-10-02 13:03:33 140472161303296 [Note] WSREP: Xid_log_event: Skipping wsrep commit replicated 1 binlog ./mysqld-relay-bin.000002 pos 871
2018-10-02 13:03:33 140472161303296 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:03:33 140472161303296 [Note] WSREP: Wsrep checkpoint writing for uuid 66f554e0-c62a-11e8-97f5-9262cfb8a273 xid_seqno 2 trx_sys_cur_xid_seqno 1 trx_state 2
2018-10-02 13:03:33 140472161303296 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:03:33 140472161303296 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:03:33 140472161303296 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:03:33 7fc239467b00  InnoDB: Assertion failure in thread 140472161303296 in file trx0sys.cc line 391
InnoDB: Failing assertion: 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
181002 13:03:33 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.37-MariaDB-debug
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_threads=153
thread_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63026 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fc221c70070
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fc2394668f0 thread_stack 0x48400
2018-10-02 13:03:33 140472161917696 [Note] WSREP: wsrep_sync_wait: thd->variables.wsrep_sync_wait = 15, mask = 1
/home/jan/mysql/10.1-galera/sql/mysqld(my_print_stacktrace+0x38)[0x55fd4be68365]
mysys/stacktrace.c:267(my_print_stacktrace)[0x55fd4b803e39]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x128e0)[0x7fc23b6858e0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fc239ad5f3b]
linux/raise.c:51(__GI_raise)[0x7fc239ad72f1]
/home/jan/mysql/10.1-galera/sql/mysqld(+0xbe82e2)[0x55fd4bcdd2e2]
trx/trx0sys.cc:397(trx_sys_update_wsrep_checkpoint(xid_t const*, unsigned char*, mtr_t*, trx_t const*))[0x55fd4bce49a2]
trx/trx0trx.cc:1243(trx_write_serialisation_history(trx_t*, mtr_t*))[0x55fd4bce59c1]
trx/trx0trx.cc:1639(trx_commit_low(trx_t*, mtr_t*))[0x55fd4bce5a86]
trx/trx0trx.cc:1668(trx_commit(trx_t*))[0x55fd4bce6441]
trx/trx0trx.cc:1932(trx_commit_for_mysql(trx_t*))[0x55fd4bb40d0d]
handler/ha_innodb.cc:4704(innobase_commit_low(trx_t*))[0x55fd4bb40feb]
handler/ha_innodb.cc:4846(innobase_commit_ordered_2(trx_t*, THD*))[0x55fd4bb41191]
handler/ha_innodb.cc:4906(innobase_commit_ordered(handlerton*, THD*, bool))[0x55fd4b8ea1b4]
sql/log.cc:8794(TC_LOG::run_commit_ordered(THD*, bool))[0x55fd4b8e8369]
sql/log.cc:8024(MYSQL_BIN_LOG::trx_group_commit_leader(MYSQL_BIN_LOG::group_commit_entry*))[0x55fd4b8e7345]
sql/log.cc:7616(MYSQL_BIN_LOG::write_transaction_to_binlog_events(MYSQL_BIN_LOG::group_commit_entry*))[0x55fd4b8e6a58]
sql/log.cc:7290(MYSQL_BIN_LOG::write_transaction_to_binlog(THD*, binlog_cache_mngr*, Log_event*, bool, bool, bool))[0x55fd4b8d8a29]
sql/log.cc:1751(binlog_flush_cache(THD*, binlog_cache_mngr*, Log_event*, bool, bool, bool))[0x55fd4b8d8ee8]
sql/log.cc:1859(binlog_commit_flush_xid_caches(THD*, binlog_cache_mngr*, bool, unsigned long long))[0x55fd4b8ec2c6]
sql/log.cc:9576(MYSQL_BIN_LOG::log_and_order(THD*, unsigned long long, bool, bool, bool))[0x55fd4b807474]
sql/handler.cc:1463(ha_commit_trans(THD*, bool))[0x55fd4b71093b]
sql/transaction.cc:235(trans_commit(THD*))[0x55fd4b902ea5]
sql/log_event.cc:7675(Xid_log_event::do_apply_event(rpl_group_info*))[0x55fd4b530f3f]
sql/log_event.h:1343(Log_event::apply_event(rpl_group_info*))[0x55fd4b5267d9]
sql/slave.cc:3479(apply_event_and_update_pos_apply(Log_event*, THD*, rpl_group_info*, int))[0x55fd4b526c62]
sql/slave.cc:3601(apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*))[0x55fd4b527466]
sql/slave.cc:3885(exec_relay_log_event(THD*, Relay_log_info*, rpl_group_info*))[0x55fd4b52a6bb]
sql/slave.cc:4981(handle_slave_sql)[0x55fd4bb29ace]
nptl/pthread_create.c:463(start_thread)[0x7fc23b67af2a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fc239b97edf]

This does not yet break any existing rules. However, we then get a second commit for the same xid_seqno (again captured with instrumented code), which does break the current expectation that xid_seqno should be strictly increasing:

2018-10-02 13:14:57 140216945548032 [Note] WSREP: Xid_log_event: Skipping wsrep commit replicated 1 binlog ./mysqld-relay-bin.000002 pos 871
2018-10-02 13:14:57 140216945548032 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:14:57 140216945548032 [Note] WSREP: Wsrep checkpoint writing for uuid fe20fa80-c62b-11e8-b401-2e57249066f2 xid_seqno 2 trx_sys_cur_xid_seqno 1 trx_state 2
2018-10-02 13:14:57 140216945548032 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:14:57 140216945548032 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:14:57 140216945548032 [Note] WSREP: apply_event_and_update_pos() result: 0
2018-10-02 13:14:57 140216945548032 [Note] WSREP: WRONG: wsrep checkpoint writing for uuid fe20fa80-c62b-11e8-b401-2e57249066f2 xid_seqno 2 trx_sys_cur_xid_seqno 2 trx_state 2
2018-10-02 13:14:57 7f86cd3b1b00  InnoDB: Assertion failure in thread 140216945548032 in file trx0sys.cc line 395
InnoDB: Failing assertion: xid_seqno > trx_sys_cur_xid_seqno
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
181002 13:14:57 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.37-MariaDB-debug
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_threads=153
thread_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63026 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f86c1474070
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f86cd3b08f0 thread_stack 0x48400
2018-10-02 13:14:57 140217000942336 [Note] WSREP: wsrep_sync_wait: thd->variables.wsrep_sync_wait = 15, mask = 1
/home/jan/mysql/10.1-galera/sql/mysqld(my_print_stacktrace+0x38)[0x55b9c460c409]
mysys/stacktrace.c:267(my_print_stacktrace)[0x55b9c3fa7e39]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x128e0)[0x7f86da3c68e0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f86d8816f3b]
linux/raise.c:51(__GI_raise)[0x7f86d88182f1]
/home/jan/mysql/10.1-galera/sql/mysqld(+0xbe83a9)[0x55b9c44813a9]
trx/trx0sys.cc:398(trx_sys_update_wsrep_checkpoint(xid_t const*, unsigned char*, mtr_t*, trx_t const*))[0x55b9c4488a45]
trx/trx0trx.cc:1243(trx_write_serialisation_history(trx_t*, mtr_t*))[0x55b9c4489a64]
trx/trx0trx.cc:1639(trx_commit_low(trx_t*, mtr_t*))[0x55b9c4489b29]
trx/trx0trx.cc:1668(trx_commit(trx_t*))[0x55b9c448a4e4]
trx/trx0trx.cc:1932(trx_commit_for_mysql(trx_t*))[0x55b9c42e4d0d]
handler/ha_innodb.cc:4704(innobase_commit_low(trx_t*))[0x55b9c42e4feb]
handler/ha_innodb.cc:4846(innobase_commit_ordered_2(trx_t*, THD*))[0x55b9c42e5191]
handler/ha_innodb.cc:4906(innobase_commit_ordered(handlerton*, THD*, bool))[0x55b9c408e1b4]
sql/log.cc:8794(TC_LOG::run_commit_ordered(THD*, bool))[0x55b9c408c369]
sql/log.cc:8024(MYSQL_BIN_LOG::trx_group_commit_leader(MYSQL_BIN_LOG::group_commit_entry*))[0x55b9c408b345]
sql/log.cc:7616(MYSQL_BIN_LOG::write_transaction_to_binlog_events(MYSQL_BIN_LOG::group_commit_entry*))[0x55b9c408aa58]
sql/log.cc:7290(MYSQL_BIN_LOG::write_transaction_to_binlog(THD*, binlog_cache_mngr*, Log_event*, bool, bool, bool))[0x55b9c407ca29]
sql/log.cc:1751(binlog_flush_cache(THD*, binlog_cache_mngr*, Log_event*, bool, bool, bool))[0x55b9c407cee8]
sql/log.cc:1859(binlog_commit_flush_xid_caches(THD*, binlog_cache_mngr*, bool, unsigned long long))[0x55b9c40902c6]
sql/log.cc:9576(MYSQL_BIN_LOG::log_and_order(THD*, unsigned long long, bool, bool, bool))[0x55b9c3fab474]
sql/handler.cc:1463(ha_commit_trans(THD*, bool))[0x55b9c3eb493b]
sql/transaction.cc:235(trans_commit(THD*))[0x55b9c40a6ea5]
sql/log_event.cc:7675(Xid_log_event::do_apply_event(rpl_group_info*))[0x55b9c3cd4f3f]
sql/log_event.h:1343(Log_event::apply_event(rpl_group_info*))[0x55b9c3cca7d9]
sql/slave.cc:3479(apply_event_and_update_pos_apply(Log_event*, THD*, rpl_group_info*, int))[0x55b9c3ccac62]
sql/slave.cc:3601(apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*))[0x55b9c3ccb466]
sql/slave.cc:3885(exec_relay_log_event(THD*, Relay_log_info*, rpl_group_info*))[0x55b9c3cce6bb]
sql/slave.cc:4981(handle_slave_sql)[0x55b9c42cdace]
nptl/pthread_create.c:463(start_thread)[0x7f86da3bbf2a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f86d88d8edf]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): 
Connection ID (thread ID): 9
Status: NOT_KILLED

The patch for the instrumented code is attached to this ticket. We could simply ignore these repeated xid_seqno writes in InnoDB, but in my opinion that is not the correct way to fix it.

How to repeat:

./mtr galera_as_slave_replication_bundle,xtradb --enable-disabled --mysqld=--wsrep-debug=ON

Comment by Jan Lindström (Inactive) [ 2018-10-02 ]

seppo, I do not know how to proceed from this state. To me it looks like there are two participants in XA group commit on the same node.

Generated at Thu Feb 08 08:24:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.