CentOS 7.9.2009
mysqld would have been started with the following arguments:
--datadir=/data/mariadb --socket=/var/lib/mysql/mysql.sock --user=mysql --symbolic-links=0 --max_allowed_packet=16M --thread_cache_size=8 --max_connections=1500 --slow_query_log=1 --log_error=/data/mariadb/mariadb.log --innodb_buffer_pool_size=49152M --innodb_log_file_size=2048M --innodb_log_buffer_size=16M --innodb_print_all_deadlocks=on --log-warnings=2 --plugin_load_add=auth_socket --default_storage_engine=InnoDB --innodb_file_per_table=1 --innodb_flush_log_at_trx_commit=0 --innodb_doublewrite=1 --log_slave_updates=1 --log_bin=bin-log --server_id=6666 --binlog_format=ROW --innodb_autoinc_lock_mode=2 --expire_logs_days=1 --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep_cluster_address=gcomm://[redacted] --wsrep_node_address=[redacted] --wsrep_sst_method=mariabackup --wsrep_sst_auth=[redacted] --wsrep_provider_options="evs.keepalive_period = PT1S; evs.inactive_check_period = PT1S; evs.suspect_timeout = PT5S; evs.inactive_timeout = PT15S; evs.install_timeout = PT15S; gcache.size=2G" --wsrep_on=ON --wsrep_log_conflicts=1
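For reference, here is a minimal sketch of how the wsrep-related part of these arguments could be expressed in a server option file. This is an illustration, not the reporter's actual configuration: the file path and the selection of options are assumptions, and the values are copied from the command line above.
{noformat}
# Hypothetical option-file equivalent of the wsrep settings above
# (the path /etc/my.cnf.d/galera.cnf is an assumption).
cat >> /etc/my.cnf.d/galera.cnf <<'EOF'
[mysqld]
wsrep_on               = ON
wsrep_provider         = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address  = gcomm://[redacted]
wsrep_node_address     = [redacted]
wsrep_sst_method       = mariabackup
wsrep_provider_options = "evs.keepalive_period=PT1S; evs.inactive_check_period=PT1S; evs.suspect_timeout=PT5S; evs.inactive_timeout=PT15S; evs.install_timeout=PT15S; gcache.size=2G"
wsrep_log_conflicts    = 1
EOF
{noformat}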
Description
About 29 hours after updating a previously very stable 3-server MariaDB Galera cluster from 10.3.27 to 10.3.28, one of the nodes crashed with the following message:
2021-03-10 18:22:48 0 [ERROR] WSREP: invalid state ROLLED_BACK (FATAL)
at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:abort_trx():735
2021-03-10 18:22:48 0 [ERROR] WSREP: cancel commit bad exit: 7 33792346039
210310 18:22:48 [ERROR] mysqld got signal 6 ;
Attached is a bit more log (coincidentally, conflict logging had been enabled), but I don't think there's much more I can provide right now. I'm still filing this because the cluster had been running very stably for months before the update to 10.3.28, and this could somehow be related to MDEV-25111, which we also encountered for the first time right after updating to 10.3.28.
Ere Maijala added a comment:
This wasn't an isolated occurrence. Another server crashed a couple of hours later:
2021-03-10 21:01:10 0 [ERROR] WSREP: invalid state ROLLED_BACK (FATAL)
at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:abort_trx():735
2021-03-10 21:01:10 0 [ERROR] WSREP: cancel commit bad exit: 7 34182537224
210310 21:01:10 [ERROR] mysqld got signal 6 ;
Martin Vit added a comment (edited):
Same issue here:
2021-03-23 11:11:32 0 [ERROR] WSREP: invalid state ROLLED_BACK (FATAL)
at galera/src/replicator_smm.cpp:abort_trx():727
2021-03-23 11:11:32 0 [ERROR] WSREP: cancel commit bad exit: 7 21859814393
210323 11:11:32 [ERROR] mysqld got signal 6 ;
It happened when doing an ALTER TABLE that adds partitions to a table (with intensive INSERTs).
Stefan Midjich added a comment (edited):
I just did an upgrade from 10.2 to 10.3 and by chance hit release 10.3.28, and the nodes immediately started crashing. It's a 3-node Galera cluster; in total the nodes have crashed over 10 times in the 15 hours since the upgrade finished.
I'm afraid it's urgent for us to downgrade to 10.3.27, so I have no more diagnostic info to give you.
OS: CentOS 7
The log is the same as previous reporters'.
As far as I know we do not have any ALTER TABLE use.
Very heavy write use and a very large DB (330 GB).
I can reproduce a similar crash with the mtr test suite; the smoking gun seems to be the new bg_wsrep_kill_trx() background killer execution. The debug log messages before the assert give:
Seppo Jaakola added a comment (edited):
This is probably a regression of MDEV-23328, which brought in the background killer thread. The Galera suite mtr tests can reproduce this issue, e.g. galera.MW-369 (a reproduction sketch follows the stack trace below).
Stack trace of the crashing thread:
#0  __pthread_kill (threadid=<optimized out>, signo=6)
    at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x000055b6444b5731 in my_write_core (sig=6)
    at /home/seppo/work/wsrep/mariadb-server/mysys/stacktrace.c:386
#2  0x000055b643c1e72f in handle_fatal_signal (sig=6)
    at /home/seppo/work/wsrep/mariadb-server/sql/signal_handler.cc:355
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007f7e5fc8a859 in __GI_abort () at abort.c:79
#6  0x000055b643e7833e in bg_wsrep_kill_trx (void_arg=0x7f7e3c0c9360)
    at /home/seppo/work/wsrep/mariadb-server/storage/innobase/handler/ha_innodb.cc:18978
#7  0x000055b6438ec1c4 in handle_manager (arg=0x0)
    at /home/seppo/work/wsrep/mariadb-server/sql/sql_manager.cc:112
#8  0x000055b644443337 in pfs_spawn_thread (arg=0x55b6479e4b20)
    at /home/seppo/work/wsrep/mariadb-server/storage/perfschema/pfs.cc:1869
#9  0x00007f7e5fe62609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f7e5fd87293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
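As a reproduction aid (a sketch, not part of the original comment), this is roughly how the Galera mtr test named above could be run from a built MariaDB source tree; the provider path and the --repeat count are assumptions.
{noformat}
# Run the galera.MW-369 mtr test from a built source tree (sketch; paths are assumptions).
cd mysql-test
WSREP_PROVIDER=/usr/lib64/galera/libgalera_smm.so \
  ./mysql-test-run.pl --suite=galera galera.MW-369 --repeat=10
{noformat}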
Ralf Gebhardt added a comment (edited):
serg, this task has been discussed in a call with seppo. This seems to be a regression related to the changes done for "kill query", and it is hard for Galera to deal with a KILL QUERY as such. So the question came up whether KILL QUERY can be supported at all for Galera. What is your opinion: should we disallow KILL QUERY for Galera?
Sergei Golubchik added a comment:
How can KILL be disallowed for Galera? Simply if a server is part of a cluster, then KILL is not allowed? That seems a bit too drastic, don't you think?
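For context, this is the kind of client-side operation being discussed; a minimal sketch with connection options omitted and placeholder thread ids, not anything taken from the comments above.
{noformat}
# Identify a running statement, then kill either the statement or the whole connection.
mysql -e "SHOW FULL PROCESSLIST;"
mysql -e "KILL QUERY 12345;"       # placeholder thread id; aborts only the running statement
mysql -e "KILL CONNECTION 12345;"  # placeholder thread id; terminates the whole session
{noformat}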
Jan Lindström (Inactive) added a comment:
Unfortunately, PR https://github.com/MariaDB/server/pull/1843 is not correct based on pquery QA executed by ramesh. There are at least two bugs found in this version.
Bug 1:
(rr) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000145a0820e859 in __GI_abort () at abort.c:79
#2 0x0000145a0820e729 in __assert_fail_base (fmt=0x145a083a4588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=0x55c620df5f79 "!is_owned()",
file=0x55c620df5f88 "/test/mtest/10.3_dbg/storage/innobase/include/sync0policy.ic", line=64,
function=<optimized out>) at assert.c:92
#3 0x0000145a0821ff36 in __GI___assert_fail (assertion=assertion@entry=0x55c620df5f79 "!is_owned()",
file=file@entry=0x55c620df5f88 "/test/mtest/10.3_dbg/storage/innobase/include/sync0policy.ic",
line=line@entry=64,
function=function@entry=0x55c620df5fc8 "void MutexDebug<Mutex>::enter(const Mutex*, const char*, unsigned int) [with Mutex = TTASEventMutex<GenericPolicy>]") at assert.c:101
#4 0x000055c6205aa5c1 in MutexDebug<TTASEventMutex<GenericPolicy> >::enter (this=0x55c6214cf510 <lock_sys+80>,
mutex=0x55c6214cf500 <lock_sys+64>,
name=name@entry=0x55c620de7c58 "/test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc",
line=line@entry=5186) at /test/mtest/10.3_dbg/storage/innobase/include/sync0types.h:979
#5 0x000055c620599e94 in GenericPolicy<TTASEventMutex<GenericPolicy> >::enter (line=5186,
filename=0x55c620de7c58 "/test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc", mutex=...,
this=<optimized out>) at /test/mtest/10.3_dbg/storage/innobase/include/sync0policy.h:339
#6 PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (line=5186,
name=0x55c620de7c58 "/test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc", n_delay=4, n_spins=30,
this=<optimized out>) at /test/mtest/10.3_dbg/storage/innobase/include/ib0mutex.h:590
#7 innobase_kill_query (thd=<optimized out>) at /test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc:5186
#8 0x000055c620358b27 in kill_handlerton (thd=thd@entry=0x7d1080001fe0, plugin=<optimized out>,
level=level@entry=0x2a1c4ce28d8c) at /test/mtest/10.3_dbg/sql/handler.cc:841
#9 0x000055c6200dc552 in plugin_foreach_with_mask (thd=thd@entry=0x7d1080001fe0,
func=func@entry=0x55c620358aae <kill_handlerton(THD*, plugin_ref, void*)>, type=type@entry=1,
state_mask=state_mask@entry=8, arg=arg@entry=0x2a1c4ce28d8c) at /test/mtest/10.3_dbg/sql/sql_plugin.cc:2435
#10 0x000055c620359f34 in ha_kill_query (thd=thd@entry=0x7d1080001fe0, level=<optimized out>)
at /test/mtest/10.3_dbg/sql/handler.cc:849
#11 0x000055c62006d27e in THD::awake_no_mutex (this=this@entry=0x7d1080001fe0,
state_to_set=state_to_set@entry=KILL_QUERY) at /test/mtest/10.3_dbg/include/mysql/psi/mysql_thread.h:717
#12 0x000055c6202a3022 in THD::awake (state_to_set=KILL_QUERY, this=0x7d1080001fe0)
at /test/mtest/10.3_dbg/sql/sql_class.h:3332
#13 wsrep_thd_awake (thd=thd@entry=0x7d1080001fe0, signal=signal@entry=1 '\001', kill_signal=kill_signal@entry=4)
at /test/mtest/10.3_dbg/sql/wsrep_mysqld.cc:2750
#14 0x000055c620596fd7 in wsrep_innobase_kill_one_trx (bf_thd=bf_thd@entry=0x145a18001250,
bf_trx=bf_trx@entry=0x49602d628380, victim_trx=victim_trx@entry=0x49602d60d0d0, signal=signal@entry=1,
kill_signal=kill_signal@entry=4) at /test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc:19071
#15 0x000055c62059cea0 in wsrep_abort_transaction (bf_thd=0x145a18001250, victim_thd=<optimized out>,
signal=1 '\001', kill_signal=4) at /test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc:19170
#16 0x000055c62036956d in ha_abort_transaction (bf_thd=bf_thd@entry=0x145a18001250,
victim_thd=victim_thd@entry=0x7d1080001fe0, signal=signal@entry=1 '\001', kill_signal=kill_signal@entry=4)
at /test/mtest/10.3_dbg/sql/handler.cc:6769
#17 0x000055c6202b0945 in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x145a18001250,
victim_thd_ptr=victim_thd_ptr@entry=0x7d1080001fe0, signal=signal@entry=1 '\001',
kill_signal=kill_signal@entry=KILL_QUERY) at /test/mtest/10.3_dbg/sql/wsrep_thd.cc:828
#18 0x000055c6202a0ccf in wsrep_grant_mdl_exception (requestor_ctx=requestor_ctx@entry=0x145a18001370,
ticket=ticket@entry=0x7d10800083d0, key=key@entry=0x7d1080068630)
at /test/mtest/10.3_dbg/sql/wsrep_mysqld.cc:2049
#19 0x000055c620206939 in MDL_lock::can_grant_lock (this=this@entry=0x7d1080068630, type_arg=MDL_EXCLUSIVE,
requestor_ctx=requestor_ctx@entry=0x145a18001370, ignore_lock_priority=ignore_lock_priority@entry=false)
at /test/mtest/10.3_dbg/sql/mdl.cc:1594
#20 0x000055c620207f66 in MDL_context::try_acquire_lock_impl (this=this@entry=0x145a18001370,
mdl_request=mdl_request@entry=0x145a18012188, out_ticket=out_ticket@entry=0x2a1c4ce292d8)
at /test/mtest/10.3_dbg/sql/mdl.cc:1925
#21 0x000055c620208179 in MDL_context::acquire_lock (this=this@entry=0x145a18001370, mdl_request=0x145a18012188,
lock_wait_timeout=lock_wait_timeout@entry=0) at /test/mtest/10.3_dbg/sql/mdl.cc:2069
#22 0x000055c620208ce8 in MDL_context::acquire_locks (this=this@entry=0x145a18001370,
mdl_requests=mdl_requests@entry=0x2a1c4ce29430, lock_wait_timeout=0) at /test/mtest/10.3_dbg/sql/mdl.cc:2268
#23 0x000055c62004a15b in lock_table_names (thd=thd@entry=0x145a18001250, options=...,
tables_start=0x145a18011d20, tables_end=<optimized out>, lock_wait_timeout=<optimized out>,
flags=flags@entry=0) at /test/mtest/10.3_dbg/sql/sql_base.cc:3923
#24 0x000055c62004cd3e in open_tables (thd=thd@entry=0x145a18001250, options=...,
start=start@entry=0x2a1c4ce297c8, counter=counter@entry=0x2a1c4ce297dc, flags=flags@entry=0,
prelocking_strategy=prelocking_strategy@entry=0x2a1c4ce29878) at /test/mtest/10.3_dbg/sql/sql_lex.h:1585
#25 0x000055c62004e279 in open_and_lock_tables (thd=thd@entry=0x145a18001250, options=..., tables=<optimized out>,
tables@entry=0x145a18011d20, derived=derived@entry=false, flags=flags@entry=0,
prelocking_strategy=prelocking_strategy@entry=0x2a1c4ce29878) at /test/mtest/10.3_dbg/sql/sql_base.cc:5109
#26 0x000055c6201886cc in open_and_lock_tables (flags=0, derived=false, tables=0x145a18011d20, options=...,
thd=0x145a18001250) at /test/mtest/10.3_dbg/sql/sql_base.h:495
#27 mysql_create_table (thd=thd@entry=0x145a18001250, create_table=create_table@entry=0x145a18011d20,
create_info=create_info@entry=0x2a1c4ce29a00, alter_info=alter_info@entry=0x2a1c4ce29940)
at /test/mtest/10.3_dbg/sql/sql_table.cc:5282
#28 0x000055c62018a0c5 in Sql_cmd_create_table_like::execute (this=0x145a18011d00, thd=0x145a18001250)
at /test/mtest/10.3_dbg/sql/sql_table.cc:11378
#29 0x000055c6200ce17c in mysql_execute_command (thd=thd@entry=0x145a18001250)
at /test/mtest/10.3_dbg/sql/sql_parse.cc:6075
#30 0x000055c6200d017f in mysql_parse (thd=thd@entry=0x145a18001250,
rawbuf=rawbuf@entry=0x145a18011ae8 "CREATE TABLE t1(c1 INTEGER NULL, c2 BINARY(25) NOT NULL, c3 SMALLINT(4) NULL, c4 BINARY(15) NOT NULL PRIMARY KEY, c5 INTEGER NOT NULL UNIQUE KEY,c6 FIXED(10,8) NOT NULL DEFAULT 3.141592)",
length=length@entry=186, parser_state=parser_state@entry=0x2a1c4ce2b530,
is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false)
at /test/mtest/10.3_dbg/sql/sql_parse.cc:7870
#31 0x000055c6200d183e in wsrep_mysql_parse (thd=thd@entry=0x145a18001250,
rawbuf=0x145a18011ae8 "CREATE TABLE t1(c1 INTEGER NULL, c2 BINARY(25) NOT NULL, c3 SMALLINT(4) NULL, c4 BINARY(15) NOT NULL PRIMARY KEY, c5 INTEGER NOT NULL UNIQUE KEY,c6 FIXED(10,8) NOT NULL DEFAULT 3.141592)", length=186,
parser_state=parser_state@entry=0x2a1c4ce2b530, is_com_multi=is_com_multi@entry=false,
is_next_command=is_next_command@entry=false) at /test/mtest/10.3_dbg/sql/sql_parse.cc:7662
#32 0x000055c6200d2959 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x145a18001250,
packet=packet@entry=0x145a18009301 "CREATE TABLE t1(c1 INTEGER NULL, c2 BINARY(25) NOT NULL, c3 SMALLINT(4) NULL, c4 BINARY(15) NOT NULL PRIMARY KEY, c5 INTEGER NOT NULL UNIQUE KEY,c6 FIXED(10,8) NOT NULL DEFAULT 3.141592);",
packet_length=packet_length@entry=187, is_com_multi=is_com_multi@entry=false,
is_next_command=is_next_command@entry=false) at /test/mtest/10.3_dbg/sql/sql_class.h:1152
#33 0x000055c6200d5ace in do_command (thd=0x145a18001250) at /test/mtest/10.3_dbg/sql/sql_parse.cc:1398
#34 0x000055c6201fdd54 in do_handle_one_connection (connect=connect@entry=0x55c6231056e0)
at /test/mtest/10.3_dbg/sql/sql_connect.cc:1403
#35 0x000055c6201fdf8b in handle_one_connection (arg=0x55c6231056e0)
at /test/mtest/10.3_dbg/sql/sql_connect.cc:1308
#36 0x000055c621e06609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#37 0x0000145a0830b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Bug 2:
(rr) p file
$2 = 0x562c3f102d68 "/test/mtest/10.3_dbg/sql/wsrep_mysqld.cc"
(rr) p line
$3 = 2664
(rr) p mp->file
$4 = 0x562c3f0aea20 "/test/mtest/10.3_dbg/sql/sql_class.cc"
(rr) p mp->line
$5 = 1980
(rr) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00001fd34b9dd859 in __GI_abort () at abort.c:79
#2 0x0000562c3f0158be in safe_mutex_lock (mp=mp@entry=0x1d60100028b8, my_flags=my_flags@entry=0,
file=file@entry=0x562c3f102d68 "/test/mtest/10.3_dbg/sql/wsrep_mysqld.cc", line=line@entry=2664)
at /test/mtest/10.3_dbg/mysys/thr_mutex.c:265
#3 0x0000562c3e70c51f in inline_mysql_mutex_lock (src_line=2664,
src_file=0x562c3f102d68 "/test/mtest/10.3_dbg/sql/wsrep_mysqld.cc", that=0x1d60100028b8)
at /test/mtest/10.3_dbg/include/mysql/psi/mysql_thread.h:717
#4 wsrep_thd_LOCK (thd=thd@entry=0x1d6010000d90) at /test/mtest/10.3_dbg/sql/wsrep_mysqld.cc:2664
#5 0x0000562c3ea029c1 in wsrep_innobase_kill_one_trx (bf_thd=bf_thd@entry=0x1455c8000d90,
bf_trx=bf_trx@entry=0x30e64a067468, victim_trx=victim_trx@entry=0x30e64a064240, signal=signal@entry=0,
kill_signal=kill_signal@entry=4) at /test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc:18937
#6 0x0000562c3ea08ea0 in wsrep_abort_transaction (bf_thd=0x1455c8000d90, victim_thd=<optimized out>,
signal=0 '\000', kill_signal=4) at /test/mtest/10.3_dbg/storage/innobase/handler/ha_innodb.cc:19170
#7 0x0000562c3e7d556d in ha_abort_transaction (bf_thd=bf_thd@entry=0x1455c8000d90,
victim_thd=victim_thd@entry=0x1d6010000d90, signal=signal@entry=0 '\000', kill_signal=kill_signal@entry=4)
at /test/mtest/10.3_dbg/sql/handler.cc:6769
#8 0x0000562c3e71c945 in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x1455c8000d90,
victim_thd_ptr=victim_thd_ptr@entry=0x1d6010000d90, signal=signal@entry=0 '\000',
kill_signal=kill_signal@entry=KILL_QUERY) at /test/mtest/10.3_dbg/sql/wsrep_thd.cc:828
#9 0x0000562c3e4ce83c in THD::notify_shared_lock (this=0x1455c8000d90, ctx_in_use=<optimized out>,
needs_thr_lock_abort=<optimized out>) at /test/mtest/10.3_dbg/sql/sql_class.cc:2002
#10 0x0000562c3e674325 in MDL_lock::notify_conflicting_locks (ctx=0x1455c8000eb0, this=0xde604090870)
at /test/mtest/10.3_dbg/sql/mdl.h:848
#11 MDL_context::acquire_lock (this=this@entry=0x1455c8000eb0, mdl_request=0x1455c8011c18,
lock_wait_timeout=lock_wait_timeout@entry=86400) at /test/mtest/10.3_dbg/sql/mdl.cc:2114
#12 0x0000562c3e674ce8 in MDL_context::acquire_locks (this=this@entry=0x1455c8000eb0,
mdl_requests=mdl_requests@entry=0x77463330c5b0, lock_wait_timeout=86400)
at /test/mtest/10.3_dbg/sql/mdl.cc:2268
#13 0x0000562c3e4b615b in lock_table_names (thd=thd@entry=0x1455c8000d90, options=...,
tables_start=0x1455c80117b0, tables_end=<optimized out>, lock_wait_timeout=<optimized out>,
flags=flags@entry=0) at /test/mtest/10.3_dbg/sql/sql_base.cc:3923
#14 0x0000562c3e4b8d3e in open_tables (thd=thd@entry=0x1455c8000d90, options=...,
start=start@entry=0x77463330c948, counter=counter@entry=0x77463330c95c, flags=flags@entry=0,
prelocking_strategy=prelocking_strategy@entry=0x77463330ca90) at /test/mtest/10.3_dbg/sql/sql_lex.h:1585
#15 0x0000562c3e4ba279 in open_and_lock_tables (thd=thd@entry=0x1455c8000d90, options=..., tables=<optimized out>,
tables@entry=0x1455c80117b0, derived=derived@entry=false, flags=flags@entry=0,
prelocking_strategy=prelocking_strategy@entry=0x77463330ca90) at /test/mtest/10.3_dbg/sql/sql_base.cc:5109
#16 0x0000562c3e4ba418 in open_and_lock_tables (prelocking_strategy=0x77463330ca90, flags=0, derived=false,
tables=0x1455c80117b0, thd=0x1455c8000d90) at /test/mtest/10.3_dbg/sql/sql_base.h:263
#17 open_n_lock_single_table (thd=thd@entry=0x1455c8000d90, table_l=table_l@entry=0x1455c80117b0,
lock_type=lock_type@entry=TL_READ_NO_INSERT, flags=flags@entry=0,
prelocking_strategy=prelocking_strategy@entry=0x77463330ca90) at /test/mtest/10.3_dbg/sql/sql_base.cc:4952
#18 0x0000562c3e602e23 in open_n_lock_single_table (flags=0, lock_type=TL_READ_NO_INSERT, table_l=0x1455c80117b0,
thd=0x1455c8000d90) at /test/mtest/10.3_dbg/sql/sql_base.h:482
#19 mysql_create_or_drop_trigger (thd=thd@entry=0x1455c8000d90, tables=<optimized out>, create=create@entry=true)
at /test/mtest/10.3_dbg/sql/sql_trigger.cc:558
#20 0x0000562c3e539b22 in mysql_execute_command (thd=thd@entry=0x1455c8000d90)
at /test/mtest/10.3_dbg/sql/sql_parse.cc:5913
#21 0x0000562c3e53c17f in mysql_parse (thd=thd@entry=0x1455c8000d90,
rawbuf=rawbuf@entry=0x1455c8011628 "CREATE TRIGGER t3_trg_err_1 BEFORE INSERT ON t1 FOR EACH ROW SET @_var = NEW.col", length=length@entry=80, parser_state=parser_state@entry=0x77463330e530,
is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false)
at /test/mtest/10.3_dbg/sql/sql_parse.cc:7870
#22 0x0000562c3e53d83e in wsrep_mysql_parse (thd=thd@entry=0x1455c8000d90,
rawbuf=0x1455c8011628 "CREATE TRIGGER t3_trg_err_1 BEFORE INSERT ON t1 FOR EACH ROW SET @_var = NEW.col",
length=80, parser_state=parser_state@entry=0x77463330e530, is_com_multi=is_com_multi@entry=false,
is_next_command=is_next_command@entry=false) at /test/mtest/10.3_dbg/sql/sql_parse.cc:7662
#23 0x0000562c3e53e959 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x1455c8000d90,
packet=packet@entry=0x1455c8008e41 "CREATE TRIGGER t3_trg_err_1 BEFORE INSERT ON t1 FOR EACH ROW SET @_var = NEW.col;", packet_length=packet_length@entry=81, is_com_multi=is_com_multi@entry=false,
is_next_command=is_next_command@entry=false) at /test/mtest/10.3_dbg/sql/sql_class.h:1152
#24 0x0000562c3e541ace in do_command (thd=0x1455c8000d90) at /test/mtest/10.3_dbg/sql/sql_parse.cc:1398
#25 0x0000562c3e669d54 in do_handle_one_connection (connect=connect@entry=0x562c42177370)
at /test/mtest/10.3_dbg/sql/sql_connect.cc:1403
#26 0x0000562c3e669f8b in handle_one_connection (arg=0x562c42177370)
at /test/mtest/10.3_dbg/sql/sql_connect.cc:1308
#27 0x0000562c403d1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#28 0x00001fd34bada293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
From these, Bug 2 seems to be easy to fix. However, Bug 1 looks complicated.
Seppo Jaakola added a comment:
jplindst, both these crashes happen when resolving MDL conflicts. The PR does not change the MDL conflict resolution path, other than passing the kill signal as a new argument, and from the stack traces above it looks like the high-priority transaction contesting the MDL conflict is passing the correct KILL_QUERY signal.
The test carried out here is therefore probably surfacing a new issue, which you would also encounter with the 10.3 HEAD version.
As you seem to have rr traces, please dig out the BF victim thread's status. Only the BF thread's trace is shown here, which is only half of the picture.
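For readers without an rr setup, this is a rough sketch (with an assumed trace directory and thread number) of how the victim thread's wsrep state could be dug out of such a recording; the comments that follow show the actual values obtained this way.
{noformat}
# Replay the rr recording and inspect the victim THD from the debugger
# (the trace directory name and thread number are assumptions).
rr replay mysqld-0
# then, at the (rr)/gdb prompt, in a frame where the victim thd is in scope:
#   info threads
#   thread 12
#   p thd->wsrep_query_state
#   p thd->wsrep_conflict_state
#   p thd->wsrep_exec_mode
{noformat}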
Jan Lindström (Inactive) added a comment:
For Bug 2 I agree this could be repeatable with 10.3 HEAD, and there the victim has the following information:
(rr) p thd->wsrep_query_state
$1 = QUERY_EXEC
(rr) p thd->wsrep_conflict_state
$2 = NO_CONFLICT
(rr) p thd->wsrep_exec_mode
$3 = LOCAL_STATE
(rr) p victim_trx->state
$4 = TRX_STATE_ACTIVE
(rr) p bf_trx
$5 = (const trx_t * const) 0x30e64a067468
(rr) p bf_trx->state
$6 = TRX_STATE_NOT_STARTED
(rr) p bf_thd
$7 = (THD * const) 0x1455c8000d90
(rr) p bf_thd->wsrep_query_state
$8 = QUERY_EXEC
(rr) p bf_thd->wsrep_conflict_state
$9 = NO_CONFLICT
(rr) p bf_thd->wsrep_exec_mode
$10 = TOTAL_ORDER
Jan Lindström (Inactive) added a comment:
For Bug 1 I do not agree, as 10.3 HEAD does not contain the wsrep_innobase_kill_one_trx function. In this case the victim has the following state:
(rr) p trx->id
$1 = 280
(rr) p trx->state
$2 = TRX_STATE_ACTIVE
(rr) p thd->wsrep_query_state
value has been optimized out
(rr) p trx->mysql_thd->wsrep_query_state
$3 = QUERY_EXEC
(rr) p trx->mysql_thd->wsrep_exec_mode
$4 = LOCAL_STATE
(rr) p trx->mysql_thd->wsrep_conflict_state
$5 = NO_CONFLICT
(rr) p trx->lock.wait_lock
$7 = (ib_lock_t *) 0x0
Mark Anstice added a comment:
One of the nodes in my 3-node 10.3.28 cluster crashed in this way yesterday, with the same messages logged in the syslog:
Jun 6 21:23:45 ta-eng-gal1mdb5 mysqld[29212]: 2021-06-06 21:23:44 0 [ERROR] WSREP: invalid state ROLLED_BACK (FATAL)
Jun 6 21:23:45 ta-eng-gal1mdb5 mysqld[29212]: #011 at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:abort_trx():735
Jun 6 21:23:45 ta-eng-gal1mdb5 mysqld[29212]: 2021-06-06 21:23:44 0 [ERROR] WSREP: cancel commit bad exit: 7 370256912
Jun 6 21:23:45 ta-eng-gal1mdb5 mysqld[29212]: 210606 21:23:44 [ERROR] mysqld got signal 6 ;
We have been running 10.3.28 for just over 2 weeks and this is the first time we have seen this crash; we upgraded from 10.3.9 (which had been very stable for the last 3 years).
Seppo Jaakola added a comment:
It looks like the PR is already merged and the review is done. However, this is assigned for review just to get the state of this issue correct in Jira.
Vasilis G added a comment (edited):
Hello,
Just to note that this also affects 10.3.29 (running on Debian 10), which we recently upgraded to on 2 Galera clusters. Nodes keep crashing at random times with the exact error message noted in this issue.
Marko Mäkelä added a comment:
I submitted some review comments for the 10.3 version (2 commits), identifying possible problems. I noticed that the 10.4 and 10.5 versions include some changes to row_upd_sec_index_entry() that are not mentioned in the commit message. It is not clear how they are related to these changes.
Given that most changes are outside InnoDB and this is replacing MDEV-23328 with another implementation, I think that the final review must be conducted by serg, who authored that change. Ideally, the test galera.galera_bf_kill_debug would not be removed.
Sergei Golubchik added a comment:
Shall I review the PR? At the moment the last comment is yours (“Unfortunately, this is not correct.”) and the status is “Changes requested”.
Also, was it tested on 10.6? In 10.6, InnoDB mutexes are covered by the server's deadlock detector. You might want to put this fix into 10.6 and run all Galera tests on a debug build, to be sure it didn't introduce any deadlocks.
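A rough sketch of the kind of debug-build run being suggested here (an illustration; the cmake invocation, provider path, and parallelism are assumptions):
{noformat}
# Build a debug server and run the whole Galera mtr suite against it (sketch).
cmake -DCMAKE_BUILD_TYPE=Debug . && make -j"$(nproc)"
cd mysql-test
WSREP_PROVIDER=/usr/lib64/galera/libgalera_smm.so \
  ./mysql-test-run.pl --suite=galera --parallel=4 --force
{noformat}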
Jan Lindström (Inactive) added a comment:
serg, please review. I have applied the review comments and other fixes in the above commits. These fixes were tested by Ramesh using pquery. There is no need to apply this to 10.6, as this background killing has already been removed there.
#8 0x00005647f24947dc in MutexDebug<TTASEventMutex<GenericPolicy> >::enter (this=0x7f92c800f060, mutex=0x7f92c800f050, name=0x5647f2b6dfc8 "/home/jan/mysql/10.2/storage/innobase/handler/ha_innodb.cc", line=5263) at /home/jan/mysql/10.2/storage/innobase/include/sync0policy.ic:64
#9 0x00005647f24930f4 in GenericPolicy<TTASEventMutex<GenericPolicy> >::enter (this=0x7f92c800f060, mutex=..., filename=0x5647f2b6dfc8 "/home/jan/mysql/10.2/storage/innobase/handler/ha_innodb.cc", line=5263) at /home/jan/mysql/10.2/storage/innobase/include/sync0policy.h:346
#10 0x00005647f249120f in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x7f92c800f050, n_spins=30, n_delay=4, name=0x5647f2b6dfc8 "/home/jan/mysql/10.2/storage/innobase/handler/ha_innodb.cc", line=5263) at /home/jan/mysql/10.2/storage/innobase/include/ib0mutex.h:635
#11 0x00005647f24691cb in innobase_kill_query (thd=0x7f9264000d90) at /home/jan/mysql/10.2/storage/innobase/handler/ha_innodb.cc:5263
#12 0x00005647f2263830 in kill_handlerton (thd=0x7f9264000d90, plugin=0x7f92b403ac50, level=0x7f92c8afbe74) at /home/jan/mysql/10.2/sql/handler.cc:843
#13 0x00005647f1ff6bb3 in plugin_foreach_with_mask (thd=0x7f9264000d90, func=0x5647f22637aa <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x7f92c8afbe74) at /home/jan/mysql/10.2/sql/sql_plugin.cc:2411
#14 0x00005647f226388e in ha_kill_query (thd=0x7f9264000d90, level=THD_ABORT_SOFTLY) at /home/jan/mysql/10.2/sql/handler.cc:850
#15 0x00005647f1f904f2 in THD::awake (this=0x7f9264000d90, state_to_set=KILL_QUERY) at /home/jan/mysql/10.2/sql/sql_class.cc:1733
#16 0x00005647f21c34b6 in wsrep_thd_awake (thd=0x7f9264000d90, signal=1 '\001') at /home/jan/mysql/10.2/sql/wsrep_mysqld.cc:2764
#17 0x00005647f248534b in wsrep_innobase_kill_one_trx (bf_thd=0x7f92b4001fe0, bf_trx=0x7f92abc00120, victim_trx=0x7f92abc01228, signal=1) at /home/jan/mysql/10.2/storage/innobase/handler/ha_innodb.cc:19722
#18 0x00005647f2485795 in wsrep_abort_transaction (hton=0x5647f479e730, bf_thd=0x7f92b4001fe0, victim_thd=0x7f9264000d90, signal=1 '\001') at /home/jan/mysql/10.2/storage/innobase/handler/ha_innodb.cc:19797
#19 0x00005647f22716fb in ha_abort_transaction (bf_thd=0x7f92b4001fe0, victim_thd=0x7f9264000d90, signal=1 '\001') at /home/jan/mysql/10.2/sql/handler.cc:6326
#20 0x00005647f21d2ddc in wsrep_abort_thd (bf_thd_ptr=0x7f92b4001fe0, victim_thd_ptr=0x7f9264000d90, signal=1 '\001') at /home/jan/mysql/10.2/sql/wsrep_thd.cc:837
#21 0x00005647f21c1679 in wsrep_grant_mdl_exception (requestor_ctx=0x7f92b4002100, ticket=0x7f926418abc0, key=0x7f9264158f80) at /home/jan/mysql/10.2/sql/wsrep_mysqld.cc:2057
#22 0x00005647f213d493 in MDL_lock::can_grant_lock (this=0x7f9264158f80, type_arg=MDL_SHARED_WRITE, requestor_ctx=0x7f92b4002100, ignore_lock_priority=false) at /home/jan/mysql/10.2/sql/mdl.cc:1592
#23 0x00005647f213dd34 in MDL_context::try_acquire_lock_impl (this=0x7f92b4002100, mdl_request=0x7f92b48d50c8, out_ticket=0x7f92c8afc308) at /home/jan/mysql/10.2/sql/mdl.cc:1923
#24 0x00005647f213e0c0 in MDL_context::acquire_lock (this=0x7f92b4002100, mdl_request=0x7f92b48d50c8, lock_wait_timeout=86400) at /home/jan/mysql/10.2/sql/mdl.cc:2067
#25 0x00005647f1f664c5 in open_table_get_mdl_lock (thd=0x7f92b4001fe0, ot_ctx=0x7f92c8afc9d0, mdl_request=0x7f92b48d50c8, flags=0, mdl_ticket=0x7f92c8afc630) at /home/jan/mysql/10.2/sql/sql_base.cc:1461
#26 0x00005647f1f66e90 in open_table (thd=0x7f92b4001fe0, table_list=0x7f92b48d4c70, ot_ctx=0x7f92c8afc9d0) at /home/jan/mysql/10.2/sql/sql_base.cc:1760
#27 0x00005647f1f6a46c in open_and_process_table (thd=0x7f92b4001fe0, tables=0x7f92b48d4c70, counter=0x7f92c8afca64, flags=0, prelocking_strategy=0x7f92c8afcae8, has_prelocking_list=false, ot_ctx=0x7f92c8afc9d0) at /home/jan/mysql/10.2/sql/sql_base.cc:3589
#28 0x00005647f1f6b370 in open_tables (thd=0x7f92b4001fe0, options=..., start=0x7f92c8afca48, counter=0x7f92c8afca64, flags=0, prelocking_strategy=0x7f92c8afcae8) at /home/jan/mysql/10.2/sql/sql_base.cc:4062
#29 0x00005647f1f6c9c0 in open_and_lock_tables (thd=0x7f92b4001fe0, options=..., tables=0x7f92b48d4c70, derived=false, flags=0, prelocking_strategy=0x7f92c8afcae8) at /home/jan/mysql/10.2/sql/sql_base.cc:4861
#30 0x00005647f1f3251b in open_and_lock_tables (thd=0x7f92b4001fe0, tables=0x7f92b48d4c70, derived=false, flags=0) at /home/jan/mysql/10.2/sql/sql_base.h:508
#31 0x00005647f239a995 in Rows_log_event::do_apply_event (this=0x7f92b4028840, rgi=0x7f92b4021740) at /home/jan/mysql/10.2/sql/log_event.cc:10854
#32 0x00005647f1f22f47 in Log_event::apply_event (this=0x7f92b4028840, rgi=0x7f92b4021740) at /home/jan/mysql/10.2/sql/log_event.h:1462
#33 0x00005647f21cf5bf in wsrep_apply_events (thd=0x7f92b4001fe0, events_buf=0x7f92cbd57150, buf_len=493589) at /home/jan/mysql/10.2/sql/wsrep_applier.cc:164
#34 0x00005647f21cfb8b in wsrep_apply_cb (ctx=0x7f92b4001fe0, buf=0x7f92cbd57150, buf_len=502024, flags=1, meta=0x7f92c8afd350) at /home/jan/mysql/10.2/sql/wsrep_applier.cc:267
#35 0x00007f92d3cc99cc in galera::TrxHandle::apply (this=0x7f92b403e060, recv_ctx=0x7f92b4001fe0, apply_cb=0x5647f21cf87f <wsrep_apply_cb(void*, void const*, size_t, uint32_t, wsrep_trx_meta_t const*)>, meta=...) at /home/jan/mysql/mariadb-galeralib/galera/src/trx_handle.cpp:316
#37 0x00007f92d3cd3b15 in galera::ReplicatorSMM::apply_trx (this=0x5647f4736f50, recv_ctx=0x7f92b4001fe0, trx=0x7f92b403e060) at /home/jan/mysql/mariadb-galeralib/galera/src/replicator_smm.cpp:455
#38 0x00007f92d3cd82a6 in galera::ReplicatorSMM::process_trx (this=0x5647f4736f50, recv_ctx=0x7f92b4001fe0, trx=0x7f92b403e060) at /home/jan/mysql/mariadb-galeralib/galera/src/replicator_smm.cpp:1259
#39 0x00007f92d3d13781 in galera::GcsActionSource::dispatch (this=0x5647f47376d8, recv_ctx=0x7f92b4001fe0, act=..., exit_loop=@0x7f92c8afda7b: false) at /home/jan/mysql/mariadb-galeralib/galera/src/gcs_action_source.cpp:115
#40 0x00007f92d3d13fb9 in galera::GcsActionSource::process (this=0x5647f47376d8, recv_ctx=0x7f92b4001fe0, exit_loop=@0x7f92c8afda7b: false) at /home/jan/mysql/mariadb-galeralib/galera/src/gcs_action_source.cpp:180
#41 0x00007f92d3cd3141 in galera::ReplicatorSMM::async_recv (this=0x5647f4736f50, recv_ctx=0x7f92b4001fe0) at /home/jan/mysql/mariadb-galeralib/galera/src/replicator_smm.cpp:363
#42 0x00007f92d3ca86b3 in galera_recv (gh=0x5647f47040f0, recv_ctx=0x7f92b4001fe0) at /home/jan/mysql/mariadb-galeralib/galera/src/wsrep_provider.cpp:244
#43 0x00005647f21d1df8 in wsrep_replication_process (thd=0x7f92b4001fe0) at /home/jan/mysql/10.2/sql/wsrep_thd.cc:486
#44 0x00005647f21c1ba4 in start_wsrep_THD (arg=0x5647f476aaf0) at /home/jan/mysql/10.2/sql/wsrep_mysqld.cc:2169
#45 0x00005647f295c085 in pfs_spawn_thread (arg=0x5647f47595f0) at /home/jan/mysql/10.2/storage/perfschema/pfs.cc:1869
#46 0x00007f92d91de450 in start_thread (arg=0x7f92c8afe640) at pthread_create.c:473
#47 0x00007f92d8d7ed53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Problem found with pquery:
(rr) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x000073fe62f61859 in __GI_abort () at abort.c:79
#2 0x0000562f2a5ffabe in safe_mutex_lock (mp=mp@entry=0x146fb4002c28, my_flags=my_flags@entry=0,
file=file@entry=0x562f2a6d7ec0 "/test/mtest/10.2_dbg/sql/wsrep_mysqld.cc", line=line@entry=2689) at /test/mtest/10.2_dbg/mysys/thr_mutex.c:264
#3 0x0000562f29e70f47 in inline_mysql_mutex_lock (src_line=2689, src_file=0x562f2a6d7ec0 "/test/mtest/10.2_dbg/sql/wsrep_mysqld.cc", that=0x146fb4002c28)
at /test/mtest/10.2_dbg/include/mysql/psi/mysql_thread.h:688
#4 wsrep_thd_LOCK (thd=thd@entry=0x146fb4001230) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:2689
#5 0x0000562f2a0e351b in wsrep_innobase_kill_one_trx (bf_thd=bf_thd@entry=0x146fb8001260, bf_trx=bf_trx@entry=0x1d1f69a22120,
victim_trx=victim_trx@entry=0x1d1f69a3cae8, signal=signal@entry=0) at /test/mtest/10.2_dbg/storage/innobase/handler/ha_innodb.cc:19607
Jan Lindström (Inactive) added a comment (edited):
User KILL as a TOI:
https://github.com/MariaDB/server/pull/1894
10.2
http://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.2-galera
https://github.com/MariaDB/server/commit/5236b6995560b1f4d04c6e6d7c9188785b07bdb1
10.3
http://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.3-MDEV-25114
https://github.com/MariaDB/server/commit/568b7b64cd427321dc3513b6b8d1db88f7e3d59f
10.4 (NOT CORRECT)
https://github.com/MariaDB/server/commit/cd999b6f790531533bb92366fba976093aecc518
Jan Lindström (Inactive) added a comment:
The problem in 10.4 is as follows:
branch: bb-10.4-MDEV-25114
test case: galera.MW-286
stack trace:
Thread 1 (Thread 0x7f5fd43e0640 (LWP 924119)):
#0 __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:54
#1 0x000055bd674e2f2e in my_write_core (sig=6) at /home/jan/mysql/10.4/mysys/stacktrace.c:386
#2 0x000055bd66c26a9d in handle_fatal_signal (sig=6) at /home/jan/mysql/10.4/sql/signal_handler.cc:356
#3 <signal handler called>
#4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#5 0x00007f5fdbe81864 in __GI_abort () at abort.c:79
#6 0x00007f5fdbe81749 in __assert_fail_base (fmt=0x7f5fdc00af78 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55bd677df9ee "!is_owned()", file=0x55bd677dfa00 "/home/jan/mysql/10.4/storage/innobase/include/sync0policy.h", line=121, function=<optimized out>) at assert.c:92
#7 0x00007f5fdbe933d6 in __GI___assert_fail (assertion=0x55bd677df9ee "!is_owned()", file=0x55bd677dfa00 "/home/jan/mysql/10.4/storage/innobase/include/sync0policy.h", line=121, function=0x55bd677dfa40 "void MutexDebug<Mutex>::enter(const Mutex&, const char*, unsigned int) [with Mutex = TTASEventMutex<GenericPolicy>]") at assert.c:101
#8 0x000055bd66eb265a in MutexDebug<TTASEventMutex<GenericPolicy> >::enter (this=0x55bd680a09d0 <lock_sys+80>, mutex=..., filename=0x55bd677cfbb0 "/home/jan/mysql/10.4/storage/innobase/handler/ha_innodb.cc", line=5195) at /home/jan/mysql/10.4/storage/innobase/include/sync0policy.h:121
#9 0x000055bd66eb0599 in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x55bd680a09c0 <lock_sys+64>, n_spins=30, n_delay=4, name=0x55bd677cfbb0 "/home/jan/mysql/10.4/storage/innobase/handler/ha_innodb.cc", line=5195) at /home/jan/mysql/10.4/storage/innobase/include/ib0mutex.h:592
#10 0x000055bd66e843dc in innobase_kill_query (thd=0x7f5fa0002120) at /home/jan/mysql/10.4/storage/innobase/handler/ha_innodb.cc:5195
#11 0x000055bd66c2b196 in kill_handlerton (thd=0x7f5fa0002120, plugin=0x7f5fb0034500, level=0x7f5fd43dd974) at /home/jan/mysql/10.4/sql/handler.cc:865
#12 0x000055bd668d4eff in plugin_foreach_with_mask (thd=0x7f5fa0002120, func=0x55bd66c2b056 <kill_handlerton(THD*, plugin_ref, void*)>, type=1, state_mask=8, arg=0x7f5fd43dd974) at /home/jan/mysql/10.4/sql/sql_plugin.cc:2474
#13 0x000055bd66c2b1f8 in ha_kill_query (thd=0x7f5fa0002120, level=THD_ABORT_SOFTLY) at /home/jan/mysql/10.4/sql/handler.cc:872
#14 0x000055bd6683e273 in THD::awake_no_mutex (this=0x7f5fa0002120, state_to_set=KILL_QUERY) at /home/jan/mysql/10.4/sql/sql_class.cc:1899
#15 0x000055bd66b4dcc9 in wsrep_thd_bf_abort (bf_thd=0x7f5fb0000d90, victim_thd=0x7f5fa0002120, signal=1 '\001') at /home/jan/mysql/10.4/sql/service_wsrep.cc:252
#16 0x000055bd66ea1888 in wsrep_innobase_kill_one_trx (bf_thd=0x7f5fb0000d90, victim_trx=0x7f5fc4bd5218, signal=1 '\001') at /home/jan/mysql/10.4/storage/innobase/handler/ha_innodb.cc:18725
#17 0x000055bd66ea1c48 in wsrep_abort_transaction (bf_thd=0x7f5fb0000d90, victim_thd=0x7f5fa0002120, signal=1 '\001') at /home/jan/mysql/10.4/storage/innobase/handler/ha_innodb.cc:18773
#18 0x000055bd66c3df05 in ha_abort_transaction (bf_thd=0x7f5fb0000d90, victim_thd=0x7f5fa0002120, signal=1 '\001') at /home/jan/mysql/10.4/sql/handler.cc:7129
#19 0x000055bd66b30040 in wsrep_handle_mdl_conflict (requestor_ctx=0x7f5fb0000eb0, ticket=0x7f5fa0143f20, key=0x7f5fa00944d0) at /home/jan/mysql/10.4/sql/wsrep_mysqld.cc:2386
#20 0x000055bd66a510ed in MDL_lock::can_grant_lock (this=0x7f5fa00944d0, type_arg=MDL_SHARED_WRITE, requestor_ctx=0x7f5fb0000eb0, ignore_lock_priority=false) at /home/jan/mysql/10.4/sql/mdl.cc:1798
#21 0x000055bd66a51a5e in MDL_context::try_acquire_lock_impl (this=0x7f5fb0000eb0, mdl_request=0x7f5fb002c6b0, out_ticket=0x7f5fd43dddd8) at /home/jan/mysql/10.4/sql/mdl.cc:2129
#22 0x000055bd66a51e6e in MDL_context::acquire_lock (this=0x7f5fb0000eb0, mdl_request=0x7f5fb002c6b0, lock_wait_timeout=86400) at /home/jan/mysql/10.4/sql/mdl.cc:2276
#23 0x000055bd6680e7cb in open_table_get_mdl_lock (thd=0x7f5fb0000d90, ot_ctx=0x7f5fd43de4a0, mdl_request=0x7f5fb002c6b0, flags=0, mdl_ticket=0x7f5fd43de100) at /home/jan/mysql/10.4/sql/sql_base.cc:1647
#24 0x000055bd6680f15b in open_table (thd=0x7f5fb0000d90, table_list=0x7f5fb002c260, ot_ctx=0x7f5fd43de4a0) at /home/jan/mysql/10.4/sql/sql_base.cc:1915
#25 0x000055bd66813523 in open_and_process_table (thd=0x7f5fb0000d90, tables=0x7f5fb002c260, counter=0x7f5fd43de534, flags=0, prelocking_strategy=0x7f5fd43de5b8, has_prelocking_list=false, ot_ctx=0x7f5fd43de4a0) at /home/jan/mysql/10.4/sql/sql_base.cc:3890
#26 0x000055bd668147f7 in open_tables (thd=0x7f5fb0000d90, options=..., start=0x7f5fd43de518, counter=0x7f5fd43de534, flags=0, prelocking_strategy=0x7f5fd43de5b8) at /home/jan/mysql/10.4/sql/sql_base.cc:4371
#27 0x000055bd6681676d in open_and_lock_tables (thd=0x7f5fb0000d90, options=..., tables=0x7f5fb002c260, derived=false, flags=0, prelocking_strategy=0x7f5fd43de5b8) at /home/jan/mysql/10.4/sql/sql_base.cc:5317
#28 0x000055bd667ce32f in open_and_lock_tables (thd=0x7f5fb0000d90, tables=0x7f5fb002c260, derived=false, flags=0) at /home/jan/mysql/10.4/sql/sql_base.h:503
#29 0x000055bd66d96de7 in Rows_log_event::do_apply_event (this=0x7f5fb002cb90, rgi=0x7f5fb0022700) at /home/jan/mysql/10.4/sql/log_event.cc:11409
#30 0x000055bd667b6b41 in Log_event::apply_event (this=0x7f5fb002cb90, rgi=0x7f5fb0022700) at /home/jan/mysql/10.4/sql/log_event.h:1492
#31 0x000055bd66b40a1b in wsrep_apply_events (thd=0x7f5fb0000d90, rli=0x7f5fb0016fd0, events_buf=0x7f5fc7fffbf8, buf_len=0) at /home/jan/mysql/10.4/sql/wsrep_applier.cc:200
#32 0x000055bd66b2183e in Wsrep_applier_service::apply_write_set (this=0x7f5fd43dfca0, ws_meta=..., data=...) at /home/jan/mysql/10.4/sql/wsrep_high_priority_service.cc:550
#33 0x000055bd675613b9 in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /home/jan/mysql/10.4/wsrep-lib/src/server_state.cpp:327
#34 0x000055bd6756532c in wsrep::server_state::on_apply (this=0x55bd694aafe0, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /home/jan/mysql/10.4/wsrep-lib/src/server_state.cpp:1137
#35 0x000055bd675803e5 in wsrep::high_priority_service::apply (this=0x7f5fd43dfca0, ws_handle=..., ws_meta=..., data=...) at /home/jan/mysql/10.4/wsrep-lib/include/wsrep/high_priority_service.hpp:47
#36 0x000055bd6757d343 in (anonymous namespace)::apply_cb (ctx=0x7f5fd43dfca0, wsh=0x7f5fd43deec0, flags=65, buf=0x7f5fd43deed0, meta=0x7f5fd43df190, exit_loop=0x7f5fd43df14f) at /home/jan/mysql/10.4/wsrep-lib/src/wsrep_provider_v26.cpp:504
#37 0x00007f5fd6dd6bf5 in galera::TrxHandleSlave::apply (this=this@entry=0x7f5fb0030700, recv_ctx=recv_ctx@entry=0x7f5fd43dfca0, apply_cb=0x55bd6757d110 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7f5fd43df14f: false) at /home/jan/mysql/mariadb-galeralib/galera/src/trx_handle.cpp:391
#38 0x00007f5fd6de6216 in galera::ReplicatorSMM::apply_trx (this=0x55bd694df1a0, recv_ctx=0x7f5fd43dfca0, ts=...) at /home/jan/mysql/mariadb-galeralib/galera/src/replicator_smm.cpp:504
#39 0x00007f5fd6de9fd0 in galera::ReplicatorSMM::process_trx (this=0x55bd694df1a0, recv_ctx=0x7f5fd43dfca0, ts_ptr=...) at /home/jan/mysql/mariadb-galeralib/galera/src/replicator_smm.cpp:2123
#40 0x00007f5fd6e1dd20 in galera::GcsActionSource::process_writeset (this=0x55bd69511270, recv_ctx=0x7f5fd43dfca0, act=..., exit_loop=@0x7f5fd43df87f: false) at /home/jan/mysql/mariadb-galeralib/galera/src/gcs_action_source.cpp:62
#41 0x00007f5fd6e1e9d2 in galera::GcsActionSource::process (this=0x55bd69511270, recv_ctx=0x7f5fd43dfca0, exit_loop=@0x7f5fd43df87f: false) at /home/jan/mysql/mariadb-galeralib/galera/src/gcs_action_source.cpp:183
#42 0x00007f5fd6dea4c0 in galera::ReplicatorSMM::async_recv (this=0x55bd694df1a0, recv_ctx=0x7f5fd43dfca0) at /home/jan/mysql/mariadb-galeralib/galera/src/replicator_smm.cpp:390
#43 0x00007f5fd6dc31f1 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at /home/jan/mysql/mariadb-galeralib/galera/src/wsrep_provider.cpp:236
#44 0x000055bd6757e510 in wsrep::wsrep_provider_v26::run_applier (this=0x55bd6939c410, applier_ctx=0x7f5fd43dfca0) at /home/jan/mysql/10.4/wsrep-lib/src/wsrep_provider_v26.cpp:796
#45 0x000055bd66b41359 in wsrep_replication_process (thd=0x7f5fb0000d90, arg=0x55bd69517260) at /home/jan/mysql/10.4/sql/wsrep_thd.cc:57
#46 0x000055bd66b31d01 in start_wsrep_THD (arg=0x55bd69517260) at /home/jan/mysql/10.4/sql/wsrep_mysqld.cc:2915
#47 0x000055bd6746e4bf in pfs_spawn_thread (arg=0x55bd69501450) at /home/jan/mysql/10.4/storage/perfschema/pfs.cc:1869
#48 0x00007f5fdc3db450 in start_thread (arg=0x7f5fd43e0640) at pthread_create.c:473
#49 0x00007f5fdbf72d53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
- saving '/home/jan/mysql/10.4/mysql-test/var/log/galera.MW-286-innodb/' to '/home/jan/mysql/10.4/mysql-test/var/log/galera.MW-286-innodb/'
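For context on the trace above: frames #20-#31 show a high-priority Galera applier thread blocked in MDL_context::acquire_lock while applying a replicated row event, i.e. it is waiting for a metadata lock that conflicts with a transaction running locally. The short program below is only a toy sketch of that pattern, not MariaDB or Galera source code; every name in it (ToyMdlLock, applier_acquire, local_must_abort and so on) is invented for illustration. It models the general brute-force-abort idea: the applier marks the conflicting local owner for rollback and then waits until the lock is released.

// Toy model only -- not MariaDB/Galera source code.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct ToyMdlLock {
    std::mutex m;
    std::condition_variable cv;
    bool held_by_local = false;     // a local transaction currently owns the lock
    bool local_must_abort = false;  // set when the applier brute-force aborts the owner

    // Local (low-priority) transaction takes the lock and later checks
    // whether it has been told to roll back.
    void local_acquire() {
        std::lock_guard<std::mutex> g(m);
        held_by_local = true;
    }
    bool local_was_bf_aborted() {
        std::lock_guard<std::mutex> g(m);
        return local_must_abort;
    }
    void local_release() {
        {
            std::lock_guard<std::mutex> g(m);
            held_by_local = false;
        }
        cv.notify_all();
    }

    // High-priority applier: mark the current owner for rollback, then block
    // until the lock is free (compare the wait in MDL_context::acquire_lock
    // in the backtrace, where the applier waits while the victim rolls back).
    void applier_acquire() {
        std::unique_lock<std::mutex> ul(m);
        if (held_by_local)
            local_must_abort = true;          // "brute-force abort" the owner
        cv.wait(ul, [this] { return !held_by_local; });
    }
};

int main() {
    ToyMdlLock lock;
    lock.local_acquire();                      // local transaction holds the lock

    std::thread applier([&] {
        lock.applier_acquire();                // blocks until the victim yields
        std::cout << "applier acquired the metadata lock\n";
    });

    // The local transaction notices it was BF-aborted, rolls back and releases
    // the lock, which unblocks the applier.
    while (!lock.local_was_bf_aborted())
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::cout << "local transaction was BF-aborted, rolling back\n";
    lock.local_release();

    applier.join();
    return 0;
}

Compiled with g++ -std=c++17 -pthread, it prints that the local transaction rolls back and then that the applier obtained the lock; the fatal "invalid state ROLLED_BACK" in this ticket appears to be raised when such an abort is attempted against a session that has already rolled back.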
Jan Lindström (Inactive)
added a comment (edited) - serg, marko: This is now ready for review.
Latest versions:
10.2: https://github.com/MariaDB/server/commit/8f075c6d3dde8a7eec3ce901a0699d1478e5c065
10.3: https://github.com/MariaDB/server/commit/1eb067c5a29cc7f7245e8e7056e687b74494476e
10.4: https://github.com/MariaDB/server/commit/f79227a61f80e6590d967c110b0f9cf6f3d8bffe
10.5: https://github.com/MariaDB/server/commit/d683b6ece75a07d8cfaf9301532184cde9e3a1f7
10.6: https://github.com/MariaDB/server/commit/14b125c8b54439360e2277e725c56af0ee52d029
Next step is to do pquery QA runs on these.
Ramesh Sivaraman
added a comment - jplindst pquery run looks good. Did not hit any new crashes related to this issue.
pquery run test case:
multi-threaded
multi master
SQL file prepared with KILL CONNECTION/QUERY statements
Marko Mäkelä
added a comment - I posted a few comments to the 10.2 version of this:
The claimed revert of MDEV-23328 includes some strange changes, including introducing a potential race condition into wsrep_abort_transaction().
The proposed fix appears to introduce a race condition in THD::notify_shared_lock(). I am not very familiar with this code, but I suspect that as soon as THD::LOCK_thd_data on the to-be-aborted session is released, that session object may be concurrently aborted by some other thread, and could be assigned to another transaction before wsrep_abort_thd() is invoked. I think that this (as well as the above-mentioned change to wsrep_abort_transaction()) needs to be tested with an extra sleep or yield added between the mutex release and the call.
Looking at the total diff across the two commits, I see that the function wsrep_innobase_kill_one_trx() has been moved elsewhere in the file, which makes the actual changes harder to read. Please generate a minimal diff so that I can review it in more depth.
I see that there is a 10.6 version as well; it does not affect InnoDB and seemingly nothing outside Galera.
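To make the suspected window concrete, here is a minimal self-contained sketch in plain C++ (not MariaDB code; ToySession, abort_seen_transaction and trx_id are invented names). A decision is taken while holding the session's mutex (standing in for THD::LOCK_thd_data), the mutex is released, and only then is the abort performed; in between, another thread reuses the session for a different transaction. The injected sleep plays the role of the extra sleep or yield suggested above for widening the window during testing.

// Toy illustration of the review concern -- not MariaDB code.
#include <chrono>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

struct ToySession {
    std::mutex lock;      // stands in for THD::LOCK_thd_data
    long long trx_id = 1; // transaction currently using this session
};

// Thread A decides to abort the transaction it *saw* while holding the lock,
// but performs the abort only after releasing it (the suspected race window).
void abort_seen_transaction(ToySession& s, long long seen_trx) {
    s.lock.lock();
    bool should_abort = (s.trx_id == seen_trx);
    s.lock.unlock();
    // ---- the window under discussion: the session is no longer protected ----
    std::this_thread::sleep_for(std::chrono::milliseconds(50)); // injected delay
    if (should_abort) {
        std::lock_guard<std::mutex> g(s.lock);
        if (s.trx_id != seen_trx)
            std::cout << "stale target: session now runs trx " << s.trx_id << "\n";
        else
            std::cout << "aborting trx " << seen_trx << "\n";
    }
}

int main() {
    ToySession session;
    std::thread killer(abort_seen_transaction, std::ref(session), 1LL);

    // Thread B: the session finishes trx 1 and is reused for trx 2 while the
    // killer thread sits inside the widened window.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    {
        std::lock_guard<std::mutex> g(session.lock);
        session.trx_id = 2;
    }
    killer.join();
    return 0;
}

In this toy the late re-check under the lock detects the stale target; the review concern is whether the real code path re-validates the session in an equivalent way once THD::LOCK_thd_data has been released.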
Marko Mäkelä
added a comment - The latest revision for 10.2 looks mostly OK to me. My only concern is a potential race condition in wsrep_close_client_connections(), but because I am not familiar with that code, I may be mistaken there.
Steve Baroti
added a comment - OMG, we have been waiting for this fix, thank you! After September's 10.3.24 -> 10.3.31 upgrade, the mariadb service on every node of our three-node dev cluster (10.3.31) has been crashing every one or two hours. We wanted to revert, but we have too much data to restore. A festival of restarting MariaDB nodes! Thanks again for your great efforts! Fingers crossed!