|
I believe that this could be proof that the attempted fix of MDEV-21910 in 10.4 and 10.5 does not really work. MDEV-23328 had already been filed for something similar.
|
|
#6 0x000055f2e557d6cc in ut_dbg_assertion_failed (expr=expr@entry=0x0, file=file@entry=0x55f2e5b18098 "/mariadb/10.4/storage/innobase/lock/lock0lock.cc", line=line@entry=687) at /mariadb/10.4/storage/innobase/ut/ut0dbg.cc:60
|
#7 0x000055f2e53a1ddd in wsrep_assert_no_bf_bf_wait (lock_rec1=lock_rec1@entry=0x0, lock_rec2=lock_rec2@entry=0x7fa0373ff678, trx1=trx1@entry=0x7fa0374035e0) at /mariadb/10.4/storage/innobase/lock/lock0lock.cc:687
|
#8 0x000055f2e53a1f64 in lock_rec_has_to_wait (for_locking=for_locking@entry=true, trx=trx@entry=0x7fa0374035e0, type_mode=type_mode@entry=1027, lock2=lock2@entry=0x7fa0373ff678, lock_is_on_supremum=lock_is_on_supremum@entry=false) at /mariadb/10.4/storage/innobase/lock/lock0lock.cc:799
|
#9 0x000055f2e53b433a in lock_rec_other_has_conflicting (mode=mode@entry=1027, block=block@entry=0x7fa036c2cf90, heap_no=heap_no@entry=3, trx=trx@entry=0x7fa0374035e0) at /mariadb/10.4/storage/innobase/lock/lock0lock.cc:1159
|
#10 0x000055f2e53b8e4c in lock_rec_lock (impl=impl@entry=false, mode=mode@entry=1027, block=block@entry=0x7fa036c2cf90, heap_no=heap_no@entry=3, index=index@entry=0x7f9fc4023f70, thr=thr@entry=0x7f9fc402bfc0) at /mariadb/10.4/storage/innobase/lock/lock0lock.cc:1932
|
#11 0x000055f2e53ba365 in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=block@entry=0x7fa036c2cf90, rec=rec@entry=0x7fa03715c099 "\200", index=index@entry=0x7f9fc4023f70, offsets=offsets@entry=0x7fa036a195e0, mode=mode@entry=LOCK_X, gap_mode=1024, thr=0x7f9fc402bfc0) at /mariadb/10.4/storage/innobase/lock/lock0lock.cc:5808
|
#12 0x000055f2e54ebb70 in sel_set_rec_lock (pcur=pcur@entry=0x7f9fc402b7a0, rec=0x7fa03715c099 "\200", index=index@entry=0x7f9fc4023f70, offsets=0x7fa036a195e0, mode=3, type=1024, thr=0x7f9fc402bfc0, mtr=0x7fa036a199c0) at /mariadb/10.4/storage/innobase/row/row0sel.cc:1252
|
#13 0x000055f2e54f0427 in row_search_mvcc (buf=buf@entry=0x7f9fc402a3a8 "\375\002", mode=<optimized out>, mode@entry=PAGE_CUR_GE, prebuilt=0x7f9fc402b5d0, match_mode=match_mode@entry=1, direction=direction@entry=0) at /mariadb/10.4/storage/innobase/row/row0sel.cc:5090
|
#14 0x000055f2e531f37b in ha_innobase::index_read (this=this@entry=0x7f9fc4029640, buf=buf@entry=0x7f9fc402a3a8 "\375\002", key_ptr=key_ptr@entry=0x7f9fc402a688 "\002", key_len=<optimized out>, find_flag=find_flag@entry=HA_READ_KEY_EXACT) at /mariadb/10.4/storage/innobase/handler/ha_innodb.cc:9216
|
#15 0x000055f2e531f818 in ha_innobase::rnd_pos (this=0x7f9fc4029640, buf=0x7f9fc402a3a8 "\375\002", pos=0x7f9fc402a688 "\002") at /mariadb/10.4/storage/innobase/handler/ha_innodb.cc:9718
|
#16 0x000055f2e50a38b8 in handler::ha_rnd_pos (this=this@entry=0x7f9fc4029640, buf=buf@entry=0x7f9fc402a3a8 "\375\002", pos=0x7f9fc402a688 "\002") at /mariadb/10.4/sql/handler.cc:2894
|
#17 0x000055f2e50b3319 in handler::rnd_pos_by_record (this=0x7f9fc4029640, record=0x7f9fc402a3a8 "\375\002") at /mariadb/10.4/sql/handler.h:3762
|
#18 0x000055f2e5242ffb in handler::ha_rnd_pos_by_record (buf=<optimized out>, this=0x7f9fc4029640) at /mariadb/10.4/sql/sql_class.h:6597
|
#19 Rows_log_event::find_row (this=this@entry=0x7f9fb801adb8, rgi=rgi@entry=0x7f9fb803f7f0) at /mariadb/10.4/sql/log_event.cc:14125
|
#20 0x000055f2e52440a8 in Update_rows_log_event::do_exec_row (this=0x7f9fb801adb8, rgi=0x7f9fb803f7f0) at /mariadb/10.4/sql/log_event.cc:14672
|
#21 0x000055f2e523258e in Rows_log_event::do_apply_event (this=0x7f9fb801adb8, rgi=0x7f9fb803f7f0) at /mariadb/10.4/sql/log_event.cc:11635
|
#22 0x000055f2e4fe9fa5 in Log_event::apply_event (rgi=0x7f9fb803f7f0, this=0x7f9fb801adb8) at /mariadb/10.4/sql/log_event.h:1484
|
#23 wsrep_apply_events (thd=thd@entry=0x7f9fb802d778, rli=<optimized out>, events_buf=<optimized out>, buf_len=<optimized out>) at /mariadb/10.4/sql/wsrep_applier.cc:200
|
#24 0x000055f2e4fd06b5 in Wsrep_replayer_service::apply_write_set (this=0x7fa036a1aed0, ws_meta=@0x7fa036a1a520: {gtid_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 9}}, stid_ = {server_id_ = {data_ = {buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, transaction_id_ = {id_ = 46}, client_id_ = {id_ = 17}}, depends_on_ = {seqno_ = 8}, flags_ = 3}, data=@0x7fa036a1a4c0: {ptr_ = 0x7fa03f600260, size_ = 203}) at /mariadb/10.4/wsrep-lib/include/wsrep/buffer.hpp:47
|
#25 0x000055f2e59162e3 in apply_write_set (server_state=Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
|
Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
|
@0x55f2e831cde0: {_vptr.server_state = 0x55f2e615d8c0 <vtable for Wsrep_server_state+16>, mutex_ = @0x55f2e831d090, cond_ = @0x55f2e831d0a0, server_service_ = @0x55f2e831d0b0, encryption_service_ = 0x0, state_ = wsrep::server_state::s_synced, state_hist_ = std::vector of length 9, capacity 16 = {wsrep::server_state::s_disconnected, wsrep::server_state::s_connected, wsrep::server_state::s_joiner, wsrep::server_state::s_initializing, wsrep::server_state::s_initialized, wsrep::server_state::s_joined, wsrep::server_state::s_synced, wsrep::server_state::s_donor, wsrep::server_state::s_joined}, state_waiters_ = std::vector of length 9, capacity 9 = {0, 0, 0, 0, 0, 0, 0, 0, 0}, bootstrap_ = false, initial_position_ = {id_ = {data_ = {buf = '\000' <repeats 15 times>}}, seqno_ = {seqno_ = -1}}, init_initialized_ = true, init_synced_ = true, sst_gtid_ = {id_ = {data_ = {buf = '\000' <repeats 15 times>}}, seqno_ = {seqno_ = -1}}, desync_count_ = 0, desynced_on_pause_ = false, pause_count_ = 0, pause_seqno_ = {seqno_ = -1}, streaming_clients_ = std::map with 0 elements, streaming_appliers_ = std::map with 0 elements, streaming_appliers_recovered_ = true, provider_ = 0x55f2e82456f0, name_ = "jyty", id_ = {data_ = {buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, incoming_address_ = "127.0.0.1:16640", address_ = "127.0.0.1", working_dir_ = "/dev/shm/10.4/mysql-test/var/33/mysqld.1/data/", encryption_key_ = std::vector of length 0, capacity 0, max_protocol_version_ = 4, rollback_mode_ = wsrep::server_state::rm_sync, connected_gtid_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 1}}, previous_primary_view_ = {state_id_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 1}}, view_seqno_ = {seqno_ = 1}, status_ = wsrep::view::primary, capabilities_ = 184703, own_index_ = 0, protocol_version_ = 4, members_ = std::vector of length 1, capacity 1 = {{id_ = {data_ = 
{buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, name_ = "jyty", incoming_ = "127.0.0.1:16640"}}}, current_view_ = {state_id_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 2}}, view_seqno_ = {seqno_ = 2}, status_ = wsrep::view::primary, capabilities_ = 184703, own_index_ = 0, protocol_version_ = 4, members_ = std::vector of length 2, capacity 2 = {{id_ = {data_ = {buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, name_ = "jyty", incoming_ = "127.0.0.1:16640"}, {id_ = {data_ = {buf = "\366\n\372#\373\352\021\352\203\335\022\004\312g0?"}}, name_ = "jyty", incoming_ = "127.0.0.1:16641"}}}, last_committed_gtid_ = {id_ = {data_ = {buf = '\000' <repeats 15 times>}}, seqno_ = {seqno_ = -1}}}, high_priority_service=@0x7fa036a1aed0: {_vptr.high_priority_service = 0x55f2e6076950 <vtable for Wsrep_replayer_service+16>, server_state_ = @0x55f2e831cde0, must_exit_ = false}, ws_handle=@0x7fa036a1a4d0: {transaction_id_ = {id_ = 46}, opaque_ = 0x7f9fb801ba00}, ws_meta=@0x7fa036a1a520: {gtid_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 9}}, stid_ = {server_id_ = {data_ = {buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, transaction_id_ = {id_ = 46}, client_id_ = {id_ = 17}}, depends_on_ = {seqno_ = 8}, flags_ = 3}, data=@0x7fa036a1a4c0: {ptr_ = 0x7fa03f600260, size_ = 203}) at /mariadb/10.4/wsrep-lib/src/server_state.cpp:327
|
#26 0x000055f2e5916de7 in wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=@0x7fa036a1aed0: {_vptr.high_priority_service = 0x55f2e6076950 <vtable for Wsrep_replayer_service+16>, server_state_ = @0x55f2e831cde0, must_exit_ = false}, ws_handle=@0x7fa036a1a4d0: {transaction_id_ = {id_ = 46}, opaque_ = 0x7f9fb801ba00}, ws_meta=@0x7fa036a1a520: {gtid_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 9}}, stid_ = {server_id_ = {data_ = {buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, transaction_id_ = {id_ = 46}, client_id_ = {id_ = 17}}, depends_on_ = {seqno_ = 8}, flags_ = 3}, data=@0x7fa036a1a4c0: {ptr_ = 0x7fa03f600260, size_ = 203}) at /mariadb/10.4/wsrep-lib/src/server_state.cpp:1136
|
#27 0x000055f2e5927b85 in wsrep::high_priority_service::apply (data=@0x7fa036a1a4c0: {ptr_ = 0x7fa03f600260, size_ = 203}, ws_meta=@0x7fa036a1a520: {gtid_ = {id_ = {data_ = {buf = "\364KQY\373\352\021\352\230\360\356\206\255\370Q\265"}}, seqno_ = {seqno_ = 9}}, stid_ = {server_id_ = {data_ = {buf = "\364K\030\t\373\352\021\352\234I\022J\364\326\277\301"}}, transaction_id_ = {id_ = 46}, client_id_ = {id_ = 17}}, depends_on_ = {seqno_ = 8}, flags_ = 3}, ws_handle=@0x7fa036a1a4d0: {transaction_id_ = {id_ = 46}, opaque_ = 0x7f9fb801ba00}, this=0x7fa036a1aed0) at /mariadb/10.4/wsrep-lib/include/wsrep/high_priority_service.hpp:47
|
#28 (anonymous namespace)::apply_cb (ctx=0x7fa036a1aed0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7fa036a1a9c0) at /mariadb/10.4/wsrep-lib/src/wsrep_provider_v26.cpp:502
|
#29 0x00007fa04434a8d1 in ?? () from /usr/lib/galera/libgalera_smm.so
|
#30 0x00007fa044395134 in ?? () from /usr/lib/galera/libgalera_smm.so
|
#31 0x00007fa0443af15e in ?? () from /usr/lib/galera/libgalera_smm.so
|
#32 0x000055f2e59289ed in wsrep::wsrep_provider_v26::replay (this=<optimized out>, ws_handle=<optimized out>, reply_service=<optimized out>) at /mariadb/10.4/wsrep-lib/src/wsrep_provider_v26.cpp:204
|
#33 0x000055f2e4fcf9d1 in Wsrep_client_service::replay (this=0x7f9fb8007fb0) at /mariadb/10.4/wsrep-lib/include/wsrep/transaction.hpp:219
|
#34 0x000055f2e59210a9 in wsrep::transaction::replay (this=this@entry=0x7f9fb8008030, lock=@0x7fa036a1b380: {mutex_ = @0x7f9fb8007f90, locked_ = false}) at /mariadb/10.4/wsrep-lib/src/transaction.cpp:1703
|
#35 0x000055f2e5923632 in wsrep::transaction::after_statement (this=this@entry=0x7f9fb8008030) at /mariadb/10.4/wsrep-lib/src/transaction.cpp:816
|
#36 0x000055f2e590d459 in wsrep::client_state::after_statement (this=this@entry=0x7f9fb8007fc8) at /mariadb/10.4/wsrep-lib/src/client_state.cpp:246
|
#37 0x000055f2e4dec561 in wsrep_after_statement (thd=0x7f9fb8001e28) at /mariadb/10.4/sql/sql_class.h:4913
|
#38 wsrep_mysql_parse (thd=thd@entry=0x7f9fb8001e28, rawbuf=0x7f9fb8012640 "commit", length=6, parser_state=parser_state@entry=0x7fa036a1b530, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /mariadb/10.4/sql/sql_parse.cc:7730
|
#39 0x000055f2e4dede27 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7f9fb8001e28, packet=packet@entry=0x7f9fb80092e9 "commit", packet_length=916567344, packet_length@entry=6, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /mariadb/10.4/sql/sql_class.h:1168
|
#40 0x000055f2e4df0710 in do_command (thd=0x7f9fb8001e28) at /mariadb/10.4/sql/sql_parse.cc:1352
|
#41 0x000055f2e4f16c04 in do_handle_one_connection (connect=connect@entry=0x55f2e87d8238) at /mariadb/10.4/sql/sql_connect.cc:1412
|
#42 0x000055f2e4f16d29 in handle_one_connection (arg=arg@entry=0x55f2e87d8238) at /mariadb/10.4/sql/sql_connect.cc:1316
|
#43 0x000055f2e58633a3 in pfs_spawn_thread (arg=0x55f2e88125c8) at /mariadb/10.4/storage/perfschema/pfs.cc:1869
|
#44 0x00007fa049b75ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
|
#45 0x00007fa04933feaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
|
|
seppo The test case might dig a hole for itself, but it would be good to analyze this anyway. I could not repeat this crash, but it has been caught with
nice ./mtr --parallel=60 --big-test --force --suite=galera,galera_sr,galera_3nodes
|
. I just do not have access to a machine where running this would not slow the machine down too much.
|
|
http://buildbot.askmonty.org/buildbot/builders/kvm-rpm-centos74-amd64-debug/builds/4938/steps/mtr-galera/logs/stdio
|
|
allen.lee@mariadb.com Yes, your assertion looks very similar.
|
|
Hi, my 3 node galera cluster just crashed in a similar way:
Nov 18 17:00:27 phobos-b mysqld[20103]: 2020-11-18 17:00:27 0x7f10301bf700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.4.17/storage/innobase/lock/lock0lock.cc line 687
This happened after a cron job did about 300 transactions massively deleting data and reimporting. On the node the job was connected to, everything worked, but the replicas were corrupted by this error; the cluster then lost quorum and went down.
Having rebuilt from scratch, I can reproduce this every time. Should I open a different issue, or is it the same problem as this one? Thanks
|
|
@Gianni Your assertion sounds like the same problem we are currently facing. I created the issue MDEV-24229 for this assertion.
|
|
We also faced this issue (mariadb / galera, active/active clusters, 3 nodes), on mariadb 1:10.5.8+maria~stretch
We also found that version 1:10.3.27-0+deb10u1 is impacted (much less, but we had some crashes too), but 1:10.3.25-0+deb10u1 was NOT impacted.
We hit this with the following sequence, always against the same table:
- one DELETE statement (possibly multiple records)
- several INSERT statements
- all targeting a table with a PK (id) and a UNIQUE key (several fields)
- the crash always relates to the UNIQUE key (random dates, random values)
All crashes refer to the same thing. We ran some tests to try to work around this on 10.5.8:
- Initially we were using parameterized queries: lots of crashes
- Then we moved to raw SQL statements: fewer crashes, but still crashes
- Then we removed the DELETE statement and replaced the INSERT statements with REPLACE statements (UNIQUE key based, I assume): initially no more crashes, but eventually still crashes
- We also tested targeting a single node (instead of load balancing requests across the 3 Galera nodes): got one crash in 24h (so fewer, but still crashing)
Logs (extract, obfuscated):
2020-12-04 9:05:15 9 [ERROR] InnoDB: Conflicting lock on table: `xxx`.`xxx` index: uniq_xxx that has lock
|
2020-12-04 09:05:15 0x7f5926a31700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.5.8/storage/innobase/lock/lock0lock.cc line 655
|
Server version: 10.5.8-MariaDB-1:10.5.8+maria~stretch-log
|
Connection ID (thread ID): 9
|
Status: NOT_KILLED
|
Stack
stack_bottom = 0x7f5926a30cb8 thread_stack 0x40000
|
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5591a9422cbe]
|
/usr/sbin/mariadbd(handle_fatal_signal+0x38f)[0x5591a8e48caf]
|
??:0(__restore_rt)[0x7f5961bf90e0]
|
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7f595fe0bfff]
|
linux/raise.c:51(__GI_raise)[0x7f595fe0d42a]
|
/usr/sbin/mariadbd(+0x60226e)[0x5591a8b3a26e]
|
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x5591a91792bd]
|
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x5591a9179f0f]
|
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x5591a9180350]
|
??:0(std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_erase(std::_Rb_tree_node<unsigned int>*))[0x5591a92a3e3c]
|
??:0(std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_erase(std::_Rb_tree_node<unsigned int>*))[0x5591a92ada2a]
|
??:0(std::unique_lock<std::mutex>::unlock())[0x5591a91e730e]
|
??:0(std::unique_lock<std::mutex>::unlock())[0x5591a91ed62d]
|
??:0(std::unique_lock<std::mutex>::unlock())[0x5591a91edcb4]
|
??:0(std::unique_lock<std::mutex>::unlock())[0x5591a920129c]
|
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x5591a913d815]
|
??:0(handler::ha_write_row(unsigned char const*))[0x5591a8e55185]
|
??:0(Rows_log_event::write_row(rpl_group_info*, bool))[0x5591a8f545a5]
|
??:0(Write_rows_log_event::do_exec_row(rpl_group_info*))[0x5591a8f5492d]
|
??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x5591a8f4990f]
|
??:0(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x5591a91066c1]
|
??:0(Wsrep_applier_service::apply_write_set(wsrep::ws_meta const&, wsrep::const_buffer const&, wsrep::mutable_buffer&))[0x5591a90eeae4]
|
??:0(wsrep::server_state::start_streaming_applier(wsrep::id const&, wsrep::transaction_id const&, wsrep::high_priority_service*))[0x5591a94af42a]
|
??:0(wsrep::server_state::on_apply(wsrep::high_priority_service&, wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x5591a94affa8]
|
??:0(wsrep::wsrep_provider_v26::options[abi:cxx11]() const)[0x5591a94c40f2]
|
src/trx_handle.cpp:387(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*), wsrep_trx_meta const&, bool&))[0x7f595633dff3]
|
src/replicator_smm.cpp:504(galera::ReplicatorSMM::apply_trx(void*, galera::TrxHandleSlave&))[0x7f59563904d6]
|
src/replicator_smm.cpp:2145(galera::ReplicatorSMM::process_trx(void*, boost::shared_ptr<galera::TrxHandleSlave> const&))[0x7f5956398731]
|
src/gcs_action_source.cpp:62(galera::GcsActionSource::process_writeset(void*, gcs_action const&, bool&))[0x7f5956366484]
|
src/gcs_action_source.cpp:110(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&))[0x7f59563672df]
|
src/gcs_action_source.cpp:29(Release::~Release())[0x7f5956367839]
|
src/replicator_smm.cpp:390(galera::ReplicatorSMM::async_recv(void*))[0x7f59563930e3]
|
src/wsrep_provider.cpp:263(galera_recv)[0x7f59563b21db]
|
??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x5591a94c4cae]
|
??:0(wsrep_fire_rollbacker)[0x5591a9108713]
|
??:0(start_wsrep_THD(void*))[0x5591a90f871f]
|
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x5591a9082ce1]
|
nptl/pthread_create.c:456(start_thread)[0x7f5961bef4a4]
|
x86_64/clone.S:99(clone)[0x7f595fec1d0f]
|
Maybe this helps.
Regards
|
|
@Champax are you sure about your last point? For me, targeting a single node will crash the cluster anyway. It takes a little longer but eventually it will crash too. Also, my workload is very similar to what you have described.
|
|
Sorry, we still have crashes with a single node (one in 24h, so much less frequent but still present); I updated my comment.
|
|
Another update today:
- The REPLACE INTO (raw SQL mode) also causes conflicting locks... both on 10.5.8 and 10.3.27 (I updated my previous comment)
mysql-error RECORD LOCKS space id 1018 page no 21 n bits 240 index uniq_kpi of table `xxx`.`xxx` trx id 275445219 lock mode S
|
mysql-error Record lock, heap no 168 PHYSICAL RECORD: n_fields 6; compact format; info bits 0
|
mysql-error 0: len 5; hex 4630303031; asc XXXXX;;
|
mysql-error 1: len 3; hex 444159; asc DAY;;
|
mysql-error 2: len 3; hex 8fc988; asc ;;
|
mysql-error 3: len 25; hex 7469636b65742e67726f75702e636f756e742e6f70656e6564; asc xxx;;
|
mysql-error 4: len 1; hex 34; asc 4;;
|
mysql-error 5: len 30; hex 363333343936346135666261343134303933366531616634383835323733; asc 6334964a5fba4140936e1af4885273; (total 32 bytes);
|
mysql-error 2020-12-09 7:38:26 8 [ERROR] InnoDB: WSREP state:
|
mysql-error 2020-12-09 7:38:26 8 [ERROR] WSREP: Thread BF trx_id: 275445221 thread: 8 seqno: 121660287 client_state: exec client_mode: high priority transaction_mode: executing applier: 1 toi: 0 local: 0 query: REPLACE INTO xxxx (xxx, xxx, xxx, xxx, xxx, xxx, xxx, id) VALUES (TIMESTAMP'2020-12-09 03:46:07', 'xxx', 4.0, DATE'2020-12-08', 'DAY', 'xxx', '4', 'd5581785bc7f4295accf52f8508de9c5')H _
|
mysql-error 2020-12-09 7:38:26 8 [ERROR] WSREP: Thread BF trx_id: 275445219 thread: 11 seqno: 121660274 client_state: exec client_mode: high priority transaction_mode: ordered_commit applier: 1 toi: 0 local: 0 query: NULL
|
mysql-error 2020-12-09 07:38:26 0x7f85c41dd700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.5.8/storage/innobase/lock/lock0lock.cc line 655
|
|
|
Quick update:
We had a major outage tonight on 10.5.8, with 2 members totally unable to rejoin the Galera cluster, looping on restart/crash related to this issue,
and we also had the same on 10.3.27.
We had to roll back binaries from 10.5.8 down to 10.3.25 (with full reinstall, cluster re-setup and database restores...).
We will do the same for 10.3.27, down to 10.3.25.
We checked the Debian changelog between 10.3.25 and .27 but didn't find any clue as to which fix may cause this (I suspect CVE-2020-14812 since it is locking related, but didn't find anything on the MariaDB side about this).
mariadb-10.3 (1:10.3.27-0+deb10u1) buster; urgency=medium
|
|
* New upstream version 10.3.27. Includes fixes to serious regressions in
|
MariaDB 10.3.26 that corrupted data or made server unable to start.
|
* SECURITY UPDATE: Includes new upstream 10.3.26 which included fixes for
|
the following security vulnerabilities:
|
- CVE-2020-28912
|
- CVE-2020-14812
|
- CVE-2020-14789
|
- CVE-2020-14776
|
- CVE-2020-14765
|
* Upstream 10.3.26 included:
|
- Fix mytop shebang (Closes: #972780, Closes: #970681)
|
* Remove 3 patches applied upstream.
|
|
-- Otto Kekäläinen <otto@debian.org> Wed, 25 Nov 2020 11:50:19 +0200
|
|
mariadb-10.3 (1:10.3.25-0+deb10u1) buster-security; urgency=high
|
|
* SECURITY UPDATE: New upstream version 10.3.25. Includes fixes for the
|
following security vulnerabilities:
|
- CVE-2020-15180
|
* Salsa-CI: Update to be compatible with latest master/salsa-ci.yml
|
|
-- Otto Kekäläinen <otto@debian.org> Mon, 12 Oct 2020 20:46:30 +0300
|
From our end, this is critical; we have never seen anything like it in our clusters in years. We hope this will be properly fixed in the next release.
Thanks in advance, guys
|
|
Table schema (table name / fields renamed) on our end:
CREATE TABLE `xxxxxxxxx` (
|
`id` char(32) NOT NULL,
|
`aaaaaaaaaaa` datetime NOT NULL DEFAULT current_timestamp(),
|
`bbbbbbbb` varchar(32) NOT NULL,
|
`ccccccccc` float NOT NULL,
|
`ddddddddddddddddd` date NOT NULL,
|
`eeeeeeeeeee` varchar(32) NOT NULL,
|
`ffffffffff` varchar(128) NOT NULL,
|
`ggggggggggggg` varchar(128) NOT NULL,
|
PRIMARY KEY (`id`),
|
UNIQUE KEY `uniq_kpi` (`bbbbbbbb`,`eeeeeeeeeee`,`ddddddddddddddddd`,`ffffffffff`,`ggggggggggggg`)
|
) ENGINE=InnoDB DEFAULT CHARSET=utf8
|
The record count is low: 1361 records.
This query caused a conflicting lock 82 times yesterday (out of a total of 424 conflicting locks across the 16 distinct queries involved):
82 REPLACE INTO xxxxxxxxx (aaaaaaaaaaa, bbbbbbbb, ccccccccc, ddddddddddddddddd, eeeeeeeeeee, ffffffffff, ggggggggggggg, id) VALUES (TIMESTAMP'2020-12-09 03:12:08', 'ZZZZZ', 2.0, DATE'2020-12-08', 'DAY', 'AAAAAAAAAAAAAAAAAAAAAAAAA', '1056', '2d06f9fe67014ef8b950dd731bcbe79a')
|
|
|
Hi! We have a manifestation of this issue on a new setup.
*Situation*
- 10.5 mariadb does (mysql) slaving from a relatively busy 10.3 setup
- after stopping the slaving, I can (galera) join a node to the started 10.5 cluster
- starting the (mysql) slaving again causes a partial crash on the joiner, and broken slaving on the donor (and other lockups)
*Logs*
joiner:
2020-12-17 17:01:32 2 [Note] WSREP: Server $JOINER connected to cluster at position ...
|
...
|
2020-12-17 17:01:33 2 [Note] WSREP: Requesting state transfer: success, donor: 1
|
...
|
2020-12-17 17:40:58 0 [Note] WSREP: Start replication
|
...
|
2020-12-17 17:41:01 2 [ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required.
|
2020-12-17 17:41:01 0 [Note] WSREP: Joiner monitor thread ended with total time 2 sec
|
...
|
2020-12-17 17:41:29 0 [Note] WSREP: Member 0.0 ($JOINER) requested state transfer from '$IP,'. Selected 1.0 ($DONOR)(SYNCED) as donor.
|
...
|
2020-12-17 18:07:08 2 [Note] WSREP: Synchronized with group, ready for connections
|
...
|
2020-12-17 18:08:16 0 [Note] InnoDB: Buffer pool(s) load completed at 201217 18:08:16
|
donor:
2020-12-17 17:07:01 14 [Note] Error reading relay log event: slave SQL thread was killed
|
2020-12-17 17:07:01 14 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.001427' at position 440216782
|
2020-12-17 17:07:01 13 [Note] Slave I/O thread exiting, read up to log 'mariadb-bin.001427', position 440216782
|
...
|
2020-12-17 17:40:58 0 [Note] WSREP: SSL handshake successful, remote endpoint ...
|
...
|
2020-12-17 18:01:29 2 [Note] WSREP: Server $JOINER synced with group
|
2020-12-17 18:01:29 2 [Note] WSREP: Server status change joined -> synced
|
2020-12-17 18:01:29 2 [Note] WSREP: Synchronized with group, ready for connections
|
2020-12-17 18:01:29 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
2020-12-17 18:01:29 0 [Note] WSREP: Donor monitor thread ended with total time 1200 sec
|
2020-12-17 18:07:08 0 [Note] WSREP: async IST sender served
|
2020-12-17 18:07:08 0 [Note] WSREP: 0.0 ($JOINER): State transfer from 1.0 ($DONOR) complete.
|
2020-12-17 18:07:08 0 [Note] WSREP: Member 0.0 ($JOINER) synced with group.
|
All good so far... or at least, it appears to be good. Opening mysql connections to both nodes works fine.
But then, when starting slaving again:
donor:
2020-12-17 18:10:11 61 [Note] Slave I/O thread: Start asynchronous replication to master ...
|
2020-12-17 18:10:11 61 [Note] Slave I/O thread: connected to master ...
|
2020-12-17 18:10:11 62 [Note] Slave SQL thread initialized, starting replication in log ...
|
joiner:
2020-12-17 18:12:51 7 [ERROR] InnoDB: Conflicting lock on table: `$DB`.`$TABLE` index: $IDX that has lock
|
RECORD LOCKS space id 982458 page no 10 n bits 576 index $IFX of table `$DB`.`$TABLE` trx id 41747597472 lock mode S
|
Record lock, heap no 237 PHYSICAL RECORD: n_fields 3; compact format; info bits 32
|
0: len 4; hex 80e76b82; asc k ;;
|
1: len 4; hex 80000115; asc ;;
|
2: len 8; hex 000000000b55c220; asc U ;;
|
|
Record lock, heap no 238 PHYSICAL RECORD: n_fields 3; compact format; info bits 32
|
0: len 4; hex 80e76b82; asc k ;;
|
1: len 4; hex 80000287; asc ;;
|
2: len 8; hex 000000000b55c1c9; asc U ;;
|
|
2020-12-17 18:12:51 7 [ERROR] InnoDB: WSREP state:
|
2020-12-17 18:12:51 7 [ERROR] WSREP: Thread BF trx_id: 41747597473 thread: 7 seqno: 7577357 client_state: exec client_mode: high priority transaction_mode: executing applier: 1 toi: 0 local: 0 query: UPDATE `$OTHER_TABLE` SET `X` = 0 , `Y` = 0 , `Z` = 4 WHERE ( $IDX = 15166338 ) AND ( $OTHER_ID = 101 )<A4><8D><DB>_^S^A
|
2020-12-17 18:12:51 7 [ERROR] WSREP: Thread BF trx_id: 41747597472 thread: 10 seqno: 7577359 client_state: exec client_mode: high priority transaction_mode: committing applier: 1 toi: 0 local: 0 query: NULL
|
2020-12-17 18:12:51 0x7f1754c96700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.5.8/storage/innobase/lock/lock0lock.cc line 655
|
At this point:
- the joiner is unusable, no more connections can be initiated
- the donor is locked somehow/somewhere:
- SHOW SLAVE STATUS shows an increasing Seconds_Behind_Master (last command: "closing table")
- SHOW ENGINE INNODB STATUS locks up
*Notes*
No queries are done on the 10.5 nodes (other than the slaving and the galera replication, and possibly a monitoring ping).
I don't have a clue what the cause might be. I did a successful mysqlrepair -a -A on the donor before the last join attempt. (In case the problem was a bad sync from 10.3.)
Without any (galera) joining nodes, the (single) donor slaves just fine from 10.3. The 10.3 data may have nothing to do with the problem at all. It could simply be the amount/type of traffic/queries.
I see six garbage characters after the query in the joiner log (a slight buffer overflow?):
2020-12-17 18:12:51 7 [ERROR] WSREP: Thread BF trx_id: 41747597473
|
thread: 7 seqno: 7577357 client_state:
|
exec client_mode: high priority transaction_mode:
|
executing applier: 1 toi: 0 local: 0 query:
|
UPDATE `$OTHER_TABLE` SET `X` = 0 , `Y` = 0 , `Z` = 4
|
WHERE ( $IDX = 15166338 ) AND
|
( $OTHER_ID = 101 )<A4><8D><DB>_^S^A
|
^- that <A4><8D><DB>_^S^A
I see those in the OP's log as well ({{ WHERE id > 0�Bs_ }}). They may very well be unrelated to the problem.
*Versions*
ii mariadb-server 1:10.5.8+maria~focal
|
|
mariadb Ver 15.1 Distrib 10.5.8-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
|
|
|
Walter,
I've had to go back to 10.5.6, which has been stable for me in testing for about a week now. It was released around October 7th. The 10.5.8 release apparently has some Galera issues due to some recent CVE remediation.
|
|
Thanks for the feedback Larry. Good to know that there's a working 10.5.x.
|
|
The PRs submitted for MDEV-24229 should fix this as well.
|
|
Something was already pushed to 10.2 and merged up to 10.4 with a reference to this MDEV. In my opinion, it is not a good idea to use an assertion failure as a bug title, because there can be multiple independent reasons for an assertion failure.
The suggested subsequent change to 10.4 reminds me of MDEV-17073. What is the exact scenario of this? Would fixing MDEV-20605 help in this scenario? Could such INSERT be handled better in the SQL layer, as it was done in MDEV-17522 and MDEV-17614? Does this scenario affect 10.2 and 10.3 as well? If not, why not? I think that a separate ticket should be filed for the subsequent change, identifying the exact scenario and explaining why that change does not affect earlier versions.
I am not happy to see trx_t being expanded with one more Galera-related field. Could the field be added to ins_node_t instead? If we really need a new data member in trx_t, then a MEM_NOACCESS must be added to trx_t::free(), along with some debug assertion to check that the data member has been reset before the object is being released to the cache.
I would prefer to see the subsequent problem fixed at the SQL layer, because I do not think that anything changed with respect to locking in MariaDB’s InnoDB storage engine between 10.3 and 10.4. The only relevant difference would seem to be that Galera was upgraded from version 3 to 4.
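To illustrate the MEM_NOACCESS suggestion above, here is a minimal toy sketch (not MariaDB source; the field name `wsrep_uk_conflict` and the struct are assumptions for illustration): before the object is returned to a cache, a debug assertion verifies the Galera-related field has been reset, and the memory is then poisoned so stale accesses are caught. MariaDB's MEM_NOACCESS expands to a memory-checker annotation; it is stubbed here with a poison pattern so the sketch is self-contained.

```cpp
#include <cassert>
#include <cstring>

// Stub for the real annotation (e.g. VALGRIND_MAKE_MEM_NOACCESS in MariaDB);
// here we just overwrite the bytes with a poison value.
#define MEM_NOACCESS(addr, size) std::memset((addr), 0xDB, (size))

// Toy stand-in for trx_t; the field name is hypothetical.
struct toy_trx_t {
    bool wsrep_uk_conflict;  // assumed name for the new Galera-related field

    void free() {
        // Debug-build guarantee: the field must be reset before the object
        // is released back to the transaction cache.
        assert(!wsrep_uk_conflict);
        MEM_NOACCESS(this, sizeof *this);  // poison until reuse
    }
};
```

This is only a model of the pattern being requested, not the actual patch.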
|
|
In this case we have two write sets that were already certified on another node and are now being applied on this node, while a local transaction replays: two subsequent write sets being applied conflict with a local transaction in its commit phase. The conflict is a "false positive" conflict on a GAP lock in a secondary unique index. The first applier causes a BF abort of the local committer, which starts replaying because of its positive certification. In the buggy version the scenario continues: while the local transaction is replaying, the latter applier experiences a similar UK GAP lock conflict and forces the replayer to abort a second time. In the fixed version, this latter BF abort does not happen.
Thus this case is not similar to MDEV-17073, and fixing MDEV-20605 would not help. The replay itself is already handled in the SQL layer, but locking during replay is handled as usual inside InnoDB. All versions from 10.2 are affected, i.e. both Galera 3 and Galera 4 (there are minor differences in the patch and test case).
In my opinion we can't add the new Galera-related field to ins_node_t, as the actual false positive GAP lock conflict is detected in the lock_rec_has_to_wait() function, which does not have access to it. I added the MEM_NOACCESS and debug assertion. Updated and rebased the versions.
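The invariant behind the failing check can be sketched as a toy model (assumed names, not MariaDB source): a high-priority (BF) applier should never need to wait for a lock held by another BF transaction, because certification already ordered the write sets; the scenario described above is the exception, a false-positive gap-lock conflict on a secondary unique index during replay, which the fix tolerates instead of asserting.

```cpp
#include <cassert>

// Toy stand-in for the lock state examined by wsrep_assert_no_bf_bf_wait().
struct toy_lock {
    bool bf;   // held/requested by a BF (applier or replayer) transaction
    bool gap;  // gap lock on a secondary unique index
};

// Sketch of the decision: before the fix, any BF-BF wait fired the assertion
// seen at lock0lock.cc in the traces above; after the fix, a unique-key
// gap-lock conflict between two BF transactions is treated as a false
// positive rather than BF-aborting the replayer a second time.
bool bf_bf_wait_tolerated(const toy_lock& waiting, const toy_lock& held) {
    if (!(waiting.bf && held.bf))
        return true;                 // not a BF-BF wait at all: always legal
    return waiting.gap || held.gap;  // tolerated only as a UK gap conflict
}
```

Again, this is only a model of the reasoning, not the code of the actual fix.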
|
|
Please address my comments regarding the InnoDB changes in the 10.2 version.
|
|
Review comments addressed on
|
|
I noticed that the 10.2 and 10.4 versions are somewhat out of sync with each other. This should be OK to push after addressing my comments for the 10.2 version.
|
|
10.4 : https://github.com/MariaDB/server/commit/ff49ea7b101dd15d153dfde0ca53cedc6462a1eb
|
|
10.3: https://github.com/MariaDB/server/commit/ca54dcdc3a911ba6f9aa0db77d8a34cc9ae42c80
|
|
Is this fix included in the latest release of 10.4?
|
|
It will be available in upcoming 10.4.18...
|
|
We are currently using 10.4.17 in our production environment and facing this issue very frequently. Can anyone please suggest which release we should switch to that provides a stable 3-node HA environment?
|
|
@usmanc9 From previous comments and our own tests, it seems that the updates from 2020-11-04 in all (?) minor versions are affected by this. We were constantly getting the issue in 10.5.8, but downgraded to the October release of 10.5.6. Since then we have not seen the issue.
I would suggest testing 10.4.15 (I'm unable to confirm myself that it works fine in this version though).
|
|
Could someone confirm if this: MDEV-24892 is the same crash?
Are there any rough ideas when the fixed versions will get officially released? I know release estimates are hard.
Thanks for your time.
:edit: never mind, this is actually already released and already deployed in our environment (thanks to automatic CD), the question is still if these are the same crashes, but I will discuss that in the linked bug report.
|
|
@Usman What version are you currently using?
|
|
@Dianna We were using 10.4.17. Now we have downgraded to 10.4.15 (no issues so far)
|
|
We believe we've started encountering this in 10.5.8. We recently downgraded from 10.5.9 to 10.5.8 because of the bug MDEV-25368.
If this persists, we'll likely downgrade further to 10.5.6, and will update here if that resolves it for us.
|
|
MDEV-25368 should really receive some priority as cluster hanging seems to be a common theme from 10.5.8 onwards.
|