[MDEV-23872] Crash in galera::TrxHandle::state() Created: 2020-10-02  Updated: 2021-04-26  Resolved: 2020-11-03

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.15, 10.5.5
Fix Version/s: 10.4.16, 10.5.7, 10.6.0

Type: Bug Priority: Critical
Reporter: Valerii Kravchuk Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mariadb_ps.cpp    

 Description   

Server crashes regularly:

2020-09-30 19:01:16 1027 [Warning] Aborted connection 1027 to db: 'gleif_oais' user: 'oais.import' host: '10.138.101.50' (Got an error reading communication packets)
200930 19:10:24 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.5-MariaDB-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=186
max_threads=65546
thread_count=271
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 344005449 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f6e44156ba8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fd58d909b18 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55ac79a5bdee]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x55ac794e6ec5]
2020-09-30 19:10:28 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.6731S), skipping check
/lib64/libpthread.so.0(+0x12dd0)[0x7fd799a18dd0]
2020-09-30 19:10:31 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.82415S), skipping check
/usr/lib64/galera-4/libgalera_smm.so(+0x1c772b)[0x7fd795c8872b]
/usr/lib64/galera-4/libgalera_smm.so(+0x1de42e)[0x7fd795c9f42e]
2020-09-30 19:10:34 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.89535S), skipping check
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v266replayERKNS_9ws_handleEPNS_21high_priority_serviceE+0x29)[0x55ac79af5df9]
/usr/sbin/mariadbd(_ZN20Wsrep_client_service6replayEv+0xf6)[0x55ac79767136]
/usr/sbin/mariadbd(_ZN5wsrep11transaction6replayERNS_11unique_lockINS_5mutexEEE+0x87)[0x55ac79af0ed7]
/usr/sbin/mariadbd(_ZN5wsrep11transaction15after_statementEv+0x1f0)[0x55ac79af2290]
/usr/sbin/mariadbd(_ZN5wsrep12client_state15after_statementEv+0xa3)[0x55ac79ad2c73]
/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x1052)[0x55ac792f7547]
/usr/sbin/mariadbd(_Z10do_commandP3THD+0x110)[0x55ac792f924f]
/usr/sbin/mariadbd(_Z11tp_callbackP13TP_connection+0x68)[0x55ac79478678]
/usr/sbin/mariadbd(+0xac5da0)[0x55ac79663da0]
/usr/sbin/mariadbd(+0xb67d5a)[0x55ac79705d5a]
2020-09-30 19:10:36 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.89475S), skipping check
/lib64/libpthread.so.0(+0x82de)[0x7fd799a0e2de]
2020-09-30 19:10:39 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.91365S), skipping check
/lib64/libc.so.6(clone+0x43)[0x7fd79787be83]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): 
Connection ID (thread ID): 1
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...

and in the full backtrace from the core we see:

Thread 1 (Thread 0x7ed4840a5700 (LWP 3200149)):
#0 0x00007f3b2c9fda65 in pthread_kill () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000056049aec89a7 in my_write_core (sig=<optimized out>) at /usr/src/debug/MariaDB-10.5.5/src_0/mysys/stacktrace.c:519
No locals.
#2 0x000056049a953ee0 in handle_fatal_signal (sig=11) at /usr/src/debug/MariaDB-10.5.5/src_0/sql/signal_handler.cc:330
curr_time = 1601573573
tm = {tm_sec = 53, tm_min = 32, tm_hour = 19, tm_mday = 1, tm_mon = 9, tm_year = 120, tm_wday = 4, tm_yday = 274, tm_isdst = 1, tm_gmtoff = 7200, tm_zone = 0x56049c6642d0 "CEST"}
thd = 0x7ed3541b26f8
print_invalid_query_pointer = false
#3 <signal handler called>
No symbol table info available.
#4 galera::TrxHandle::state (this=<optimized out>) at galera/src/trx_handle.hpp:197
No locals.
#5 galera::ReplicatorSMM::replay_trx (this=0x56049c6c9df0, trx=..., lock=..., trx_ctx=0x7ed4840a3c10) at galera/src/replicator_smm.cpp:1054
tsp = {px = 0x0, pn = {pi_ = 0x0}}
ts = <optimized out>
__FUNCTION__ = "replay_trx"
retval = WSREP_OK
#6 0x00007f3b28c8742e in galera_replay_trx (gh=<optimized out>, trx_handle=<optimized out>, recv_ctx=0x7ed4840a3c10) at galera/src/wsrep_provider.cpp:311
lock = {trx_ = @0x7ed4dc218560, locked_ = true}
repl = 0x56049c6c9df0
trx = 0x7ed4dc218560
__FUNCTION__ = "galera_replay_trx"
retval = <optimized out>
#7 0x000056049af62df9 in wsrep::wsrep_provider_v26::replay (this=<optimized out>, ws_handle=..., reply_service=<optimized out>) at /usr/src/debug/MariaDB-10.5.5/src_0/wsrep-lib/src/wsrep_provider_v26.cpp:204
mwsh = {ws_handle_ = @0x7ed3541b8850, native_ = {trx_id = 18583, opaque = 0x7ed4dc218560}}
#8 0x000056049abd4136 in Wsrep_client_service::replay (this=<optimized out>) at /usr/src/debug/MariaDB-10.5.5/src_0/wsrep-lib/include/wsrep/transaction.hpp:219
replayer_service = {<Wsrep_high_priority_service> = {<wsrep::high_priority_service> = {_vptr.high_priority_service = 0x56049b870460 <vtable for Wsrep_replayer_service+16>, server_state_ = @0x56049c694bd0,
must_exit_ = false}, <wsrep::high_priority_context> = {_vptr.high_priority_context = 0x56049b870520 <vtable for Wsrep_replayer_service+208>, client_ = @0x7ed3541b8788, orig_mode_ = wsrep::client_state::m_local},
m_thd = 0x7ed3541b26f8, m_rli = 0x7ed3541b8b20, m_rgi = 0x7ed3541bb860, m_shadow = {option_bits = 2147748608, server_status = 2, vio = 0x0, tx_isolation = 1, db = 0x0, db_length = 0, user_time = {val = 0},
row_count_func = -1, wsrep_applier = true}}, m_orig_thd = 0x7ed354000d38, m_da_shadow = {status = Diagnostics_area::DA_OK_BULK, affected_rows = 0, last_insert_id = 0, message = '\000' <repeats 511 times>},
m_replay_status = wsrep::provider::success}
provider = <optimized out>
replayer_thd = 0x7ed3541b26f8
ret = <optimized out>
#9 0x000056049af5ded7 in wsrep::transaction::replay (this=0x7ed354006e30, lock=...) at /usr/src/debug/MariaDB-10.5.5/src_0/wsrep-lib/src/transaction.cpp:1703
ret = 0
was_streaming = false
replay_ret = <optimized out>
#10 0x000056049af5f290 in wsrep::transaction::after_statement (this=this@entry=0x7ed354006e30) at /usr/src/debug/MariaDB-10.5.5/src_0/wsrep-lib/src/transaction.cpp:816
ret = 0
lock = {mutex_ = @0x7ed354006d90, locked_ = false}
#11 0x000056049af3fc73 in wsrep::client_state::after_statement (this=0x7ed354006dc8) at /usr/src/debug/MariaDB-10.5.5/src_0/wsrep-lib/src/client_state.cpp:246
lock = {mutex_ = @0x7ed354006d90, locked_ = false}
#12 0x000056049a764547 in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool) () at /usr/src/debug/MariaDB-10.5.5/src_0/wsrep-lib/include/wsrep/client_state.hpp:854
TIME_INVALID_DATES = {m_mode = date_conv_mode_t::INVALID_DATES}
std::__ioinit = {static _S_refcount = <optimized out>, static _S_synced_with_stdio = <optimized out>}
TIME_TIME_ONLY = {m_mode = date_conv_mode_t::TIME_ONLY}
TIME_FUZZY_DATES = {m_mode = date_conv_mode_t::FUZZY_DATES}
TIME_NO_ZEROS = <optimized out>
TIME_INTERVAL_hhmmssff = <optimized out>
TIME_INTERVAL_DAY = <optimized out>
TIME_MODE_FOR_XXX_TO_DATE = <optimized out>
TIME_FRAC_NONE = <optimized out>
TIME_FRAC_TRUNCATE = <optimized out>
TIME_FRAC_ROUND = <optimized out>
sp_data_access_name = <optimized out>
TIME_NO_ZERO_IN_DATE = <optimized out>
TIME_NO_ZERO_DATE = <optimized out>
TIME_CONV_NONE = <optimized out>
any_db = 0x56049af872d7 "*any*"
sql_command_flags = 0x56049b93b760 <sql_command_flags>
server_command_flags = 0x56049b93b360 <server_command_flags>
command_name = 0x56049b73dc40 <command_name>
#13 0x000056049a76624f in do_command (thd=thd@entry=0x7ed354000d38) at /usr/src/debug/MariaDB-10.5.5/src_0/sql/sql_parse.cc:1348
return_value = <optimized out>
packet = 0x7ed35444b24c "\372\002"
packet_length = <optimized out>
net = 0x7ed354000fa8
command = COM_STMT_BULK_EXECUTE
__FUNCTION__ = "do_command"
#14 0x000056049a8e5678 in threadpool_process_request (thd=0x7ed354000d38) at /usr/src/debug/MariaDB-10.5.5/src_0/sql/threadpool_common.cc:354
vio = <optimized out>
retval = <optimized out>
retval = <optimized out>
vio = <optimized out>
#15 tp_callback (c=0x560664638080) at /usr/src/debug/MariaDB-10.5.5/src_0/sql/threadpool_common.cc:194
worker_context = {psi_thread = 0x0, mysys_var = 0x7ed354000b60}
thd = 0x7ed354000d38
#16 0x000056049aad0da0 in worker_main (param=param@entry=0x56066457ccd8) at /usr/src/debug/MariaDB-10.5.5/src_0/sql/threadpool_generic.cc:1559
connection = <optimized out>
ts = {tv_sec = 1601573618, tv_nsec = 356159000}
this_thread = {event_count = 81, thread_group = 0x56066457ccd8, next_in_list = 0x0, prev_in_list = 0x56066457cd40, cond = {m_cond = {__data = {{__wseq = 155, __wseq32 = {__low = 155, __high = 0}}, {__g1_start = 153,
__g1_start32 = {__low = 153, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 4, __wrefs = 0, __g_signals = {0, 0}},
__size = "\233\000\000\000\000\000\000\000\231", '\000' <repeats 23 times>, "\004", '\000' <repeats 14 times>, __align = 155}, m_psi = 0x0}, woken = true}
thread_group = 0x56066457ccd8
last_thread = <optimized out>
#17 0x000056049ab72d5a in pfs_spawn_thread (arg=0x5606646380f8) at /usr/src/debug/MariaDB-10.5.5/src_0/storage/perfschema/pfs.cc:2201
typed_arg = 0x5606646380f8
user_arg = 0x56066457ccd8
user_start_routine = 0x56049aad0d10 <worker_main(void*)>
pfs = <optimized out>
klass = <optimized out>
#18 0x00007f3b2c9f62de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#19 0x00007f3b2a863e83 in clone () from /lib64/libc.so.6
No symbol table info available.



 Comments   
Comment by Teemu Ollakka [ 2020-10-29 ]

I can reproduce this crash in the following way:

Possible workaround for this issue would be avoiding use of binary protocol, or if that is not possible, limit use of bulk execute to INSERT/REPLACE.

Comment by Teemu Ollakka [ 2020-10-30 ]

With debug build the server crashes with

mysqld: /home/teemu/work/git/mariadb-server/wsrep-lib/src/transaction.cpp:419: int wsrep::transaction::before_commit(): Assertion `state() == s_executing || state() == s_prepared || state() == s_committing || state() == s_must_abort || state() == s_replaying' failed.

Generated at Thu Feb 08 09:25:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.