[MDEV-29265] Unexpected server crash Created: 2022-08-07  Updated: 2022-08-31

Status: Open
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.6.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: COUNOTTE CEDRIC Assignee: Unassigned
Resolution: Unresolved Votes: 5
Labels: None
Environment:

Ubuntu 22.04


Attachments: File gdb_output.log    

 Description   

In a 2-node Galera cluster, one node crashed. The logs from both nodes follow, starting with the crashed node:

mariadbd: ./wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
220807 12:59:06 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.6.7-MariaDB-2ubuntu1.1-log
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=50
max_threads=5002
thread_count=64
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 11276717 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f2280000c68
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f3bc00e2ca8 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x55a8e7586702]
/usr/sbin/mariadbd(handle_fatal_signal+0x478)[0x55a8e70c14d8]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f3be3990520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f3be39e4a7c]
??:0(__sigaction)[0x7f3be3990476]
??:0(abort)[0x7f3be39767f3]
/lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f3be397671b]
??:0(__assert_fail)[0x7f3be3987e96]
/usr/sbin/mariadbd(_Z14wsrep_bf_abortP3THDS0_+0x607)[0x55a8e736b4c7]
/usr/sbin/mariadbd(wsrep_thd_bf_abort+0x1d)[0x55a8e73715ad]
??:0(wsrep_bf_abort(THD*, THD*))[0x55a8e738e8a2]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55a8e6d4c827]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a8e6d4c979]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a8e741d6c5]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x55a8e741ea0e]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x55a8e74202d1]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x55a8e744f771]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x55a8e745014f]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x55a8e7430d30]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x55a8e7392b74]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55a8e70cedaa]
??:0(handler::ha_update_row(unsigned char const*, unsigned char const*))[0x55a8e71e41ba]
??:0(Update_rows_log_event::do_exec_row(rpl_group_info*))[0x55a8e71d7b87]
??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x55a8e7369470]
??:0(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x55a8e7350fe0]
??:0(Wsrep_high_priority_service::remove_fragments(wsrep::ws_meta const&))[0x55a8e7351e26]
??:0(Wsrep_applier_service::apply_write_set(wsrep::ws_meta const&, wsrep::const_buffer const&, wsrep::mutable_buffer&))[0x55a8e75fbefb]
??:0(wsrep::server_state::start_streaming_applier(wsrep::id const&, wsrep::transaction_id const&, wsrep::high_priority_service*))[0x55a8e760e72e]
/usr/lib/galera/libgalera_smm.so(+0x53b14)[0x7f3bd24b4b14]
/usr/lib/galera/libgalera_smm.so(+0x5bb55)[0x7f3bd24bcb55]
/usr/lib/galera/libgalera_smm.so(+0x67ba8)[0x7f3bd24c8ba8]
/usr/lib/galera/libgalera_smm.so(+0x87595)[0x7f3bd24e8595]
src/trx_handle.cpp:396(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*), wsrep_trx_meta const&, bool&))[0x7f3bd24c0cd0]
/usr/lib/galera/libgalera_smm.so(+0x463c1)[0x7f3bd24a73c1]
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x12)[0x55a8e760edd2]
/usr/sbin/mariadbd(+0xc55577)[0x55a8e736b577]
??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x55a8e735c5a3]
??:0(start_wsrep_THD(void*))[0x55a8e72eb386]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f3be39e2b43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f3be3a74a00]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f3bc614b6cb): UPDATE incidents set    status_last_update='2022-08-07 12:59:05'
                                                , status_changed_users_id='2526'
                                                , status_id=3402
                                            where incidents_id=110443
 
Connection ID (thread ID): 7
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             514410               514410               processes
Max open files            1048576              1048576              files
Max locked memory         524288               524288               bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514410               514410               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
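
When several crash reports need to be compared, the identifying fields of a report like the one above (failed assertion, signal, server version) can be extracted mechanically. The following is a minimal sketch and not part of the original report: the function name `parse_crash_header` and the dictionary keys are illustrative choices, and the regular expressions assume only the line shapes visible in this log.

```python
import re

# Patterns assume only the line shapes seen in the crash log above.
ASSERT_RE = re.compile(
    r"^mariadbd: (?P<file>\S+):(?P<line>\d+): (?P<func>.+): "
    r"Assertion `(?P<expr>.+)' failed\.$"
)
SIGNAL_RE = re.compile(r"\[ERROR\] mysqld got signal (\d+)")
VERSION_RE = re.compile(r"^Server version: (\S+)")


def parse_crash_header(lines):
    """Collect assertion, signal and server version from crash-log lines."""
    info = {}
    for raw in lines:
        text = raw.strip()
        m = ASSERT_RE.match(text)
        if m:
            info["file"] = m.group("file")
            info["line"] = int(m.group("line"))
            info["function"] = m.group("func")
            info["assertion"] = m.group("expr")
            continue
        m = SIGNAL_RE.search(text)
        if m:
            info["signal"] = int(m.group(1))
            continue
        m = VERSION_RE.match(text)
        if m:
            info["version"] = m.group(1)
    return info


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "mariadbd: ./wsrep-lib/include/wsrep/client_state.hpp:668: "
    "int wsrep::client_state::bf_abort(wsrep::seqno): "
    "Assertion `mode_ == m_local || transaction_.is_streaming()' failed.",
    "220807 12:59:06 [ERROR] mysqld got signal 6 ;",
    "Server version: 10.6.7-MariaDB-2ubuntu1.1-log",
]

if __name__ == "__main__":
    for key, value in parse_crash_header(SAMPLE).items():
        print(f"{key}: {value}")
```

Running the same function over another report makes it easy to check whether two crashes hit the same assertion at the same source line, even when the build path prefix differs.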

Here is the other node's log:

2022-08-07 12:59:21 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 368fe320-a19b with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1813 rttvar: 3317 rto: 204000 lost: 0 last_data_recv: 3008 cwnd: 10 last_queued_since: 4874349 last_delivered_since: 3005087293 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2022-08-07 12:59:21 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.1:4567
2022-08-07 12:59:22 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') reconnecting to 368fe320-a19b (tcp://192.168.0.1:4567), attempt 0
2022-08-07 12:59:23 0 [Note] WSREP: evs::proto(2c964753-8f33, OPERATIONAL, view_id(REG,2c964753-8f33,6)) suspecting node: 368fe320-a19b
2022-08-07 12:59:23 0 [Note] WSREP: evs::proto(2c964753-8f33, OPERATIONAL, view_id(REG,2c964753-8f33,6)) suspected node without join message, declaring inactive
2022-08-07 12:59:24 0 [Note] WSREP: view(view_id(NON_PRIM,2c964753-8f33,6) memb {
        2c964753-8f33,0
} joined {
} left {
} partitioned {
        368fe320-a19b,0
})
2022-08-07 12:59:24 0 [Note] WSREP: view(view_id(NON_PRIM,2c964753-8f33,7) memb {
        2c964753-8f33,0
} joined {
} left {
} partitioned {
        368fe320-a19b,0
})
2022-08-07 12:59:24 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-08-07 12:59:24 0 [Note] WSREP: Flow-control interval: [16, 16]
2022-08-07 12:59:24 0 [Note] WSREP: Received NON-PRIMARY.
2022-08-07 12:59:24 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 19818073)
2022-08-07 12:59:24 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-08-07 12:59:24 58815930 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 0 [Note] WSREP: Flow-control interval: [16, 16]
2022-08-07 12:59:24 0 [Note] WSREP: Received NON-PRIMARY.
2022-08-07 12:59:24 58815935 [Warning] WSREP: Send action {(nil), 1272, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58815943 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58815977 [Warning] WSREP: Send action {(nil), 784, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816029 [Warning] WSREP: Send action {(nil), 792, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816100 [Warning] WSREP: Send action {(nil), 1928, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816107 [Warning] WSREP: Send action {(nil), 840, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816112 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 14 [Note] WSREP: ================================================
View:
  id: 2c96698a-0fdf-11ed-90d6-7ecc8fa70984:19818073
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
        0: 2c964753-0fdf-11ed-8f33-fb3ea59a5a33, ovh6.1check.com
=================================================
2022-08-07 12:59:24 14 [Note] WSREP: Non-primary view
2022-08-07 12:59:24 58816115 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816199 [Warning] WSREP: Send action {(nil), 1256, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 14 [Note] WSREP: Server status change synced -> connected
2022-08-07 12:59:24 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-08-07 12:59:24 58816203 [Warning] WSREP: Send action {(nil), 496, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-08-07 12:59:24 14 [Note] WSREP: ================================================
View:
  id: 2c96698a-0fdf-11ed-90d6-7ecc8fa70984:19818073
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
        0: 2c964753-0fdf-11ed-8f33-fb3ea59a5a33, ovh6.1check.com
=================================================
2022-08-07 12:59:24 14 [Note] WSREP: Non-primary view
2022-08-07 12:59:24 14 [Note] WSREP: Server status change connected -> connected
2022-08-07 12:59:24 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-08-07 12:59:24 58816207 [Warning] WSREP: Send action {(nil), 792, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-08-07 12:59:24 58816209 [Warning] WSREP: Send action {(nil), 776, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816290 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816330 [Warning] WSREP: Send action {(nil), 2640, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816365 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816409 [Warning] WSREP: Send action {(nil), 784, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816410 [Warning] WSREP: Send action {(nil), 445056, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816419 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816420 [Warning] WSREP: Send action {(nil), 622056, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816424 [Warning] WSREP: Send action {(nil), 354216, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816425 [Warning] WSREP: Send action {(nil), 587704, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816565 [Warning] WSREP: Send action {(nil), 848, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816569 [Warning] WSREP: Send action {(nil), 888, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816571 [Warning] WSREP: Send action {(nil), 1256, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816574 [Warning] WSREP: Send action {(nil), 2672, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816580 [Warning] WSREP: Send action {(nil), 1264, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816727 [Warning] WSREP: Send action {(nil), 776, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816731 [Warning] WSREP: Send action {(nil), 1384, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816751 [Warning] WSREP: Send action {(nil), 1384, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816755 [Warning] WSREP: Send action {(nil), 1264, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816763 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816769 [Warning] WSREP: Send action {(nil), 856, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816773 [Warning] WSREP: Send action {(nil), 776, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816767 [Warning] WSREP: Send action {(nil), 3549536, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816864 [Warning] WSREP: Send action {(nil), 3472, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816958 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816962 [Warning] WSREP: Send action {(nil), 584, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816966 [Warning] WSREP: Send action {(nil), 784, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816969 [Warning] WSREP: Send action {(nil), 888, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58794613 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58816991 [Warning] WSREP: Send action {(nil), 1272, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817066 [Warning] WSREP: Send action {(nil), 784, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817211 [Warning] WSREP: Send action {(nil), 888, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817224 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817229 [Warning] WSREP: Send action {(nil), 656, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817233 [Warning] WSREP: Send action {(nil), 776, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817331 [Warning] WSREP: Send action {(nil), 776, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817438 [Warning] WSREP: Send action {(nil), 912, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817548 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817645 [Warning] WSREP: Send action {(nil), 912, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817648 [Warning] WSREP: Send action {(nil), 784, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817691 [Warning] WSREP: Send action {(nil), 528, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817779 [Warning] WSREP: Send action {(nil), 555520, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:24 58817886 [Warning] WSREP: Send action {(nil), 848, WRITESET} returned -107 (Transport endpoint is not connected)
2022-08-07 12:59:25 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 129 rttvar: 64 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000635628 last_delivered_since: 3000635628 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:29 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 150 rttvar: 75 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000046360 last_delivered_since: 3000046360 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:33 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 144 rttvar: 72 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000007511 last_delivered_since: 3000007511 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:37 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 92 rttvar: 46 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000093147 last_delivered_since: 3000093147 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:41 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 104 rttvar: 52 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000125469 last_delivered_since: 3000125469 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:45 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 98 rttvar: 49 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000138902 last_delivered_since: 3000138902 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:49 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 107 rttvar: 53 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000076137 last_delivered_since: 3000076137 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:53 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 136 rttvar: 68 rto: 200000 lost: 0 last_data_recv: 3004 cwnd: 10 last_queued_since: 3000132517 last_delivered_since: 3000132517 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 12:59:57 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 71 rttvar: 35 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000305696 last_delivered_since: 3000305696 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:01 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 230 rttvar: 115 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000022184 last_delivered_since: 3000022184 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:05 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 102 rttvar: 51 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000129301 last_delivered_since: 3000129301 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:09 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 94 rttvar: 47 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000115407 last_delivered_since: 3000115407 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:13 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 80 rttvar: 40 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000233084 last_delivered_since: 3000233084 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:17 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 99 rttvar: 49 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000139482 last_delivered_since: 3000139482 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:21 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 90 rttvar: 45 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000107178 last_delivered_since: 3000107178 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:25 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 100 rttvar: 50 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000138070 last_delivered_since: 3000138070 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:29 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 114 rttvar: 57 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000134852 last_delivered_since: 3000134852 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:33 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 103 rttvar: 51 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000161454 last_delivered_since: 3000161454 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:37 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 106 rttvar: 53 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000170602 last_delivered_since: 3000170602 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:41 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 108 rttvar: 54 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000223534 last_delivered_since: 3000223534 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:45 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 102 rttvar: 51 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000219764 last_delivered_since: 3000219764 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:49 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 114 rttvar: 57 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000123103 last_delivered_since: 3000123103 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:53 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 98 rttvar: 49 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000112880 last_delivered_since: 3000112880 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:00:57 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 98 rttvar: 49 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000162084 last_delivered_since: 3000162084 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:01:01 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://192.168.0.1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 92 rttvar: 46 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000047362 last_delivered_since: 3000047362 send_queue_length: 0 send_queue_bytes: 0
2022-08-07 13:01:11 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') reconnecting to 368fe320-a19b (tcp://192.168.0.1:4567), attempt 30
2022-08-07 13:01:12 0 [Note] WSREP: (2c964753-8f33, 'tcp://0.0.0.0:4567') connection established to 368fe320-a19c tcp://192.168.0.1:4567
2022-08-07 13:01:12 0 [Note] WSREP: remote endpoint tcp://192.168.0.1:4567 changed identity 368fe320-11cc-11ed-a19b-3bac27a309d1 -> 368fe320-11cc-11ed-a19c-3bac27a309d1
2022-08-07 13:01:13 0 [Note] WSREP: declaring 368fe320-a19c at tcp://192.168.0.1:4567 stable
2022-08-07 13:01:13 0 [Note] WSREP: re-bootstrapping prim from partitioned components
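
The surviving node's log is dominated by repeated "Send action ... returned -107" warnings, logged after the node dropped to a non-primary view. A short script can tally them instead of reading each line. This is a minimal sketch under stated assumptions, not part of the original report: the function name `summarize_send_failures` is hypothetical, the number after the timestamp is presumed to be the connection/thread id, and the second field inside the braces is presumed to be the action size.

```python
import re
from collections import Counter

# Matches the warning shape seen in the log above, e.g.
# "... 58815930 [Warning] WSREP: Send action {(nil), 664, WRITESET} returned -107 (...)"
SEND_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<thread>\d+) "
    r"\[Warning\] WSREP: Send action \{\(nil\), (?P<size>\d+), WRITESET\} "
    r"returned (?P<code>-?\d+) \((?P<msg>.*)\)$"
)


def summarize_send_failures(lines):
    """Tally failed WSREP send actions by return code."""
    codes = Counter()
    total_size = 0
    threads = set()
    for raw in lines:
        m = SEND_RE.match(raw.strip())
        if not m:
            continue  # not a "Send action" warning
        codes[int(m.group("code"))] += 1
        total_size += int(m.group("size"))      # presumed action size
        threads.add(int(m.group("thread")))     # presumed connection/thread id
    return {
        "failures": sum(codes.values()),
        "by_code": dict(codes),
        "total_size": total_size,
        "distinct_threads": len(threads),
    }


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "2022-08-07 12:59:24 58815930 [Warning] WSREP: Send action "
    "{(nil), 664, WRITESET} returned -107 (Transport endpoint is not connected)",
    "2022-08-07 12:59:24 0 [Note] WSREP: Received NON-PRIMARY.",
    "2022-08-07 12:59:24 58815935 [Warning] WSREP: Send action "
    "{(nil), 1272, WRITESET} returned -107 (Transport endpoint is not connected)",
]

if __name__ == "__main__":
    print(summarize_send_failures(SAMPLE))
```

Feeding the full log file to the function (for example via `open(path)`) gives a quick count of how many client writes were rejected while the node was partitioned.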



 Comments   
Comment by Andrew Robinson [ 2022-08-08 ]

We are experiencing an identical issue on MariaDB 10.6.8 on Rocky Linux 8.6 with a 3-node Galera cluster.
Crashes happen randomly: sometimes 2-3 times a day, other times once every two days. This has been an issue since we upgraded from 10.3 through to 10.6.

Our logs are almost identical to the above, though most of our failing statements are INSERTs rather than UPDATEs.
See the example below.

mariadbd: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.6.8/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
220808 11:31:26 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.6.8-MariaDB-log
key_buffer_size=20971520
read_buffer_size=204800
max_used_connections=53
max_threads=502
thread_count=67
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1162025 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f3be0000c58
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f3c6c382c48 thread_stack 0x49000
??:0(my_print_stacktrace)[0x555f99c6011e]
??:0(handle_fatal_signal)[0x555f9973e745]
??:0(__restore_rt)[0x7f3ce4f0dce0]
:0(__GI_raise)[0x7f3ce4269a9f]
:0(__GI_abort)[0x7f3ce423ce05]
??:0(_nl_load_domain.cold.0)[0x7f3ce423ccd9]
:0(__GI___assert_fail)[0x7f3ce42623f6]
??:0(wsrep_bf_abort(THD*, THD*))[0x555f999e0c36]
??:0(wsrep_thd_bf_abort)[0x555f999e82d9]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x555f99a0a78e]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x555f993ccdb7]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x555f993cce21]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x555f99ac030f]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x555f99ac2721]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x555f99a10685]
??:0(handler::ha_write_row(unsigned char const*))[0x555f9974caef]
??:0(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x555f999dfa8c]
??:0(Wsrep_applier_service::apply_write_set(wsrep::ws_meta const&, wsrep::const_buffer const&, wsrep::mutable_buffer&))[0x555f999c6f14]
??:0(wsrep::server_state::start_streaming_applier(wsrep::id const&, wsrep::transaction_id const&, wsrep::high_priority_service*))[0x555f99cedbca]
??:0(wsrep::server_state::on_apply(wsrep::high_priority_service&, wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x555f99cee75d]
??:0(wsrep::wsrep_provider_v26::options[abi:cxx11]() const)[0x555f99d02264]
src/trx_handle.cpp:392(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*), wsrep_trx_meta const&, bool&))[0x7f3ce25a22cf]
src/replicator_smm.cpp:516(galera::ReplicatorSMM::apply_trx(void*, galera::TrxHandleSlave&))[0x7f3ce25b37d0]
src/replicator_smm.cpp:2154(galera::ReplicatorSMM::process_trx(void*, boost::shared_ptr<galera::TrxHandleSlave> const&))[0x7f3ce25b560e]
src/gcs_action_source.cpp:63(galera::GcsActionSource::process_writeset(void*, gcs_action const&, bool&))[0x7f3ce25e1f2b]
src/gcs_action_source.cpp:110(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&))[0x7f3ce25e20e2]
src/gcs_action_source.cpp:29(galera::GcsActionSource::process(void*, bool&))[0x7f3ce25e2bb1]
src/replicator_smm.cpp:402(galera::ReplicatorSMM::async_recv(void*))[0x7f3ce25b6590]
src/wsrep_provider.cpp:290(galera_recv)[0x7f3ce25918f8]
??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x555f99d0303e]
??:0(wsrep_fire_rollbacker)[0x555f999e19a3]
??:0(start_wsrep_THD(void*))[0x555f999d1dcc]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x555f9995f20d]
??:0(start_thread)[0x7f3ce4f031cf]
:0(__GI___clone)[0x7f3ce4254dd3]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f3cda4d150b): INSERT INTO call_legs_72 SET call_id='31055314', calldate='2022-08-08', start_time='10:32:18', duration_hr='0000', duration_min='00', duration_sec='05', calling=TRIM('X9999'), called=TRIM('T7221'), ans='', ans_time=TRIM('****'), digits_dialed='6126904782 #61222221127', digits_actual='#61222221127', ani=TRIM('6126904782'), dnis=TRIM('1744171127'), trans_conf=TRIM(''), extn=TRIM('T7221'), ehdu_in='', ehdu_out='', third_party=TRIM(''), call_log_id=TRIM('Y0015536'), seq_id=TRIM('E'), assoc_log_id=TRIM(''), raw_id='107015854', sysid='001', leg='0', call_start='2022-08-08 10:32:18', call_end='2022-08-08 10:32:23', call_start_utc='2022-08-08 00:32:18', call_end_utc='2022-08-08 00:32:23'
 
Connection ID (thread ID): 15
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /database
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             14785                14785                processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       14785                14785                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e

I can't seem to reproduce it manually in any way I have tried. Happy to assist with testing any suggestions.

Comment by Joel Davis [ 2022-08-08 ]

I seem to be having the same issue every few days.
MariaDB 10.6.8 on Rocky Linux 8.6 with 3x Node Galera Cluster.

220804 12:26:30 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.8-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=48
max_threads=153
thread_count=39
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467959 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
??:0(my_print_stacktrace)[0x5578422ce11e]
??:0(handle_fatal_signal)[0x557841dac745]
??:0(__restore_rt)[0x7f9cc5e1ece0]
:0(__GI_raise)[0x7f9cc517aa9f]
:0(__GI_abort)[0x7f9cc514de05]
/lib64/libstdc++.so.6(+0x9009b)[0x7f9cc590309b]
??:0(std::rethrow_exception(std::__exception_ptr::exception_ptr))[0x7f9cc590953c]
??:0(std::terminate())[0x7f9cc5909597]
??:0(__cxa_throw)[0x7f9cc59097f8]
boost/throw_exception.hpp:69(void boost::throw_exception<std::system_error>(std::system_error const&))[0x7f9cc36190d6]
impl/throw_error.ipp:49(asio::detail::do_throw_error(std::error_code const&, char const*) [clone .isra.265])[0x7f9cc3624a34]
asio/error.hpp:228(asio::basic_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >::remote_endpoint() const)[0x7f9cc362d1d5]
impl/endpoint.ipp:114(gu::AsioStreamReact::assign_addresses())[0x7f9cc3627569]
src/gu_asio_stream_react.cpp:890(gu::AsioAcceptorReact::accept_handler(std::shared_ptr<gu::AsioStreamReact> const&, std::shared_ptr<gu::AsioAcceptorHandler> const&, std::error_code const&))[0x7f9cc3627c99]
detail/gcc_x86_fenced_block.hpp:80(asio::detail::reactive_socket_accept_op<asio::basic_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, asio::ip::tcp, boost::_bi::bind_t<void, boost::_mfi::mf3<void, gu::AsioAcceptorReact, std::shared_ptr<gu::AsioStreamReact> const&, std::shared_ptr<gu::AsioAcceptorHandler> const&, std::error_code const&>, boost::_bi::list4<boost::_bi::value<std::shared_ptr<gu::AsioAcceptorReact> >, boost::_bi::value<std::shared_ptr<gu::AsioStreamReact> >, boost::_bi::val
impl/epoll_reactor.ipp:653(asio::detail::epoll_reactor::descriptor_state::do_complete(asio::detail::task_io_service*, asio::detail::task_io_service_operation*, std::error_code const&, unsigned long))[0x7f9cc36203c6]
impl/task_io_service.ipp:367(asio::detail::task_io_service::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::task_io_service_thread_info&, std::error_code const&))[0x7f9cc361882e]
impl/task_io_service.ipp:148(gu::AsioIoService::run())[0x7f9cc36125b1]
src/asio_protonet.cpp:104(gcomm::AsioProtonet::event_loop(gu::datetime::Period const&))[0x7f9cc35460a1]
src/gu_threads.h:187(gu_mutex_lock_SYS)[0x7f9cc352c44f]
src/gu_threads.h:105(gu_thread_exit)[0x7f9cc352cae6]
??:0(start_thread)[0x7f9cc5e141cf]
:0(__GI___clone)[0x7f9cc5165dd3]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing

I hope this helps.
Cheers,

Comment by Andrew Robinson [ 2022-08-08 ]

I have enabled core_file on all three of my servers. At the next event I will upload the core dump file to hopefully aid further analysis.

Comment by Andrew Robinson [ 2022-08-08 ]

I have a core file extract and the binary. I attempted to analyse them with gdb myself but wasn't overly familiar with the process.
Sharing them here; they were too large to upload directly, so I have placed them on Dropbox.
https://www.dropbox.com/s/yi85b813fst0rh5/mariadbd-core.zip?dl=0

Comment by Andrew Robinson [ 2022-08-08 ]

gdb_output.log
Pulled a full backtrace from the core files per documented instructions (attached). Hopefully this helps also.

Log output specifically for the event where the core was dumped:

mariadbd: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.6.8/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
220808 16:47:58 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.6.8-MariaDB-log
key_buffer_size=20971520
read_buffer_size=204800
max_used_connections=10
max_threads=502
thread_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1162025 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fde48002098
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fddcc423c48 thread_stack 0x49000
??:0(my_print_stacktrace)[0x55ca2b85b11e]
??:0(handle_fatal_signal)[0x55ca2b339745]
??:0(__restore_rt)[0x7fde736d8ce0]
:0(__GI_raise)[0x7fde72a34a9f]
:0(__GI_abort)[0x7fde72a07e05]
??:0(_nl_load_domain.cold.0)[0x7fde72a07cd9]
:0(__GI___assert_fail)[0x7fde72a2d3f6]
??:0(wsrep_bf_abort(THD*, THD*))[0x55ca2b5dbc36]
??:0(wsrep_thd_bf_abort)[0x55ca2b5e32d9]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55ca2b60578e]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55ca2afc7db7]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55ca2afc7e21]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x55ca2b6bb30f]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x55ca2b6bd721]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55ca2b60b685]
??:0(handler::ha_write_row(unsigned char const*))[0x55ca2b347aef]
??:0(Rows_log_event::write_row(rpl_group_info*, bool))[0x55ca2b445c65]
??:0(Write_rows_log_event::do_exec_row(rpl_group_info*))[0x55ca2b4462ad]
??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x55ca2b43b31f]
??:0(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x55ca2b5daa8c]
??:0(Wsrep_applier_service::apply_write_set(wsrep::ws_meta const&, wsrep::const_buffer const&, wsrep::mutable_buffer&))[0x55ca2b5c1f14]
??:0(wsrep::server_state::start_streaming_applier(wsrep::id const&, wsrep::transaction_id const&, wsrep::high_priority_service*))[0x55ca2b8e8bca]
??:0(wsrep::server_state::on_apply(wsrep::high_priority_service&, wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x55ca2b8e975d]
??:0(wsrep::wsrep_provider_v26::options[abi:cxx11]() const)[0x55ca2b8fd264]
src/trx_handle.cpp:392(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*), wsrep_trx_meta const&, bool&))[0x7fde70d6d2cf]
src/replicator_smm.cpp:516(galera::ReplicatorSMM::apply_trx(void*, galera::TrxHandleSlave&))[0x7fde70d7e7d0]
src/replicator_smm.cpp:2154(galera::ReplicatorSMM::process_trx(void*, boost::shared_ptr<galera::TrxHandleSlave> const&))[0x7fde70d8060e]
src/gcs_action_source.cpp:63(galera::GcsActionSource::process_writeset(void*, gcs_action const&, bool&))[0x7fde70dacf2b]
src/gcs_action_source.cpp:110(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&))[0x7fde70dad0e2]
src/gcs_action_source.cpp:29(galera::GcsActionSource::process(void*, bool&))[0x7fde70dadbb1]
src/replicator_smm.cpp:402(galera::ReplicatorSMM::async_recv(void*))[0x7fde70d81590]
src/wsrep_provider.cpp:290(galera_recv)[0x7fde70d5c8f8]
??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x55ca2b8fe03e]
??:0(wsrep_fire_rollbacker)[0x55ca2b5dc9a3]
??:0(start_wsrep_THD(void*))[0x55ca2b5ccdcc]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55ca2b55a20d]
??:0(start_thread)[0x7fde736ce1cf]
:0(__GI___clone)[0x7fde72a1fdd3]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fde65eabb7b): INSERT INTO call_legs_220 SET call_id='14743911', calldate='2022-08-08', start_time='16:40:44', duration_hr='0000', duration_min='05', duration_sec='33', calling=TRIM('X9999'), called=TRIM('79408'), ans='', ans_time=TRIM('0009'), digits_dialed='0417127123 55*50779473', digits_actual='55*50779473', ani=TRIM('0417127123'), dnis=TRIM('0749839400'), trans_conf=TRIM(''), extn=TRIM('79408'), ehdu_in='', ehdu_out='', third_party=TRIM(''), call_log_id=TRIM('W3010770'), seq_id=TRIM('B'), assoc_log_id=TRIM(''), raw_id='49697110', sysid='307', leg='0', call_start='2022-08-08 16:40:44', call_end='2022-08-08 16:46:17', call_start_utc='2022-08-08 06:40:44', call_end_utc='2022-08-08 06:46:17'
 
Connection ID (thread ID): 15
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /database
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             14785                14785                processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       14785                14785                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
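
The "Resource Limits" tables in the two dumps above differ in "Max core file size" (soft limit 0 in the earlier dump, unlimited in this one). A minimal Python sketch for pulling such a table into a dictionary so the limits can be compared between reports; the function name parse_limits and the sample text are illustrative only, with rows copied from the tables in this report:

```python
import re

# Sample rows copied from the "Resource Limits" tables in this report.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max nice priority         0                    0
"""


def parse_limits(text):
    """Return {limit name: (soft, hard, units)} for each row of the table."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 3 or cols[0] == "Limit":
            continue  # skip the header row and anything that is not a table row
        units = cols[3] if len(cols) > 3 else ""
        limits[cols[0]] = (cols[1], cols[2], units)
    return limits


limits = parse_limits(SAMPLE)
soft, hard, units = limits["Max core file size"]
print(f"Max core file size: soft={soft} hard={hard} {units}")
```

Feeding it the table from each crash log makes it easy to see which nodes were actually able to write a core file at the time of the crash.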

Comment by COUNOTTE CEDRIC [ 2022-08-09 ]

FWIW, the server is crashing nearly every day for us. I noticed, however, that it's always a similar query that's running when the crash happens (and possibly causing it). I'm not sure yet whether it's always within the same DB (we have 350+ DBs), but it's always the same table within one of the DBs. I've updated our code to identify the DB in the query the next time it happens.

In the meantime, I've also launched an OPTIMIZE TABLE <concerned_table> for all DBs on all nodes in the hope that it might resolve this. Maybe that particular table is corrupt?
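
For anyone wanting to try the same thing, a rough Python sketch of generating one OPTIMIZE TABLE statement per database. The database names and the table name below are placeholders, not the real ones, and the statements are only printed, not executed:

```python
def optimize_statements(databases, table):
    """Build one OPTIMIZE TABLE statement per database for the given table."""

    def quote(identifier):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    return [f"OPTIMIZE TABLE {quote(db)}.{quote(table)};" for db in databases]


# Hypothetical database names; "concerned_table" stands in for the real table.
for stmt in optimize_statements(["db_one", "db_two"], "concerned_table"):
    print(stmt)
```

The printed statements can then be reviewed and run on each node with whatever client is already in use.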

Comment by Andrew Robinson [ 2022-08-12 ]

Ours were crashing every 3-4 hours during weekday business hours.
I disabled Galera and ran on a single 10.6.8 server, which was solid for 24 hours; this seems to confirm the problem is related to Galera/replication.

Ultimately decided to roll back to a 10.3 backup image and replicate data changes since the upgrade into the recovered server, then added back in 10.3 cluster members.
Not a single crash after 36 hours in a 3x node cluster on 10.3. I guess I will have to wait for a fix and try an upgrade again in some months.

Comment by COUNOTTE CEDRIC [ 2022-08-12 ]

I feel lucky as we've not experienced any crash for 3 days. Might this assert on is_streaming() be network related? Either way, it's clearly Galera related.
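
To check that separate reports hit the same assertion even when the build paths differ, the assertion line can be reduced to a small signature. A minimal Python sketch, assuming the log line has the shape shown in the crash logs in this report; the names ASSERT_RE and signature are illustrative:

```python
import posixpath
import re

# Assertion line as it appears in the 10.6.7 crash logs in this report.
LINE = ("mariadbd: ./wsrep-lib/include/wsrep/client_state.hpp:668: "
        "int wsrep::client_state::bf_abort(wsrep::seqno): "
        "Assertion `mode_ == m_local || transaction_.is_streaming()' failed.")

ASSERT_RE = re.compile(
    r"^(?P<prog>[^:]+): (?P<file>\S+):(?P<line>\d+): (?P<func>.+): "
    r"Assertion `(?P<expr>.+)' failed\.$"
)


def signature(log_line):
    """Return (source file name, line number, asserted expression), or None."""
    match = ASSERT_RE.match(log_line.strip())
    if match is None:
        return None
    # Keep only the file name so different build directories compare equal.
    name = posixpath.basename(match.group("file"))
    return (name, int(match.group("line")), match.group("expr"))


print(signature(LINE))
```

Two log lines that differ only in the directory part of the source path produce the same tuple, which makes it easier to group reports by the assertion they hit.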

Even luckier, as I tried reverting to 10.3 and 10.4, but the mariabackup used to bootstrap replication would not work, and we can't stop our service to dump/restore the DB.

Still, such instability and the lack of concern are very worrying.

Comment by COUNOTTE CEDRIC [ 2022-08-16 ]

New server crash today, which resulted in the server hanging, while the remaining 2 nodes kept serving queries.

mariadbd: ./wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
220816 10:45:24 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.6.7-MariaDB-2ubuntu1.1-log
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=100
max_threads=5002
thread_count=114
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 11276717 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7faf48000c68
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fc5b415aca8 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x564af1bc2702]
??:0(my_print_stacktrace)[0x564af16fd4d8]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc5f74e0520]
??:0(__sigaction)[0x7fc5f7534a7c]
??:0(raise)[0x7fc5f74e0476]
??:0(abort)[0x7fc5f74c67f3]
/lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fc5f74c671b]
??:0(__assert_fail)[0x7fc5f74d7e96]
/usr/sbin/mariadbd(_Z14wsrep_bf_abortP3THDS0_+0x607)[0x564af19a74c7]
??:0(wsrep_bf_abort(THD*, THD*))[0x564af19ad5ad]
??:0(wsrep_thd_bf_abort)[0x564af19ca8a2]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x564af1388827]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x564af1388979]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x564af1a596c5]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x564af1a5aa0e]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x564af1a5c2d1]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x564af1a8b771]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x564af1a8c14f]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x564af1a6cd30]
??:0(void std::this_thread::sleep_for<long, std::ratio<1l, 1l> >(std::chrono::duration<long, std::ratio<1l, 1l> > const&))[0x564af19ceb74]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x564af170adaa]
??:0(handler::ha_update_row(unsigned char const*, unsigned char const*))[0x564af18201ba]
??:0(Update_rows_log_event::do_exec_row(rpl_group_info*))[0x564af1813b87]
??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x564af19a5470]
??:0(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x564af198cfe0]
??:0(Wsrep_high_priority_service::remove_fragments(wsrep::ws_meta const&))[0x564af198de26]
??:0(wsrep::server_state::start_streaming_applier(wsrep::id const&, wsrep::transaction_id const&, wsrep::high_priority_service*))[0x564af1c37efb]
??:0(wsrep::wsrep_provider_v26::options[abi:cxx11]() const)[0x564af1c4a72e]
/usr/lib/galera/libgalera_smm.so(+0x53b14)[0x7fc5e6004b14]
/usr/lib/galera/libgalera_smm.so(+0x5bb55)[0x7fc5e600cb55]
/usr/lib/galera/libgalera_smm.so(+0x67ba8)[0x7fc5e6018ba8]
src/trx_handle.cpp:396(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*), wsrep_trx_meta const&, bool&))[0x7fc5e6038595]
/usr/lib/galera/libgalera_smm.so(+0x5fcd0)[0x7fc5e6010cd0]
/usr/lib/galera/libgalera_smm.so(+0x463c1)[0x7fc5e5ff73c1]
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x12)[0x564af1c4add2]
??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x564af19a7577]
??:0(wsrep_bf_abort(THD*, THD*))[0x564af19985a3]
??:0(start_wsrep_THD(void*))[0x564af1927386]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7fc5f7532b43]
??:0(pthread_condattr_setpshared)[0x7fc5f75c4a00]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fc5dd4772ab): UPDATE _1check_RFR003007.incidents set status_last_update='2022-08-16 10:45:23'
                                                , status_changed_users_id='51'
                                                , status_id=9
                                            where incidents_id=1062
 
Connection ID (thread ID): 12
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             514397               514397               processes
Max open files            1048576              1048576              files
Max locked memory         524288               524288               bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514397               514397               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E

It would be nice if this bug was taken care of asap.

Comment by COUNOTTE CEDRIC [ 2022-08-31 ]

Since I upgraded to 10.6.9, the servers haven't crashed; however, things got worse because now the whole cluster gets stuck, as described here: https://jira.mariadb.org/browse/MDEV-29388.

Generated at Thu Feb 08 10:07:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.