[MDEV-32363] when InnoDB gets an assertion failure, WSREP layer is not handled gracefully Created: 2023-07-14  Updated: 2024-02-07

Status: Needs Feedback
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.6.12
Fix Version/s: 10.6

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Rick Pizzi
Resolution: Unresolved Votes: 5
Labels: None

Attachments: File galera_crash_node.test     File logs.tar.gz    

 Description   

If InnoDB hits an assertion failure, the WSREP layer is not notified immediately, and as a consequence all nodes lose primary status.

You can see below that WSREP messages are interleaved with the assertion stacktrace, and that WSREP keeps trying to reconnect to its peers even though the node has effectively crashed and will in fact die shortly.

When this happens, the entire cluster goes non-primary and a cluster bootstrap is needed to recover. That is not what we expect from this situation: the crashed node should simply be evicted and the cluster should continue normally.

2023-07-07 15:50:19 11985573 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2023-07-07 15:50:19 11985573 [ERROR] mariadbd: Index for table 'failed_table' is corrupt; try to repair it
2023-07-07 15:50:20 0x7f17f7766700  InnoDB: Assertion failure in file /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/page/page0zip.cc line 4213
InnoDB: Failing assertion: slot_rec
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
230707 15:50:20 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.12-7-MariaDB-enterprise-log source revision: 8e2b75dad28995ab5f6e6acd436135420f7031c9
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=2243
max_threads=6002
thread_count=1565
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13479553 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7ef735ab51c8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f17f7765cb8 thread_stack 0x49000
Can't start addr2line
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5608f6116c7e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5608f5bc33a5]
/lib64/libpthread.so.0(+0xf630)[0x7f2011c15630]
/lib64/libc.so.6(gsignal+0x37)[0x7f2011060387]
/lib64/libc.so.6(abort+0x148)[0x7f2011061a78]
/usr/sbin/mariadbd(+0x694d97)[0x5608f5834d97]
/usr/sbin/mariadbd(+0xdbfb05)[0x5608f5f5fb05]
/usr/sbin/mariadbd(+0xdaf516)[0x5608f5f4f516]
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 766 rttvar: 579 rto: 201000 lost: 0 last_data_recv: 2567 cwnd: 10 last_queued_since: 8776161264 last_delivered_since: 11959172679 send_queue_length: 9 send_queue_bytes: 720 segment: 0 messages: 9
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.103:4567
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 6407 rttvar: 10277 rto: 207000 lost: 0 last_data_recv: 5900 cwnd: 10 last_queued_since: 307819 last_delivered_since: 8781038225 send_queue_length: 10 send_queue_bytes: 1080 segment: 0 messages: 10
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.104:42156
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 5591 rttvar: 10094 rto: 206000 lost: 0 last_data_recv: 5924 cwnd: 10 last_queued_since: 10916 last_delivered_since: 8781705783 send_queue_length: 11 send_queue_bytes: 1292 segment: 0 messages: 11
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.101:33510
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.1.101:4567 tcp://10.10.1.103:4567 tcp://10.10.1.104:4567 
2023-07-07 15:50:32 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT12.0655S), skipping check
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7f17e8905e58
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7ef49da77b98
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
/usr/sbin/mariadbd(+0xe62a65)[0x5608f6002a65]
2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 96c49f4b-8727 (tcp://10.10.1.101:4567), attempt 0
2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 844de70f-8aaf (tcp://10.10.1.103:4567), attempt 0
2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 5dda822d-b4c3 (tcp://10.10.1.104:4567), attempt 0
2023-07-07 15:50:33 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.73771S), skipping check
/usr/sbin/mariadbd(+0xe4e639)[0x5608f5fee639]
/usr/sbin/mariadbd(+0xe5063b)[0x5608f5ff063b]
/usr/sbin/mariadbd(+0xe62e98)[0x5608f6002e98]
/usr/sbin/mariadbd(+0xde0227)[0x5608f5f80227]
/usr/sbin/mariadbd(+0xde2da4)[0x5608f5f82da4]
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 1000 rttvar: 500 rto: 201000 lost: 0 last_data_recv: 125408244 cwnd: 10 last_queued_since: 4421911460 last_delivered_since: 8715642835357354 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 359 rttvar: 179 rto: 201000 lost: 0 last_data_recv: 7528 cwnd: 10 last_queued_since: 120133 last_delivered_since: 120133 send_queue_length: 0 send_queue_bytes: 0
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 437 rttvar: 218 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 4131087 last_delivered_since: 4131087 send_queue_length: 0 send_queue_bytes: 0
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 181 rttvar: 90 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 8715642839939685 last_delivered_since: 8715642839939685 send_queue_length: 0 send_queue_bytes: 0
2023-07-07 15:50:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.53388S), skipping check
/usr/sbin/mariadbd(+0xe151ab)[0x5608f5fb51ab]
2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer handle_wait Success for 0x7f17ebc5b168
2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer destruct
2023-07-07 15:50:43 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.61329S), skipping check
/usr/sbin/mariadbd(+0xe15869)[0x5608f5fb5869]
2023-07-07 15:50:44 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.63124S), skipping check
2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 5dda822d-b4c3
2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 844de70f-8aaf
2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 96c49f4b-8727
/usr/sbin/mariadbd(+0xdf28b2)[0x5608f5f928b2]
/usr/sbin/mariadbd(+0xd43ca8)[0x5608f5ee3ca8]
/usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x232)[0x5608f5bd12b2]
/usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1a63)[0x5608f5a5cf33]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x263e)[0x5608f597d38e]
/usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
/usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
/usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
/usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125424590 cwnd: 10 last_queued_since: 11598743653 last_delivered_since: 8715659181935888 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 185 rttvar: 82 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11598929590 last_delivered_since: 11598962474 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 264 rttvar: 105 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11599134047 last_delivered_since: 11599140957 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 384 rttvar: 151 rto: 201000 lost: 0 last_data_recv: 5678 cwnd: 10 last_queued_since: 11599486516 last_delivered_since: 11599494477 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
/usr/sbin/mariadbd(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x66a)[0x5608f58d093a]
/usr/sbin/mariadbd(+0x7cfc17)[0x5608f596fc17]
/usr/sbin/mariadbd(+0x7d3a68)[0x5608f5973a68]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x101a)[0x5608f597bd6a]
2023-07-07 15:51:04 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125430994 cwnd: 10 last_queued_since: 221024 last_delivered_since: 8715665585967344 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:51:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT18.0029S), skipping check
2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,5dda822d-b4c3,50) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
	5dda822d-b4c3,0
	844de70f-8aaf,0
	96c49f4b-8727,0
})
2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,51) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
	5dda822d-b4c3,0
	844de70f-8aaf,0
	96c49f4b-8727,0
})
2023-07-07 15:51:04 11994604 [Warning] WSREP: Send action {(nil), 139603616989752, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-07 15:51:04 0 [Note] WSREP: Flow-control interval: [240, 300]
2023-07-07 15:51:04 0 [Note] WSREP: Received NON-PRIMARY.
2023-07-07 15:51:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6204240577)
2023-07-07 15:51:04 11955214 [Warning] WSREP: Send action {(nil), 139599322023456, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11985855 [Warning] WSREP: Send action {(nil), 139603616990584, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11987046 [Warning] WSREP: Send action {(nil), 139599322023328, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11985820 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 46 [Note] WSREP: ================================================
View:
  id: c3a51458-b6fd-11eb-8a80-eb35c100e72c:6204240577
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 6c357751-ce4f-11ed-8d5f-136e7094748b, PIXID-MDB-MASTER2
=================================================
2023-07-07 15:51:04 46 [Note] WSREP: Non-primary view
2023-07-07 15:51:04 46 [Note] WSREP: Server status change synced -> connected
2023-07-07 15:51:04 46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-07-07 15:51:04 11997537 [Warning] WSREP: Send action {(nil), 139603616989760, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11996647 [Warning] WSREP: Send action {(nil), 139573552218680, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11986459 [Warning] WSREP: Send action {(nil), 139736760976944, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11997986 [Warning] WSREP: Send action {(nil), 139599322023552, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11985505 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11988311 [Warning] WSREP: Send action {(nil), 139607911957872, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11917365 [Warning] WSREP: Send action {(nil), 139586437121400, WRITESET} returned -107 (Transport endpoint is not connected)
/usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
2023-07-07 15:51:06 11985895 [Warning] WSREP: Send action {(nil), 139590732088096, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-07 15:51:06 0 [Note] WSREP: Flow-control interval: [240, 300]
2023-07-07 15:51:06 0 [Note] WSREP: Received NON-PRIMARY.
2023-07-07 15:51:06 11978506 [Warning] WSREP: Send action {(nil), 139599322023472, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11997530 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988302 [Warning] WSREP: Send action {(nil), 139599322023760, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988512 [Warning] WSREP: Send action {(nil), 139736760977344, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988290 [Warning] WSREP: Send action {(nil), 139595027055344, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11998006 [Warning] WSREP: Send action {(nil), 139599322023888, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988294 [Warning] WSREP: Send action {(nil), 139603616990632, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11991973 [Warning] WSREP: Send action {(nil), 139599322023712, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988288 [Warning] WSREP: Send action {(nil), 139595027057080, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11985860 [Warning] WSREP: Send action {(nil), 139577847186808, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11997914 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11987069 [Warning] WSREP: Send action {(nil), 139599322023280, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988354 [Warning] WSREP: Send action {(nil), 139736760976760, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11986489 [Warning] WSREP: Send action {(nil), 139564962285120, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11986139 [Warning] WSREP: Send action {(nil), 139582142155592, WRITESET} returned -107 (Transport endpoint is not connected)
/usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
2023-07-07 15:51:07 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.43448S), skipping check
/usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
/usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
2023-07-07 15:51:10 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.93087S), skipping check
/usr/sbin/mariadbd(_ZN7sp_head15execute_triggerEP3THDPK25st_mysql_const_lex_stringS4_P13st_grant_info+0x1df)[0x5608f58d008f]
2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 5dda822d-b4c3 tcp://10.10.1.104:4567
2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 844de70f-8aaf tcp://10.10.1.103:4567
/usr/sbin/mariadbd(_ZN19Table_triggers_list16process_triggersEP3THD14trg_event_type20trg_action_time_typeb+0x104)[0x5608f5a40ec4]
2023-07-07 15:51:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.87478S), skipping check
/usr/sbin/mariadbd(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xd99)[0x5608f5d33da9]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x24c8)[0x5608f597d218]
2023-07-07 15:51:16 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.90666S), skipping check
/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a]
/usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531]
2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567
2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check
/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31]
2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired
evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) {
current_view=view(view_id(REG,6c357751-8d5f,51) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} },
fifo_seq=1874639154,
last_sent=2,
known:
5dda822d-b4c3 at tcp://10.10.1.104:4567
{o=1,s=0,i=0,fs=834192942,jm=
{v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=(
	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
)
},
}
6c357751-8d5f at 
{o=1,s=0,i=0,fs=-1,jm=
{v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=(
	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],}
)
},
}
844de70f-8aaf at tcp://10.10.1.103:4567
{o=1,s=0,i=0,fs=1475544355,jm=
{v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=(
	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
)
},
}
96c49f4b-8727 at tcp://10.10.1.101:4567
{o=0,s=0,i=0,fs=101154494,}
 }
2023-07-07 15:51:20 0 [Note] WSREP: no install message received
2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
	5dda822d-b4c3,0
	844de70f-8aaf,0
	96c49f4b-8727,0
})
2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300]
2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY.
/usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942]
/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7]
2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check
/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d]
2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected)
/usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2]
/lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5]
/lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7ef692470578): UPDATE failed_schema.failed_table
			SET TYPE_EVE =  NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE =  NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL  WHERE EVT_ID =  NAME_CONST('V_EVT_ID',2051583478)
 
Connection ID (thread ID): 11985593
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /data/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             805978               805978               processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       805978               805978               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core



 Comments   
Comment by Jan Lindström [ 2023-08-04 ]

rpizzi Can we have the error logs from the other nodes, to determine why they could not continue as a cluster? For the node that is crashing itself, there is nothing we can do.

Comment by Rick Pizzi [ 2023-08-18 ]

Can't you just signal the WSREP threads immediately?

Comment by Jan Lindström [ 2023-08-22 ]

rpizzi Very well, the other nodes should not drop from the cluster even if one of the nodes crashes. Do you have error logs from the other nodes? I'm looking for the reason why they dropped from the cluster.

Comment by Rick Pizzi [ 2023-08-22 ]

As I already explained, this was on a production system and logs are long gone.
I guess we can only wait for another occurrence of the issue.

Comment by Ramesh Sivaraman [ 2023-08-24 ]

janlindstrom Reproduced the cluster inconsistency using an RQG data load. The active nodes become unstable when one of the nodes in the cluster is forcefully killed while the RQG data load is active.
The cluster became inconsistent, although the server did not crash as described in the issue description.
Test case:
1) Started a 3-node cluster
2) Initiated an RQG run on node1 and node2
3) Forcefully killed node2

Node1 is disconnected from the cluster and Node3 loses its primary status. Error logs from the cluster are in the attached logs.tar.gz

Node1

node1:root@localhost> show status like '%wsrep%st%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | 00000000-0000-0000-0000-000000000000 |
| wsrep_last_committed         | -1                                   |
| wsrep_flow_control_requested | false                                |
| wsrep_cert_deps_distance     | 41.6348                              |
| wsrep_local_state            | 5                                    |
| wsrep_local_state_comment    | Inconsistent                         |
| wsrep_cluster_capabilities   |                                      |
| wsrep_cluster_conf_id        | 18446744073709551615                 |
| wsrep_cluster_size           | 0                                    |
| wsrep_cluster_state_uuid     | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
| wsrep_cluster_status         | Disconnected                         |
+------------------------------+--------------------------------------+
11 rows in set (0.001 sec)

Node3

node3:root@localhost> show status like '%wsrep%st%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
| wsrep_last_committed         | 19996                                |
| wsrep_flow_control_requested | false                                |
| wsrep_cert_deps_distance     | 26.4544                              |
| wsrep_local_state            | 0                                    |
| wsrep_local_state_comment    | Initialized                          |
| wsrep_cluster_weight         | 0                                    |
| wsrep_evs_evict_list         |                                      |
| wsrep_evs_state              | OPERATIONAL                          |
| wsrep_gmcast_segment         | 0                                    |
| wsrep_cluster_capabilities   |                                      |
| wsrep_cluster_conf_id        | 18446744073709551615                 |
| wsrep_cluster_size           | 1                                    |
| wsrep_cluster_state_uuid     | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
| wsrep_cluster_status         | non-Primary                          |
+------------------------------+--------------------------------------+
15 rows in set (0.001 sec)
 
node3:root@localhost> 

Comment by Jan Lindström [ 2023-08-24 ]

Looked at the error logs: node_1 drops from the cluster because the applier gets an error and then initiates error voting:

2023-08-21 13:32:24 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.table30_int_autoinc; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 242, Internal MariaDB error code: 1213
2023-08-21 13:32:24 0 [Note] WSREP: Member 0(galapq) initiates vote on e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939,89ae2f2481c15ba0:  Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213;
2023-08-21 13:32:24 8 [Note] WSREP: wsrep_before_commit: 1, 4949
2023-08-21 13:32:24 6 [Note] WSREP: wsrep_commit_empty for 6 client_state exec client_mode high priority trans_state executing sql NULL
2023-08-21 13:32:24 7 [Note] WSREP: wsrep_before_commit: 1, 4947
2023-08-21 13:32:24 0 [Note] WSREP: Votes over e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939:
   0000000000000000:   2/3
   89ae2f2481c15ba0:   1/3
Winner: 0000000000000000
2023-08-21 13:32:24 9 [Note] WSREP: assigned new next trx id: 15048
2023-08-21 13:32:24 6 [Note] WSREP: assigned new next trx id: 15049
2023-08-21 13:32:24 2 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939
	 at /test/galera_4x_opt/galera/src/replicator_smm.cpp:process_apply_error():1357

The last node leaves the cluster because its weight is not big enough.

Comment by Sergei Golubchik [ 2023-08-25 ]

Why can this cause the nodes to no longer be consistent?

Comment by Sergei Golubchik [ 2023-08-31 ]

Why would the other nodes be out of sync with each other? They both received the write set; they certify and apply it. Where is the inconsistency here?

Comment by Sergei Golubchik [ 2023-09-04 ]

janlindstrom, but this means that a crash of one node can make the whole cluster unusable; where's the HA in that?

Maybe node_2 shouldn't apply a write set until all nodes have received it. Maybe node_3 can get it from node_2. But it has to be fixed somehow; otherwise I don't know how one can claim that Galera cluster provides HA.

Comment by Jan Lindström [ 2023-09-04 ]

serg I think I need to dig deeper, because you are correct: the other nodes should be able to continue normally.

Comment by Jan Lindström [ 2023-09-06 ]

rpizzi I tried to reproduce this on 10.6 with a 3-node cluster, using a simple database with 100k rows, 2 connections doing inserts, and 2 connections doing updates. From another connection I then triggered a crash inside InnoDB ::write_row() on node_2. The remaining nodes, node_1 and node_3, stayed in Primary state. Is there something special about the node configuration I should know?

Crash instrumentation:

 jan@jan-HP-ZBook-15u-G5:~/work/mariadb/10.6$ git diff
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index b440613c13f..e6b90f02279 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -7844,6 +7844,12 @@ ha_innobase::write_row(
 
        trx_t*          trx = thd_to_trx(m_user_thd);
 
+#ifdef WITH_WSREP
+        DBUG_EXECUTE_IF("wsrep_force_assert",
+                       assert(0);
+       );
+#endif
+
        /* Validation checks before we commence write_row operation. */
        if (is_read_only()) {
                DBUG_RETURN(HA_ERR_TABLE_READONLY);

How to enable it:

SET debug_dbug = '+d,wsrep_force_assert'; call insert_t1(2000);
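For readers unfamiliar with the mechanism in the diff above: DBUG_EXECUTE_IF only runs its body when the given keyword has been switched on at runtime via debug_dbug. A minimal sketch of that gating, with a plain set standing in for the real dbug machinery (function names here are illustrative, not MariaDB's dbug API):

```python
# Illustrative sketch of DBUG_EXECUTE_IF-style fault injection.
_debug_keywords: set[str] = set()

def set_debug(flag: str) -> None:
    """Mimics SET debug_dbug = '+d,<keyword>' / '-d,<keyword>'."""
    op, keyword = flag[0], flag[3:]  # e.g. "+d,wsrep_force_assert"
    if op == "+":
        _debug_keywords.add(keyword)
    else:
        _debug_keywords.discard(keyword)

def dbug_execute_if(keyword: str, action) -> None:
    """Run the injected action only when its keyword is enabled."""
    if keyword in _debug_keywords:
        action()

# write_row() above effectively does:
#   dbug_execute_if("wsrep_force_assert", lambda: abort())
# so the assertion fires only in sessions that enabled the keyword.
```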

Comment by Rick Pizzi [ 2023-09-06 ]

The only thing that comes to mind is that this is a 4-node cluster with node 4 having pc.weight=0; not sure this makes any difference.
Also, when testing, you should simulate what actually happened in production, i.e. an InnoDB assertion failure due to a corrupted index.

Comment by Jan Lindström [ 2023-09-06 ]

rpizzi In my understanding pc.weight=0 is not a good choice here, because it means that if one node goes down, the rest of the nodes in the cluster will lose Primary status. See https://galeracluster.com/library/documentation/weighted-quorum.html

The InnoDB index corruption is most likely not caused by Galera and requires additional investigation. The stack trace is quite limited here, but that is out of scope for me anyway.

Comment by Rick Pizzi [ 2023-09-06 ]

It is the opposite. Weight=0 means the node does not participate in quorum, and its online/offline status does not affect the quorum calculation.
This ticket is not about finding the source of the index corruption. We need to find out why all nodes went non-Primary when this happened.
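The disagreement about pc.weight can be checked with arithmetic. Below is a rough sketch of the weighted-quorum rule from the Galera documentation linked earlier (a simplification; the real computation also tracks the last-seen Primary component): a surviving component stays Primary only if its weight is strictly more than half of the previous Primary component's total weight.

```python
# Simplified sketch of Galera's weighted quorum rule; illustrative only.
def stays_primary(surviving: list[int], previous: list[int]) -> bool:
    """True if the surviving members keep strictly more than half
    of the total pc.weight of the previous Primary component."""
    return sum(surviving) * 2 > sum(previous)

# 4-node cluster with node 4 at pc.weight=0 (total weight 3), as described:
# - node 4 (weight 0) drops: 3 of 3 survives  -> Primary
# - node 2 (weight 1) crashes: 2 of 3 survives -> Primary
# With pc.weight=0 on ALL nodes (total weight 0), 0 > 0 is never true,
# so every partition goes non-Primary.
```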

Thanks,
Rick

Comment by Jan Lindström [ 2023-09-07 ]

rpizzi I tested this on 10.6 with a 4-node cluster where I set pc.weight=0 on all nodes. Then I used mysqladmin to shut down one of the nodes. All the remaining nodes went non-Primary, as the documentation hints. It appears that a weight of 0 contributes nothing to the PC, and such a cluster ends up in split brain after one node drops out.

Comment by Rick Pizzi [ 2023-09-07 ]

You cannot set pc.weight to 0 on all nodes.
As I said, only node 4 had weight 0, so that the quorum calculation would ignore that node.
Please test accordingly.

Thanks
Rick

Comment by Jan Lindström [ 2023-09-07 ]

rpizzi Thanks for pointing that out. I tried again with a 4-node cluster on 10.6, with pc.weight=0 only on node_4 and an assertion in the same place as reported. However, I could not reproduce the problem of the other nodes dropping from Primary state.

Node_2 fails at exactly the same place:

mysys/stacktrace.c:215(my_print_stacktrace)[0x562347b46973]
sql/signal_handler.cc:241(handle_fatal_signal)[0x5623471ee0cb]
libc_sigaction.c:0(__restore_rt)[0x7f94d6e3c4b0]
nptl/pthread_kill.c:44(__pthread_kill_implementation)[0x7f94d6e90ffb]
posix/raise.c:27(__GI_raise)[0x7f94d6e3c406]
stdlib/abort.c:81(__GI_abort)[0x7f94d6e2287c]
intl/loadmsgcat.c:1177(_nl_load_domain)[0x7f94d6e2279b]
/lib/x86_64-linux-gnu/libc.so.6(+0x33b86)[0x7f94d6e33b86]
page/page0zip.cc:4216(page_zip_dir_insert(page_cur_t*, unsigned short, unsigned char*, mtr_t*))[0x5623477bba08]
page/page0cur.cc:2143(page_cur_insert_rec_zip(page_cur_t*, unsigned char const*, unsigned short*, mtr_t*))[0x56234779459a]
include/page0cur.inl:195(page_cur_tuple_insert(page_cur_t*, dtuple_t const*, unsigned short**, mem_block_info_t**, unsigned long, mtr_t*))[0x56234793010c]
btr/btr0cur.cc:2491(btr_cur_optimistic_insert(unsigned long, btr_cur_t*, unsigned short**, mem_block_info_t**, dtuple_t*, unsigned char**, big_rec_t**, unsigned long, que_thr_t*, mtr_t*))[0x56234793ba1f]
row/row0ins.cc:2852(row_ins_clust_index_entry_low(unsigned long, btr_latch_mode, dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*))[0x562347812880]
row/row0ins.cc:3242(row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, unsigned long))[0x562347813dd2]
row/row0ins.cc:3368(row_ins_index_entry(dict_index_t*, dtuple_t*, que_thr_t*))[0x56234781436a]
row/row0ins.cc:3536(row_ins_index_entry_step(ins_node_t*, que_thr_t*))[0x562347814cc2]
row/row0ins.cc:3661(row_ins(ins_node_t*, que_thr_t*))[0x5623478151f4]
row/row0ins.cc:3790(row_ins_step(que_thr_t*))[0x5623478159e9]
row/row0mysql.cc:1317(row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t))[0x56234783803d]
handler/ha_innodb.cc:7907(ha_innobase::write_row(unsigned char const*))[0x5623476550d1]
sql/handler.cc:7639(handler::ha_write_row(unsigned char const*))[0x562347208bde]
sql/sql_insert.cc:2166(write_record(THD*, TABLE*, st_copy_info*, select_result*))[0x562346dd4a08]
sql/sql_insert.cc:1131(mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*))[0x562346dd1467]
sql/sql_parse.cc:4580(mysql_execute_command(THD*, bool))[0x562346e28fbe]
sql/sp_head.cc:3843(sp_instr_stmt::exec_core(THD*, unsigned int*))[0x562346d1e3cb]
sql/sp_head.cc:3568(sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*))[0x562346d1d69d]
sql/sp_head.cc:3749(sp_instr_stmt::execute(THD*, unsigned int*))[0x562346d1df53]
sql/sp_head.cc:1442(sp_head::execute(THD*, bool))[0x562346d17047]
sql/sp_head.cc:2485(sp_head::execute_procedure(THD*, List<Item>*))[0x562346d19fac]
sql/sql_parse.cc:3036(do_execute_sp(THD*, sp_head*))[0x562346e23bf7]
sql/sql_parse.cc:3282(Sql_cmd_call::execute(THD*))[0x562346e2488a]
sql/sql_parse.cc:6024(mysql_execute_command(THD*, bool))[0x562346e2ebfa]
sql/sql_parse.cc:8048(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x562346e34f78]
sql/sql_parse.cc:7871(wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x562346e3462a]
sql/sql_parse.cc:1883(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x562346e20776]
sql/sql_parse.cc:1409(do_command(THD*, bool))[0x562346e1f1b2]
sql/sql_connect.cc:1416(do_handle_one_connection(CONNECT*, bool))[0x562346ff7cfe]
sql/sql_connect.cc:1320(handle_one_connection)[0x562346ff7a67]
perfschema/pfs.cc:2203(pfs_spawn_thread)[0x5623475662a2]
nptl/pthread_create.c:444(start_thread)[0x7f94d6e8f18a]
x86_64/clone3.S:83(clone3)[0x7f94d6f1dbd0]

From node_1

mysql> show status like 'wsrep%';
--------------
show status like 'wsrep%'
--------------
 
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name                 | Value                                                                                                                                          |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid        | 5f8f1ef8-4d59-11ee-b03b-1edf34a26753                                                                                                           |
| wsrep_protocol_version        | 10                                                                                                                                             |
| wsrep_last_committed          | 115233                                                                                                                                         |
| wsrep_replicated              | 110496                                                                                                                                         |
| wsrep_replicated_bytes        | 35343800                                                                                                                                       |
| wsrep_repl_keys               | 331483                                                                                                                                         |
| wsrep_repl_keys_bytes         | 5303768                                                                                                                                        |
| wsrep_repl_data_bytes         | 22387370                                                                                                                                       |
| wsrep_repl_other_bytes        | 0                                                                                                                                              |
| wsrep_received                | 7977                                                                                                                                           |
| wsrep_received_bytes          | 1563770                                                                                                                                        |
| wsrep_local_commits           | 110493                                                                                                                                         |
| wsrep_local_cert_failures     | 0                                                                                                                                              |
| wsrep_local_replays           | 0                                                                                                                                              |
| wsrep_local_send_queue        | 0                                                                                                                                              |
| wsrep_local_send_queue_max    | 2                                                                                                                                              |
| wsrep_local_send_queue_min    | 0                                                                                                                                              |
| wsrep_local_send_queue_avg    | 1.79535e-05                                                                                                                                    |
| wsrep_local_recv_queue        | 0                                                                                                                                              |
| wsrep_local_recv_queue_max    | 7                                                                                                                                              |
| wsrep_local_recv_queue_min    | 0                                                                                                                                              |
| wsrep_local_recv_queue_avg    | 0.0208098                                                                                                                                      |
| wsrep_local_cached_downto     | 84794                                                                                                                                          |
| wsrep_flow_control_paused_ns  | 11834047189                                                                                                                                    |
| wsrep_flow_control_paused     | 0.0199474                                                                                                                                      |
| wsrep_flow_control_sent       | 0                                                                                                                                              |
| wsrep_flow_control_recv       | 1                                                                                                                                              |
| wsrep_flow_control_active     | false                                                                                                                                          |
| wsrep_flow_control_requested  | false                                                                                                                                          |
| wsrep_cert_deps_distance      | 93.4516                                                                                                                                        |
| wsrep_apply_oooe              | 0.069626                                                                                                                                       |
| wsrep_apply_oool              | 0.00321965                                                                                                                                     |
| wsrep_apply_window            | 1.09509                                                                                                                                        |
| wsrep_apply_waits             | 0                                                                                                                                              |
| wsrep_commit_oooe             | 0                                                                                                                                              |
| wsrep_commit_oool             | 0                                                                                                                                              |
| wsrep_commit_window           | 1.00368                                                                                                                                        |
| wsrep_local_state             | 4                                                                                                                                              |
| wsrep_local_state_comment     | Synced                                                                                                                                         |
| wsrep_cert_index_size         | 93                                                                                                                                             |
| wsrep_causal_reads            | 11                                                                                                                                             |
| wsrep_cert_interval           | 0.10328                                                                                                                                        |
| wsrep_open_transactions       | 2                                                                                                                                              |
| wsrep_open_connections        | 0                                                                                                                                              |
| wsrep_incoming_addresses      | 127.0.0.1:16020,127.0.0.1:16022,127.0.0.1:16023                                                                                                |
| wsrep_cluster_weight          | 2                                                                                                                                              |
| wsrep_debug_sync_waiters      |                                                                                                                                                |
| wsrep_desync_count            | 0                                                                                                                                              |
| wsrep_evs_delayed             |                                                                                                                                                |
| wsrep_evs_evict_list          |                                                                                                                                                |
| wsrep_evs_repl_latency        | 0.000234768/0.000425666/0.0120175/0.000450524/681                                                                                              |
| wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
| wsrep_gcomm_uuid              | 5f8e47b5-4d59-11ee-82e6-43df7302848a                                                                                                           |
| wsrep_gmcast_segment          | 0                                                                                                                                              |
| wsrep_applier_thread_count    | 4                                                                                                                                              |
| wsrep_cluster_capabilities    |                                                                                                                                                |
| wsrep_cluster_conf_id         | 3                                                                                                                                              |
| wsrep_cluster_size            | 3                                                                                                                                              |
| wsrep_cluster_state_uuid      | 5f8f1ef8-4d59-11ee-b03b-1edf34a26753                                                                                                           |
| wsrep_cluster_status          | Primary                                                                                                                                        |
| wsrep_connected               | ON                                                                                                                                             |
| wsrep_local_bf_aborts         | 0                                                                                                                                              |
| wsrep_local_index             | 0                                                                                                                                              |
| wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name           | Galera                                                                                                                                         |
| wsrep_provider_vendor         | Codership Oy <info@codership.com>                                                                                                              |
| wsrep_provider_version        | 26.4.14(r75464733)                                                                                                                             |
| wsrep_ready                   | ON                                                                                                                                             |
| wsrep_rollbacker_thread_count | 1                                                                                                                                              |
| wsrep_thread_count            | 5                                                                                                                                              |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
70 rows in set (0,00 sec)

Comment by Rick Pizzi [ 2023-09-07 ]

I rechecked the logs of this failure. It happened on node 2.
It appears that after the assertion, the asserting thread took a VERY long time to dump the stack, and a core file also had to be generated after that.

See below for the sequence; you can clearly see that while the asserting thread is dumping the stack, WSREP is still talking to the other nodes.
Hope this helps. Maybe you should enable core-file and see if that makes a difference.

: NO)
2023-07-07 15:50:19 11985573 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2023-07-07 15:50:19 11985573 [ERROR] mariadbd: Index for table 'MAJ_EVENEMENTS_RAPPROCHEMENT' is corrupt; try to repair it
2023-07-07 15:50:20 0x7f17f7766700  InnoDB: Assertion failure in file /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/page/page0zip.cc line 4213
InnoDB: Failing assertion: slot_rec
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
230707 15:50:20 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.12-7-MariaDB-enterprise-log source revision: 8e2b75dad28995ab5f6e6acd436135420f7031c9
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=2243
max_threads=6002
thread_count=1565
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13479553 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7ef735ab51c8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f17f7765cb8 thread_stack 0x49000
Can't start addr2line
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5608f6116c7e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5608f5bc33a5]
/lib64/libpthread.so.0(+0xf630)[0x7f2011c15630]
/lib64/libc.so.6(gsignal+0x37)[0x7f2011060387]
/lib64/libc.so.6(abort+0x148)[0x7f2011061a78]
/usr/sbin/mariadbd(+0x694d97)[0x5608f5834d97]
/usr/sbin/mariadbd(+0xdbfb05)[0x5608f5f5fb05]
/usr/sbin/mariadbd(+0xdaf516)[0x5608f5f4f516]
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 766 rttvar: 579 rto: 201000 lost: 0 last_data_recv: 2567 cwnd: 10 last_queued_since: 8776161264 last_delivered_since: 11959172679 send_queue_length: 9 send_queue_bytes: 720 segment: 0 messages: 9
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.103:4567
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 6407 rttvar: 10277 rto: 207000 lost: 0 last_data_recv: 5900 cwnd: 10 last_queued_since: 307819 last_delivered_since: 8781038225 send_queue_length: 10 send_queue_bytes: 1080 segment: 0 messages: 10
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.104:42156
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 5591 rttvar: 10094 rto: 206000 lost: 0 last_data_recv: 5924 cwnd: 10 last_queued_since: 10916 last_delivered_since: 8781705783 send_queue_length: 11 send_queue_bytes: 1292 segment: 0 messages: 11
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.101:33510
2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.1.101:4567 tcp://10.10.1.103:4567 tcp://10.10.1.104:4567 
2023-07-07 15:50:32 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT12.0655S), skipping check
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7f17e8905e58
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7ef49da77b98
2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
/usr/sbin/mariadbd(+0xe62a65)[0x5608f6002a65]
2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 96c49f4b-8727 (tcp://10.10.1.101:4567), attempt 0
2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 844de70f-8aaf (tcp://10.10.1.103:4567), attempt 0
2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 5dda822d-b4c3 (tcp://10.10.1.104:4567), attempt 0
2023-07-07 15:50:33 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.73771S), skipping check
/usr/sbin/mariadbd(+0xe4e639)[0x5608f5fee639]
/usr/sbin/mariadbd(+0xe5063b)[0x5608f5ff063b]
/usr/sbin/mariadbd(+0xe62e98)[0x5608f6002e98]
/usr/sbin/mariadbd(+0xde0227)[0x5608f5f80227]
/usr/sbin/mariadbd(+0xde2da4)[0x5608f5f82da4]
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 1000 rttvar: 500 rto: 201000 lost: 0 last_data_recv: 125408244 cwnd: 10 last_queued_since: 4421911460 last_delivered_since: 8715642835357354 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 359 rttvar: 179 rto: 201000 lost: 0 last_data_recv: 7528 cwnd: 10 last_queued_since: 120133 last_delivered_since: 120133 send_queue_length: 0 send_queue_bytes: 0
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 437 rttvar: 218 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 4131087 last_delivered_since: 4131087 send_queue_length: 0 send_queue_bytes: 0
2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 181 rttvar: 90 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 8715642839939685 last_delivered_since: 8715642839939685 send_queue_length: 0 send_queue_bytes: 0
2023-07-07 15:50:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.53388S), skipping check
/usr/sbin/mariadbd(+0xe151ab)[0x5608f5fb51ab]
2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer handle_wait Success for 0x7f17ebc5b168
2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer destruct
2023-07-07 15:50:43 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.61329S), skipping check
/usr/sbin/mariadbd(+0xe15869)[0x5608f5fb5869]
2023-07-07 15:50:44 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.63124S), skipping check
2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 5dda822d-b4c3
2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 844de70f-8aaf
2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 96c49f4b-8727
/usr/sbin/mariadbd(+0xdf28b2)[0x5608f5f928b2]
/usr/sbin/mariadbd(+0xd43ca8)[0x5608f5ee3ca8]
/usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x232)[0x5608f5bd12b2]
/usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1a63)[0x5608f5a5cf33]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x263e)[0x5608f597d38e]
/usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
/usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
/usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
/usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125424590 cwnd: 10 last_queued_since: 11598743653 last_delivered_since: 8715659181935888 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 185 rttvar: 82 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11598929590 last_delivered_since: 11598962474 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 264 rttvar: 105 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11599134047 last_delivered_since: 11599140957 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 384 rttvar: 151 rto: 201000 lost: 0 last_data_recv: 5678 cwnd: 10 last_queued_since: 11599486516 last_delivered_since: 11599494477 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
/usr/sbin/mariadbd(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x66a)[0x5608f58d093a]
/usr/sbin/mariadbd(+0x7cfc17)[0x5608f596fc17]
/usr/sbin/mariadbd(+0x7d3a68)[0x5608f5973a68]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x101a)[0x5608f597bd6a]
2023-07-07 15:51:04 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125430994 cwnd: 10 last_queued_since: 221024 last_delivered_since: 8715665585967344 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-07 15:51:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT18.0029S), skipping check
2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,5dda822d-b4c3,50) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
	5dda822d-b4c3,0
	844de70f-8aaf,0
	96c49f4b-8727,0
})
2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,51) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
	5dda822d-b4c3,0
	844de70f-8aaf,0
	96c49f4b-8727,0
})
2023-07-07 15:51:04 11994604 [Warning] WSREP: Send action {(nil), 139603616989752, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-07 15:51:04 0 [Note] WSREP: Flow-control interval: [240, 300]
2023-07-07 15:51:04 0 [Note] WSREP: Received NON-PRIMARY.
2023-07-07 15:51:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6204240577)
2023-07-07 15:51:04 11955214 [Warning] WSREP: Send action {(nil), 139599322023456, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11985855 [Warning] WSREP: Send action {(nil), 139603616990584, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11987046 [Warning] WSREP: Send action {(nil), 139599322023328, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11985820 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 46 [Note] WSREP: ================================================
View:
  id: c3a51458-b6fd-11eb-8a80-eb35c100e72c:6204240577
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 6c357751-ce4f-11ed-8d5f-136e7094748b, PIXID-MDB-MASTER2
=================================================
2023-07-07 15:51:04 46 [Note] WSREP: Non-primary view
2023-07-07 15:51:04 46 [Note] WSREP: Server status change synced -> connected
2023-07-07 15:51:04 46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-07-07 15:51:04 11997537 [Warning] WSREP: Send action {(nil), 139603616989760, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11996647 [Warning] WSREP: Send action {(nil), 139573552218680, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11986459 [Warning] WSREP: Send action {(nil), 139736760976944, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11997986 [Warning] WSREP: Send action {(nil), 139599322023552, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11985505 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:04 11988311 [Warning] WSREP: Send action {(nil), 139607911957872, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11917365 [Warning] WSREP: Send action {(nil), 139586437121400, WRITESET} returned -107 (Transport endpoint is not connected)
/usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
2023-07-07 15:51:06 11985895 [Warning] WSREP: Send action {(nil), 139590732088096, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-07 15:51:06 0 [Note] WSREP: Flow-control interval: [240, 300]
2023-07-07 15:51:06 0 [Note] WSREP: Received NON-PRIMARY.
2023-07-07 15:51:06 11978506 [Warning] WSREP: Send action {(nil), 139599322023472, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11997530 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988302 [Warning] WSREP: Send action {(nil), 139599322023760, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988512 [Warning] WSREP: Send action {(nil), 139736760977344, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988290 [Warning] WSREP: Send action {(nil), 139595027055344, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11998006 [Warning] WSREP: Send action {(nil), 139599322023888, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988294 [Warning] WSREP: Send action {(nil), 139603616990632, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11991973 [Warning] WSREP: Send action {(nil), 139599322023712, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988288 [Warning] WSREP: Send action {(nil), 139595027057080, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11985860 [Warning] WSREP: Send action {(nil), 139577847186808, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11997914 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11987069 [Warning] WSREP: Send action {(nil), 139599322023280, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11988354 [Warning] WSREP: Send action {(nil), 139736760976760, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11986489 [Warning] WSREP: Send action {(nil), 139564962285120, WRITESET} returned -107 (Transport endpoint is not connected)
2023-07-07 15:51:06 11986139 [Warning] WSREP: Send action {(nil), 139582142155592, WRITESET} returned -107 (Transport endpoint is not connected)
/usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
2023-07-07 15:51:07 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.43448S), skipping check
/usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
/usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
2023-07-07 15:51:10 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.93087S), skipping check
/usr/sbin/mariadbd(_ZN7sp_head15execute_triggerEP3THDPK25st_mysql_const_lex_stringS4_P13st_grant_info+0x1df)[0x5608f58d008f]
2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 5dda822d-b4c3 tcp://10.10.1.104:4567
2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 844de70f-8aaf tcp://10.10.1.103:4567
/usr/sbin/mariadbd(_ZN19Table_triggers_list16process_triggersEP3THD14trg_event_type20trg_action_time_typeb+0x104)[0x5608f5a40ec4]
2023-07-07 15:51:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.87478S), skipping check
/usr/sbin/mariadbd(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xd99)[0x5608f5d33da9]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x24c8)[0x5608f597d218]
2023-07-07 15:51:16 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.90666S), skipping check
/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a]
/usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531]
2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567
2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check
/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31]
2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired
evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) {
current_view=view(view_id(REG,6c357751-8d5f,51) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} },
fifo_seq=1874639154,
last_sent=2,
known:
5dda822d-b4c3 at tcp://10.10.1.104:4567
{o=1,s=0,i=0,fs=834192942,jm=
{v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=(
	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
)
},
}
6c357751-8d5f at 
{o=1,s=0,i=0,fs=-1,jm=
{v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=(
	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],}
)
},
}
844de70f-8aaf at tcp://10.10.1.103:4567
{o=1,s=0,i=0,fs=1475544355,jm=
{v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=(
	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
)
},
}
96c49f4b-8727 at tcp://10.10.1.101:4567
{o=0,s=0,i=0,fs=101154494,}
 }
2023-07-07 15:51:20 0 [Note] WSREP: no install message received
2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb {
	6c357751-8d5f,0
} joined {
} left {
} partitioned {
	5dda822d-b4c3,0
	844de70f-8aaf,0
	96c49f4b-8727,0
})
2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300]
2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY.
/usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942]
/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7]
2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check
/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d]
2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected)
/usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2]
/lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5]
/lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7ef692470578): UPDATE DWHTmp.MAJ_EVENEMENTS_RAPPROCHEMENT
			SET TYPE_EVE =  NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE =  NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL  WHERE EVT_ID =  NAME_CONST('V_EVT_ID',2051583478)
 
Connection ID (thread ID): 11985593
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /data/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             805978               805978               processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       805978               805978               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
 
Kernel version: Linux version 3.10.0-1160.88.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Mar 7 15:41:52 UTC 2023
 
2023-07-07 17:08:23 0 [Note] Starting MariaDB 10.6.12-7-MariaDB-enterprise-log source revision 8e2b75dad28995ab5f6e6acd436135420f7031c9 as process 1083

Comment by Jan Lindström [ 2023-09-07 ]

rpizzi Based on these error logs, the remaining nodes did not agree on the state of the absent node, so they decided to exclude each other from the group. See here:

	6c357751-8d5f, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
vs.
	6c357751-8d5f, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
vs
	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}

where o = operational and s = suspected. The first node thinks both are 0 (false), the second thinks the node is suspected, and the last thinks the node is still operational. This depends on input and timing.

The inconsistency issue found by ramesh is a bug (MDEV-32122), but it is not related to the problem here.
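The three-way disagreement described above can be made concrete with a small parser. This is a hypothetical illustration, not MariaDB or Galera code; it only extracts the `o`/`s` flags from node-list entries in the format quoted in this comment:

```python
import re

# Node-list entries as quoted above (o = operational, s = suspected).
ENTRY_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}, \{o=(?P<o>\d),s=(?P<s>\d),")

def node_state(entry: str) -> str:
    """Classify one nl= entry the way the comment describes."""
    m = ENTRY_RE.search(entry)
    o, s = int(m.group("o")), int(m.group("s"))
    if o:
        return "operational"
    return "suspected" if s else "down"

# The three views of node 6c357751-8d5f quoted in the comment above:
views = [
    "6c357751-8d5f, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),}",
    "6c357751-8d5f, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),}",
    "6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),}",
]
states = [node_state(v) for v in views]
print(states)  # three peers, three different opinions of the same node
```

Each peer classifies the crashing node differently from the same flag format, which is exactly the disagreement that prevents the group from converging.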

Comment by Rick Pizzi [ 2023-09-07 ]

Well, this confirms the issue, doesn't it...
I believe the disagreement comes from the fact that the WSREP layer did not die when the server did, and hence responded to requests from other nodes in an inconsistent manner.

Comment by Jan Lindström [ 2023-09-07 ]

rpizzi Yes. serg Signals that are raised or sent to the process (rather than to a specific thread) will still be handled by a random thread (among those that do not block them). So is there any way to make those wsrep threads die faster? On the other nodes there is not much we can do, as their knowledge of the node state depends on when state information was last received and/or requested, but it would help if the crashing node's threads that do traffic to other nodes were brought down as soon as possible.

Comment by Michael Widenius [ 2023-09-08 ]

When the server crashes, there is usually not much to do except print a stack trace and call exit().
In theory, on a crash we could send some kind of signal to the WSREP threads (if they have a THD, we can mark it killed).
Would marking the THD as killed help?
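As a toy model of the "mark the THD as killed" idea (this is an illustration in Python, not MariaDB internals), consider applier threads that poll a shared kill flag which a crash handler sets:

```python
import threading
import time

class Applier(threading.Thread):
    """Toy stand-in for a wsrep applier thread that polls a kill flag,
    analogous to its THD being marked killed by a crash handler."""
    def __init__(self, killed: threading.Event):
        super().__init__()
        self.killed = killed
        self.applied = 0

    def run(self):
        while not self.killed.is_set():
            self.applied += 1          # "apply a write set"
            time.sleep(0.001)

killed = threading.Event()
appliers = [Applier(killed) for _ in range(3)]
for a in appliers:
    a.start()
time.sleep(0.05)
killed.set()                            # crash handler marks all THDs killed
for a in appliers:
    a.join(timeout=2)
print(all(not a.is_alive() for a in appliers))
```

Note the limitation: only threads that actually poll the flag are reachable this way; threads blocked inside the Galera library's networking code would never observe it.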

Comment by Jan Lindström [ 2023-09-11 ]

Marking the THD as killed would help only for appliers, not for the thread(s) used inside the Galera library for connections to other nodes.

Comment by Rick Pizzi [ 2023-09-11 ]

monty the issue here is that "printing a stack trace" takes, like, 5 minutes.
In the meantime the WSREP threads are alive and sending inconsistent information to other nodes.

Comment by Jan Lindström [ 2023-10-05 ]

marko serg Is there anything we can do for this? We are talking here about assertions in release builds. Whatever we do needs to happen before we enter the signal handler and start writing the core dump, as that can take a long time. Currently, we could perhaps set the thd to the killed state for the wsrep applier threads (not sure if this is enough). But that still does not make the node unreachable from Galera's point of view. There is currently no way to disconnect all incoming/outgoing connections inside Galera from server code; that would partly solve the issue, but again it depends on timing, i.e. when the other nodes ask about or discover the crashing node's status. Because this is asynchronous, there is still a risk that the remaining nodes do not agree on the state of the crashing node (i.e. whether it is operational or suspected).

Comment by Sergei Golubchik [ 2023-10-09 ]

janlindstrom, why do the remaining nodes not agree on the state of the crashing node?

Comment by Jan Lindström [ 2023-10-10 ]

serg See my comment of 2023-09-07 10:47: one node thinks the node is down, the second thinks it is suspected, and the last thinks it is still operational. This is because nodes notice that a node is down, or suspect it to be down, based on the information they have received, and when they receive it is timing dependent. If one node thinks the crashing node is still operational, that must be because it received something from that node before (or during) the crash, and either the information that the node is down has not yet arrived or some connection timeout has not yet been reached.

My question was not how to improve this agreement on node states; it was how to make the crashing node more unreachable, e.g. by killing appliers and closing all incoming and outgoing connections earlier.

Comment by Sergei Golubchik [ 2023-10-11 ]

If the crashing node needs to send three messages, sequentially, to three different nodes, then there will always be a race condition. Whether you kill them earlier or later, the node won't die instantly as a whole. Galera must be able to cope with that; otherwise any node crash could break the cluster.

But I don't understand why Galera cannot cope with it. Nodes send messages to peers in a specific order. So if node A with a lower number thinks that some node X is up, and node B with a higher number thinks that node X is down, it means that node X crashed after sending a message to A but before sending one to B. This is easy to detect.
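The ordering argument above can be sketched in a few lines. This is a toy illustration, not Galera code, under the stated assumption that a node sends to its peers strictly in a fixed order and then crashes at a single point:

```python
# If a node sends to peers strictly in order and crashes mid-sequence,
# the peers that heard from it must form a prefix of that send order.
def crash_consistent(order, saw_message):
    """True if the set of peers that received a message is a prefix of
    the send order, i.e. explainable by a single crash point."""
    flags = [saw_message[peer] for peer in order]
    # Once a False appears, no True may follow.
    return all(flags[i] or not flags[i + 1] for i in range(len(flags) - 1))

order = ["A", "B", "C"]
# X crashed after sending to A: consistent with a single crash point.
print(crash_consistent(order, {"A": True, "B": False, "C": False}))
# B heard from X but A did not: not a prefix, so not explainable this way.
print(crash_consistent(order, {"A": False, "B": True, "C": False}))
```

Any observation set that is not a prefix of the send order cannot be explained by a single clean crash, which is why such disagreement would, in principle, be detectable.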

Comment by Jan Lindström [ 2023-10-27 ]

teemu.ollakka Can you explain why the remaining nodes fail to agree on the crashing node's state and start to self-leave?

Comment by Marko Mäkelä [ 2023-11-09 ]

The reported assertion failure here occurs when a record is being inserted into a corrupted ROW_FORMAT=COMPRESSED InnoDB page. This crash was not removed in MDEV-13542. An obvious workaround for this particular case would be to avoid using ROW_FORMAT=COMPRESSED tables. Some design mistakes are not easy to fix; see MDEV-30882 and MDEV-31574.

Comment by Jan Lindström [ 2024-02-02 ]

rpizzi I tried to reproduce this issue with the latest 10.6 and Galera library 26.4.17 using the attached test case. After several hours of testing, I still could not reproduce the issue.

Comment by Marko Mäkelä [ 2024-02-02 ]

janlindstrom, I see that galera_crash_node.test uses debug injection for crashing a node at a specific point of execution. I think that a more realistic test scenario would be to run CMAKE_BUILD_TYPE=RelWithDebInfo executables and randomly kill one of the cluster nodes externally (by kill -KILL).

Comment by Jan Lindström [ 2024-02-06 ]

I tested again for hours with the following setup:

  • 10.6 commit bde552ae RelWithDebInfo build
  • Galera library 26.4.17 release build
  • 3 node cluster
  • sysbench load to node_1 and node_3 (oltp_read_write)
  • kill -9 node_3 after a while and restart node_3 + sysbench load

Result: the remaining nodes stayed up and running as expected, i.e. I could not reproduce the issue.

Comment by Rick Pizzi [ 2024-02-06 ]

I don't think that killing the node with kill -9 will ever reproduce it.
As explained, it has to be a code assertion.

Rick

Comment by Rick Pizzi [ 2024-02-06 ]

The whole point of this ticket is that the WSREP layer remains active after the assertion generates the trap.
Killing the process with SIGKILL will not allow the code to do anything, including executing the trap code.

Comment by Marko Mäkelä [ 2024-02-06 ]

janlindstrom, did you test with kill -ABRT as well? I think that it should trigger our built-in stack trace reporter, which, depending on the circumstances, could hang or cause unexpected behaviour.

Comment by Jan Lindström [ 2024-02-07 ]

rpizzi marko Both cases were tested (over several test rounds) and the remaining nodes stayed in the cluster normally, i.e. I could not reproduce the case where all nodes leave the cluster.

Generated at Thu Feb 08 10:30:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.