  MariaDB Server / MDEV-32363

When InnoDB gets an assertion failure, the WSREP layer is not handled gracefully

Details

    Description

      If InnoDB hits an assertion failure, the WSREP layer is not immediately notified, and as a consequence all nodes eventually lose primary status.

      You can see below that WSREP messages are interleaved with the assertion stack trace, and that WSREP keeps trying to reconnect to its peers even though the node has effectively crashed and will die shortly.

      When this happens, the entire cluster goes non-primary and a cluster bootstrap is needed to recover. This is not the expected behaviour: the crashed node should simply be evicted and the cluster should continue normally.
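
      For reference, the "cluster bootstrap" above is the standard Galera procedure of forcing a new primary component on one of the surviving nodes; roughly (a sketch — the node to bootstrap should be the one with the most advanced state):

      SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
      -- then verify on each node that the cluster regained primary status
      SHOW STATUS LIKE 'wsrep_cluster_status';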

      2023-07-07 15:50:19 11985573 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
      2023-07-07 15:50:19 11985573 [ERROR] mariadbd: Index for table 'failed_table' is corrupt; try to repair it
      2023-07-07 15:50:20 0x7f17f7766700  InnoDB: Assertion failure in file /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/page/page0zip.cc line 4213
      InnoDB: Failing assertion: slot_rec
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mariadbd startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      230707 15:50:20 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.6.12-7-MariaDB-enterprise-log source revision: 8e2b75dad28995ab5f6e6acd436135420f7031c9
      key_buffer_size=268435456
      read_buffer_size=131072
      max_used_connections=2243
      max_threads=6002
      thread_count=1565
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13479553 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x7ef735ab51c8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f17f7765cb8 thread_stack 0x49000
      Can't start addr2line
      /usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5608f6116c7e]
      /usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5608f5bc33a5]
      /lib64/libpthread.so.0(+0xf630)[0x7f2011c15630]
      /lib64/libc.so.6(gsignal+0x37)[0x7f2011060387]
      /lib64/libc.so.6(abort+0x148)[0x7f2011061a78]
      /usr/sbin/mariadbd(+0x694d97)[0x5608f5834d97]
      /usr/sbin/mariadbd(+0xdbfb05)[0x5608f5f5fb05]
      /usr/sbin/mariadbd(+0xdaf516)[0x5608f5f4f516]
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 766 rttvar: 579 rto: 201000 lost: 0 last_data_recv: 2567 cwnd: 10 last_queued_since: 8776161264 last_delivered_since: 11959172679 send_queue_length: 9 send_queue_bytes: 720 segment: 0 messages: 9
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.103:4567
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 6407 rttvar: 10277 rto: 207000 lost: 0 last_data_recv: 5900 cwnd: 10 last_queued_since: 307819 last_delivered_since: 8781038225 send_queue_length: 10 send_queue_bytes: 1080 segment: 0 messages: 10
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.104:42156
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 5591 rttvar: 10094 rto: 206000 lost: 0 last_data_recv: 5924 cwnd: 10 last_queued_since: 10916 last_delivered_since: 8781705783 send_queue_length: 11 send_queue_bytes: 1292 segment: 0 messages: 11
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.101:33510
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.1.101:4567 tcp://10.10.1.103:4567 tcp://10.10.1.104:4567 
      2023-07-07 15:50:32 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT12.0655S), skipping check
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7f17e8905e58
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7ef49da77b98
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
      /usr/sbin/mariadbd(+0xe62a65)[0x5608f6002a65]
      2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 96c49f4b-8727 (tcp://10.10.1.101:4567), attempt 0
      2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 844de70f-8aaf (tcp://10.10.1.103:4567), attempt 0
      2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 5dda822d-b4c3 (tcp://10.10.1.104:4567), attempt 0
      2023-07-07 15:50:33 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.73771S), skipping check
      /usr/sbin/mariadbd(+0xe4e639)[0x5608f5fee639]
      /usr/sbin/mariadbd(+0xe5063b)[0x5608f5ff063b]
      /usr/sbin/mariadbd(+0xe62e98)[0x5608f6002e98]
      /usr/sbin/mariadbd(+0xde0227)[0x5608f5f80227]
      /usr/sbin/mariadbd(+0xde2da4)[0x5608f5f82da4]
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 1000 rttvar: 500 rto: 201000 lost: 0 last_data_recv: 125408244 cwnd: 10 last_queued_since: 4421911460 last_delivered_since: 8715642835357354 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 359 rttvar: 179 rto: 201000 lost: 0 last_data_recv: 7528 cwnd: 10 last_queued_since: 120133 last_delivered_since: 120133 send_queue_length: 0 send_queue_bytes: 0
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 437 rttvar: 218 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 4131087 last_delivered_since: 4131087 send_queue_length: 0 send_queue_bytes: 0
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 181 rttvar: 90 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 8715642839939685 last_delivered_since: 8715642839939685 send_queue_length: 0 send_queue_bytes: 0
      2023-07-07 15:50:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.53388S), skipping check
      /usr/sbin/mariadbd(+0xe151ab)[0x5608f5fb51ab]
      2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer handle_wait Success for 0x7f17ebc5b168
      2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer destruct
      2023-07-07 15:50:43 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.61329S), skipping check
      /usr/sbin/mariadbd(+0xe15869)[0x5608f5fb5869]
      2023-07-07 15:50:44 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.63124S), skipping check
      2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 5dda822d-b4c3
      2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 844de70f-8aaf
      2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 96c49f4b-8727
      /usr/sbin/mariadbd(+0xdf28b2)[0x5608f5f928b2]
      /usr/sbin/mariadbd(+0xd43ca8)[0x5608f5ee3ca8]
      /usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x232)[0x5608f5bd12b2]
      /usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1a63)[0x5608f5a5cf33]
      /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x263e)[0x5608f597d38e]
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
      /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
      /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125424590 cwnd: 10 last_queued_since: 11598743653 last_delivered_since: 8715659181935888 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 185 rttvar: 82 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11598929590 last_delivered_since: 11598962474 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 264 rttvar: 105 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11599134047 last_delivered_since: 11599140957 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 384 rttvar: 151 rto: 201000 lost: 0 last_data_recv: 5678 cwnd: 10 last_queued_since: 11599486516 last_delivered_since: 11599494477 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      /usr/sbin/mariadbd(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x66a)[0x5608f58d093a]
      /usr/sbin/mariadbd(+0x7cfc17)[0x5608f596fc17]
      /usr/sbin/mariadbd(+0x7d3a68)[0x5608f5973a68]
      /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x101a)[0x5608f597bd6a]
      2023-07-07 15:51:04 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125430994 cwnd: 10 last_queued_since: 221024 last_delivered_since: 8715665585967344 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:51:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT18.0029S), skipping check
      2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,5dda822d-b4c3,50) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      	5dda822d-b4c3,0
      	844de70f-8aaf,0
      	96c49f4b-8727,0
      })
      2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,51) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      	5dda822d-b4c3,0
      	844de70f-8aaf,0
      	96c49f4b-8727,0
      })
      2023-07-07 15:51:04 11994604 [Warning] WSREP: Send action {(nil), 139603616989752, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2023-07-07 15:51:04 0 [Note] WSREP: Flow-control interval: [240, 300]
      2023-07-07 15:51:04 0 [Note] WSREP: Received NON-PRIMARY.
      2023-07-07 15:51:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6204240577)
      2023-07-07 15:51:04 11955214 [Warning] WSREP: Send action {(nil), 139599322023456, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11985855 [Warning] WSREP: Send action {(nil), 139603616990584, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11987046 [Warning] WSREP: Send action {(nil), 139599322023328, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11985820 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 46 [Note] WSREP: ================================================
      View:
        id: c3a51458-b6fd-11eb-8a80-eb35c100e72c:6204240577
        status: non-primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(1):
      	0: 6c357751-ce4f-11ed-8d5f-136e7094748b, PIXID-MDB-MASTER2
      =================================================
      2023-07-07 15:51:04 46 [Note] WSREP: Non-primary view
      2023-07-07 15:51:04 46 [Note] WSREP: Server status change synced -> connected
      2023-07-07 15:51:04 46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2023-07-07 15:51:04 11997537 [Warning] WSREP: Send action {(nil), 139603616989760, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11996647 [Warning] WSREP: Send action {(nil), 139573552218680, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11986459 [Warning] WSREP: Send action {(nil), 139736760976944, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11997986 [Warning] WSREP: Send action {(nil), 139599322023552, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11985505 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11988311 [Warning] WSREP: Send action {(nil), 139607911957872, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11917365 [Warning] WSREP: Send action {(nil), 139586437121400, WRITESET} returned -107 (Transport endpoint is not connected)
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
      2023-07-07 15:51:06 11985895 [Warning] WSREP: Send action {(nil), 139590732088096, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2023-07-07 15:51:06 0 [Note] WSREP: Flow-control interval: [240, 300]
      2023-07-07 15:51:06 0 [Note] WSREP: Received NON-PRIMARY.
      2023-07-07 15:51:06 11978506 [Warning] WSREP: Send action {(nil), 139599322023472, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11997530 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988302 [Warning] WSREP: Send action {(nil), 139599322023760, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988512 [Warning] WSREP: Send action {(nil), 139736760977344, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988290 [Warning] WSREP: Send action {(nil), 139595027055344, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11998006 [Warning] WSREP: Send action {(nil), 139599322023888, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988294 [Warning] WSREP: Send action {(nil), 139603616990632, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11991973 [Warning] WSREP: Send action {(nil), 139599322023712, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988288 [Warning] WSREP: Send action {(nil), 139595027057080, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11985860 [Warning] WSREP: Send action {(nil), 139577847186808, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11997914 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11987069 [Warning] WSREP: Send action {(nil), 139599322023280, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988354 [Warning] WSREP: Send action {(nil), 139736760976760, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11986489 [Warning] WSREP: Send action {(nil), 139564962285120, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11986139 [Warning] WSREP: Send action {(nil), 139582142155592, WRITESET} returned -107 (Transport endpoint is not connected)
      /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
      2023-07-07 15:51:07 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.43448S), skipping check
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
      /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
      2023-07-07 15:51:10 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.93087S), skipping check
      /usr/sbin/mariadbd(_ZN7sp_head15execute_triggerEP3THDPK25st_mysql_const_lex_stringS4_P13st_grant_info+0x1df)[0x5608f58d008f]
      2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 5dda822d-b4c3 tcp://10.10.1.104:4567
      2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 844de70f-8aaf tcp://10.10.1.103:4567
      /usr/sbin/mariadbd(_ZN19Table_triggers_list16process_triggersEP3THD14trg_event_type20trg_action_time_typeb+0x104)[0x5608f5a40ec4]
      2023-07-07 15:51:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.87478S), skipping check
      /usr/sbin/mariadbd(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xd99)[0x5608f5d33da9]
      /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x24c8)[0x5608f597d218]
      2023-07-07 15:51:16 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.90666S), skipping check
      /usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a]
      /usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531]
      2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567
      2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check
      /usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31]
      2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired
      evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) {
      current_view=view(view_id(REG,6c357751-8d5f,51) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      }),
      input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} },
      fifo_seq=1874639154,
      last_sent=2,
      known:
      5dda822d-b4c3 at tcp://10.10.1.104:4567
      {o=1,s=0,i=0,fs=834192942,jm=
      {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=(
      	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
      	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      )
      },
      }
      6c357751-8d5f at 
      {o=1,s=0,i=0,fs=-1,jm=
      {v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=(
      	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
      	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],}
      )
      },
      }
      844de70f-8aaf at tcp://10.10.1.103:4567
      {o=1,s=0,i=0,fs=1475544355,jm=
      {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=(
      	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
      	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      )
      },
      }
      96c49f4b-8727 at tcp://10.10.1.101:4567
      {o=0,s=0,i=0,fs=101154494,}
       }
      2023-07-07 15:51:20 0 [Note] WSREP: no install message received
      2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      	5dda822d-b4c3,0
      	844de70f-8aaf,0
      	96c49f4b-8727,0
      })
      2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300]
      2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY.
      /usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942]
      /usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7]
      2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check
      /usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d]
      2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected)
      /usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2]
      /lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5]
      /lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]
       
      Trying to get some variables.
      Some pointers may be invalid and cause the dump to abort.
      Query (0x7ef692470578): UPDATE failed_schema.failed_table
      			SET TYPE_EVE =  NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE =  NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL  WHERE EVT_ID =  NAME_CONST('V_EVT_ID',2051583478)
       
      Connection ID (thread ID): 11985593
      Status: NOT_KILLED
       
      Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
       
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /data/mysql
      Resource Limits:
      Limit                     Soft Limit           Hard Limit           Units     
      Max cpu time              unlimited            unlimited            seconds   
      Max file size             unlimited            unlimited            bytes     
      Max data size             unlimited            unlimited            bytes     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        0                    unlimited            bytes     
      Max resident set          unlimited            unlimited            bytes     
      Max processes             805978               805978               processes 
      Max open files            1048576              1048576              files     
      Max locked memory         65536                65536                bytes     
      Max address space         unlimited            unlimited            bytes     
      Max file locks            unlimited            unlimited            locks     
      Max pending signals       805978               805978               signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Max realtime timeout      unlimited            unlimited            us        
      Core pattern: core
      

          Activity

            janlindstrom Jan Lindström added a comment -

            rpizzi Can we have the error logs from the other nodes, to determine why they could not continue as a cluster? For the node that is crashing there is nothing we can do.
            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            Can't you just immediately signal the WSREP threads?

            janlindstrom Jan Lindström added a comment -

            rpizzi Very well, but the other nodes should not drop from the cluster even if one of the nodes crashes. Do you have the error logs from the other nodes? I am looking for the reason why they dropped from the cluster.

            rpizzi Rick Pizzi (Inactive) added a comment -

            As I already explained, this was on a production system and the logs are long gone.
            I guess we can only wait for another occurrence of the issue.
            ramesh Ramesh Sivaraman added a comment - - edited

            janlindstrom Reproduced the cluster inconsistency using an RQG data load. The active nodes become unstable when one of the nodes in the cluster is forcefully killed while the RQG data load is running.
            The cluster became inconsistent, but the server did not crash as described in the issue description.
            Test case:
            1) Started a 3-node cluster
            2) Initiated an RQG run on node1 and node2
            3) Forcefully killed node2

            Node1 is disconnected from the cluster and Node3 loses its primary status. Error logs from the cluster are in logs.tar.gz.

            Node1

            node1:root@localhost> show status like '%wsrep%st%';
            +------------------------------+--------------------------------------+
            | Variable_name                | Value                                |
            +------------------------------+--------------------------------------+
            | wsrep_local_state_uuid       | 00000000-0000-0000-0000-000000000000 |
            | wsrep_last_committed         | -1                                   |
            | wsrep_flow_control_requested | false                                |
            | wsrep_cert_deps_distance     | 41.6348                              |
            | wsrep_local_state            | 5                                    |
            | wsrep_local_state_comment    | Inconsistent                         |
            | wsrep_cluster_capabilities   |                                      |
            | wsrep_cluster_conf_id        | 18446744073709551615                 |
            | wsrep_cluster_size           | 0                                    |
            | wsrep_cluster_state_uuid     | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
            | wsrep_cluster_status         | Disconnected                         |
            +------------------------------+--------------------------------------+
            11 rows in set (0.001 sec)
            

            Node3

            node3:root@localhost> show status like '%wsrep%st%';
            +------------------------------+--------------------------------------+
            | Variable_name                | Value                                |
            +------------------------------+--------------------------------------+
            | wsrep_local_state_uuid       | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
            | wsrep_last_committed         | 19996                                |
            | wsrep_flow_control_requested | false                                |
            | wsrep_cert_deps_distance     | 26.4544                              |
            | wsrep_local_state            | 0                                    |
            | wsrep_local_state_comment    | Initialized                          |
            | wsrep_cluster_weight         | 0                                    |
            | wsrep_evs_evict_list         |                                      |
            | wsrep_evs_state              | OPERATIONAL                          |
            | wsrep_gmcast_segment         | 0                                    |
            | wsrep_cluster_capabilities   |                                      |
            | wsrep_cluster_conf_id        | 18446744073709551615                 |
            | wsrep_cluster_size           | 1                                    |
            | wsrep_cluster_state_uuid     | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
            | wsrep_cluster_status         | non-Primary                          |
            +------------------------------+--------------------------------------+
            15 rows in set (0.001 sec)
             
            node3:root@localhost> 
            

            janlindstrom Jan Lindström added a comment - - edited

            Looked at the error logs: node_1 drops from the cluster because the applier gets an error and tries to do error voting:

            2023-08-21 13:32:24 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.table30_int_autoinc; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 242, Internal MariaDB error code: 1213
            2023-08-21 13:32:24 0 [Note] WSREP: Member 0(galapq) initiates vote on e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939,89ae2f2481c15ba0:  Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213;
            2023-08-21 13:32:24 8 [Note] WSREP: wsrep_before_commit: 1, 4949
            2023-08-21 13:32:24 6 [Note] WSREP: wsrep_commit_empty for 6 client_state exec client_mode high priority trans_state executing sql NULL
            2023-08-21 13:32:24 7 [Note] WSREP: wsrep_before_commit: 1, 4947
            2023-08-21 13:32:24 0 [Note] WSREP: Votes over e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939:
               0000000000000000:   2/3
               89ae2f2481c15ba0:   1/3
            Winner: 0000000000000000
            2023-08-21 13:32:24 9 [Note] WSREP: assigned new next trx id: 15048
            2023-08-21 13:32:24 6 [Note] WSREP: assigned new next trx id: 15049
            2023-08-21 13:32:24 2 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939
            	 at /test/galera_4x_opt/galera/src/replicator_smm.cpp:process_apply_error():1357
            

            The last node leaves the cluster because its weight is not big enough.


            serg Sergei Golubchik added a comment -

            Why can this cause nodes to no longer be consistent?

            serg Sergei Golubchik added a comment -

            Why would the other nodes be out of sync with each other? They both received the write set, they certify it and apply it; where is the inconsistency here?

            serg Sergei Golubchik added a comment -

            janlindstrom, but this means that a crash of one node can make the whole cluster unusable; where is the HA in that?

            Maybe node_2 shouldn't apply a write set until all nodes have received it. Maybe node_3 can get it from node_2. But it has to be fixed somehow, otherwise I don't know how one can claim that Galera Cluster provides HA.

            janlindstrom Jan Lindström added a comment -

            serg I think I need to dig more, because you are correct that the other nodes should be able to continue normally.

            janlindstrom Jan Lindström added a comment -

            rpizzi I tried to reproduce this with 10.6 using a 3-node cluster, a simple database with 100k rows, and then 2 connections doing inserts and 2 connections doing updates. From another connection I then triggered a crash inside ha_innobase::write_row() on node_2. The remaining nodes node_1 and node_3 stayed in primary state. Is there something special about the node configuration that I should know?

            Crash instrumentation:

             jan@jan-HP-ZBook-15u-G5:~/work/mariadb/10.6$ git diff
            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index b440613c13f..e6b90f02279 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -7844,6 +7844,12 @@ ha_innobase::write_row(
             
                    trx_t*          trx = thd_to_trx(m_user_thd);
             
            +#ifdef WITH_WSREP
            +        DBUG_EXECUTE_IF("wsrep_force_assert",
            +                       assert(0);
            +       );
            +#endif
            +
                    /* Validation checks before we commence write_row operation. */
                    if (is_read_only()) {
                            DBUG_RETURN(HA_ERR_TABLE_READONLY);
            

            How to enable it:

            SET debug_dbug = '+d,wsrep_force_assert'; call insert_t1(2000);
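
            For completeness, insert_t1 above is just a simple load procedure; a minimal hypothetical sketch (table and column names are assumed, not taken from the actual test) would be:

            -- hypothetical helper: insert n rows so that ha_innobase::write_row() is exercised
            CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(200)) ENGINE=InnoDB;
            DELIMITER //
            CREATE PROCEDURE insert_t1(IN n INT)
            BEGIN
              DECLARE i INT DEFAULT 0;
              WHILE i < n DO
                INSERT INTO t1 (b) VALUES (REPEAT('x', 100));
                SET i = i + 1;
              END WHILE;
            END//
            DELIMITER ;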
            

            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            The only thing that comes to mind is that this is a 4-node cluster with node 4 having pc.weight=0. Not sure this makes any difference.
            Also, when testing, you should actually simulate what happened in production, i.e. have an InnoDB assertion failure due to a corrupted index.
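
            For reference, that weight override is just the Galera provider option on node 4, shown here as the equivalent runtime statement (a sketch; it can equally be set via wsrep_provider_options in the node's configuration file):

            -- on node 4 only: exclude this node from quorum calculations
            SET GLOBAL wsrep_provider_options='pc.weight=0';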


            janlindstrom Jan Lindström added a comment -

            rpizzi In my understanding pc.weight=0 is not a good choice here, because it means that if one node goes down, the rest of the nodes in the cluster will lose Primary status. See https://galeracluster.com/library/documentation/weighted-quorum.html

            The InnoDB index corruption is most likely not caused by Galera and requires additional investigation. The stack trace is quite limited for this, but it is out of scope here anyway.

            rpizzi Rick Pizzi (Inactive) added a comment -

            It is the opposite: weight=0 means the node does not participate in quorum, and its online/offline status does not impact the quorum calculation.
            This ticket is not about finding the source of the index corruption. We need to find out why all nodes went non-primary when this happened.

            Thanks,
            Rick
            janlindstrom Jan Lindström added a comment - - edited

            rpizzi I tested this on 10.6 with a 4-node cluster where I set pc.weight=0 on all nodes. Then I used mysqladmin to shut down one of the nodes. All of the remaining nodes went non-Primary, as the documentation hints. It appears that a weight of 0 contributes no weight to the primary component, so such a cluster ends up in split-brain after one node drops out.
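
            That matches the weighted-quorum rule in the documentation linked above, as I read it: a component stays Primary only if its summed weight strictly exceeds half the weight of the previous Primary Component (minus nodes that left gracefully). With pc.weight=0 on every node that check can never pass:

            previous PC weight:          0 + 0 + 0 + 0 = 0
            surviving component weight:  0 + 0 + 0     = 0
            quorum check:                0 > 0.5 * 0   =>  false  =>  non-Primary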

            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            You cannot set pc.weight to 0 on all nodes.
            As I said, only node 4 had 0, so that the quorum calculation would ignore that node.
            Please try and test accordingly.

            Thanks
            Rick


            janlindstrom Jan Lindström added a comment -

            rpizzi Thanks for pointing that out. I tried again with a 4-node cluster on 10.6 where only node_4 has pc.weight=0, and with the assertion in the same place as reported. However, I could not reproduce the problem of the other nodes dropping from Primary state.

            Node_2 fails in exactly the same place:

            mysys/stacktrace.c:215(my_print_stacktrace)[0x562347b46973]
            sql/signal_handler.cc:241(handle_fatal_signal)[0x5623471ee0cb]
            libc_sigaction.c:0(__restore_rt)[0x7f94d6e3c4b0]
            nptl/pthread_kill.c:44(__pthread_kill_implementation)[0x7f94d6e90ffb]
            posix/raise.c:27(__GI_raise)[0x7f94d6e3c406]
            stdlib/abort.c:81(__GI_abort)[0x7f94d6e2287c]
            intl/loadmsgcat.c:1177(_nl_load_domain)[0x7f94d6e2279b]
            /lib/x86_64-linux-gnu/libc.so.6(+0x33b86)[0x7f94d6e33b86]
            page/page0zip.cc:4216(page_zip_dir_insert(page_cur_t*, unsigned short, unsigned char*, mtr_t*))[0x5623477bba08]
            page/page0cur.cc:2143(page_cur_insert_rec_zip(page_cur_t*, unsigned char const*, unsigned short*, mtr_t*))[0x56234779459a]
            include/page0cur.inl:195(page_cur_tuple_insert(page_cur_t*, dtuple_t const*, unsigned short**, mem_block_info_t**, unsigned long, mtr_t*))[0x56234793010c]
            btr/btr0cur.cc:2491(btr_cur_optimistic_insert(unsigned long, btr_cur_t*, unsigned short**, mem_block_info_t**, dtuple_t*, unsigned char**, big_rec_t**, unsigned long, que_thr_t*, mtr_t*))[0x56234793ba1f]
            row/row0ins.cc:2852(row_ins_clust_index_entry_low(unsigned long, btr_latch_mode, dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*))[0x562347812880]
            row/row0ins.cc:3242(row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, unsigned long))[0x562347813dd2]
            row/row0ins.cc:3368(row_ins_index_entry(dict_index_t*, dtuple_t*, que_thr_t*))[0x56234781436a]
            row/row0ins.cc:3536(row_ins_index_entry_step(ins_node_t*, que_thr_t*))[0x562347814cc2]
            row/row0ins.cc:3661(row_ins(ins_node_t*, que_thr_t*))[0x5623478151f4]
            row/row0ins.cc:3790(row_ins_step(que_thr_t*))[0x5623478159e9]
            row/row0mysql.cc:1317(row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t))[0x56234783803d]
            handler/ha_innodb.cc:7907(ha_innobase::write_row(unsigned char const*))[0x5623476550d1]
            sql/handler.cc:7639(handler::ha_write_row(unsigned char const*))[0x562347208bde]
            sql/sql_insert.cc:2166(write_record(THD*, TABLE*, st_copy_info*, select_result*))[0x562346dd4a08]
            sql/sql_insert.cc:1131(mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*))[0x562346dd1467]
            sql/sql_parse.cc:4580(mysql_execute_command(THD*, bool))[0x562346e28fbe]
            sql/sp_head.cc:3843(sp_instr_stmt::exec_core(THD*, unsigned int*))[0x562346d1e3cb]
            sql/sp_head.cc:3568(sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*))[0x562346d1d69d]
            sql/sp_head.cc:3749(sp_instr_stmt::execute(THD*, unsigned int*))[0x562346d1df53]
            sql/sp_head.cc:1442(sp_head::execute(THD*, bool))[0x562346d17047]
            sql/sp_head.cc:2485(sp_head::execute_procedure(THD*, List<Item>*))[0x562346d19fac]
            sql/sql_parse.cc:3036(do_execute_sp(THD*, sp_head*))[0x562346e23bf7]
            sql/sql_parse.cc:3282(Sql_cmd_call::execute(THD*))[0x562346e2488a]
            sql/sql_parse.cc:6024(mysql_execute_command(THD*, bool))[0x562346e2ebfa]
            sql/sql_parse.cc:8048(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x562346e34f78]
            sql/sql_parse.cc:7871(wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x562346e3462a]
            sql/sql_parse.cc:1883(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x562346e20776]
            sql/sql_parse.cc:1409(do_command(THD*, bool))[0x562346e1f1b2]
            sql/sql_connect.cc:1416(do_handle_one_connection(CONNECT*, bool))[0x562346ff7cfe]
            sql/sql_connect.cc:1320(handle_one_connection)[0x562346ff7a67]
            perfschema/pfs.cc:2203(pfs_spawn_thread)[0x5623475662a2]
            nptl/pthread_create.c:444(start_thread)[0x7f94d6e8f18a]
            x86_64/clone3.S:83(clone3)[0x7f94d6f1dbd0]
            

            From node_1

            mysql> show status like 'wsrep%';
            --------------
            show status like 'wsrep%'
            --------------
             
            +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
            | Variable_name                 | Value                                                                                                                                          |
            +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
            | wsrep_local_state_uuid        | 5f8f1ef8-4d59-11ee-b03b-1edf34a26753                                                                                                           |
            | wsrep_protocol_version        | 10                                                                                                                                             |
            | wsrep_last_committed          | 115233                                                                                                                                         |
            | wsrep_replicated              | 110496                                                                                                                                         |
            | wsrep_replicated_bytes        | 35343800                                                                                                                                       |
            | wsrep_repl_keys               | 331483                                                                                                                                         |
            | wsrep_repl_keys_bytes         | 5303768                                                                                                                                        |
            | wsrep_repl_data_bytes         | 22387370                                                                                                                                       |
            | wsrep_repl_other_bytes        | 0                                                                                                                                              |
            | wsrep_received                | 7977                                                                                                                                           |
            | wsrep_received_bytes          | 1563770                                                                                                                                        |
            | wsrep_local_commits           | 110493                                                                                                                                         |
            | wsrep_local_cert_failures     | 0                                                                                                                                              |
            | wsrep_local_replays           | 0                                                                                                                                              |
            | wsrep_local_send_queue        | 0                                                                                                                                              |
            | wsrep_local_send_queue_max    | 2                                                                                                                                              |
            | wsrep_local_send_queue_min    | 0                                                                                                                                              |
            | wsrep_local_send_queue_avg    | 1.79535e-05                                                                                                                                    |
            | wsrep_local_recv_queue        | 0                                                                                                                                              |
            | wsrep_local_recv_queue_max    | 7                                                                                                                                              |
            | wsrep_local_recv_queue_min    | 0                                                                                                                                              |
            | wsrep_local_recv_queue_avg    | 0.0208098                                                                                                                                      |
            | wsrep_local_cached_downto     | 84794                                                                                                                                          |
            | wsrep_flow_control_paused_ns  | 11834047189                                                                                                                                    |
            | wsrep_flow_control_paused     | 0.0199474                                                                                                                                      |
            | wsrep_flow_control_sent       | 0                                                                                                                                              |
            | wsrep_flow_control_recv       | 1                                                                                                                                              |
            | wsrep_flow_control_active     | false                                                                                                                                          |
            | wsrep_flow_control_requested  | false                                                                                                                                          |
            | wsrep_cert_deps_distance      | 93.4516                                                                                                                                        |
            | wsrep_apply_oooe              | 0.069626                                                                                                                                       |
            | wsrep_apply_oool              | 0.00321965                                                                                                                                     |
            | wsrep_apply_window            | 1.09509                                                                                                                                        |
            | wsrep_apply_waits             | 0                                                                                                                                              |
            | wsrep_commit_oooe             | 0                                                                                                                                              |
            | wsrep_commit_oool             | 0                                                                                                                                              |
            | wsrep_commit_window           | 1.00368                                                                                                                                        |
            | wsrep_local_state             | 4                                                                                                                                              |
            | wsrep_local_state_comment     | Synced                                                                                                                                         |
            | wsrep_cert_index_size         | 93                                                                                                                                             |
            | wsrep_causal_reads            | 11                                                                                                                                             |
            | wsrep_cert_interval           | 0.10328                                                                                                                                        |
            | wsrep_open_transactions       | 2                                                                                                                                              |
            | wsrep_open_connections        | 0                                                                                                                                              |
            | wsrep_incoming_addresses      | 127.0.0.1:16020,127.0.0.1:16022,127.0.0.1:16023                                                                                                |
            | wsrep_cluster_weight          | 2                                                                                                                                              |
            | wsrep_debug_sync_waiters      |                                                                                                                                                |
            | wsrep_desync_count            | 0                                                                                                                                              |
            | wsrep_evs_delayed             |                                                                                                                                                |
            | wsrep_evs_evict_list          |                                                                                                                                                |
            | wsrep_evs_repl_latency        | 0.000234768/0.000425666/0.0120175/0.000450524/681                                                                                              |
            | wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
            | wsrep_gcomm_uuid              | 5f8e47b5-4d59-11ee-82e6-43df7302848a                                                                                                           |
            | wsrep_gmcast_segment          | 0                                                                                                                                              |
            | wsrep_applier_thread_count    | 4                                                                                                                                              |
            | wsrep_cluster_capabilities    |                                                                                                                                                |
            | wsrep_cluster_conf_id         | 3                                                                                                                                              |
            | wsrep_cluster_size            | 3                                                                                                                                              |
            | wsrep_cluster_state_uuid      | 5f8f1ef8-4d59-11ee-b03b-1edf34a26753                                                                                                           |
            | wsrep_cluster_status          | Primary                                                                                                                                        |
            | wsrep_connected               | ON                                                                                                                                             |
            | wsrep_local_bf_aborts         | 0                                                                                                                                              |
            | wsrep_local_index             | 0                                                                                                                                              |
            | wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
            | wsrep_provider_name           | Galera                                                                                                                                         |
            | wsrep_provider_vendor         | Codership Oy <info@codership.com>                                                                                                              |
            | wsrep_provider_version        | 26.4.14(r75464733)                                                                                                                             |
            | wsrep_ready                   | ON                                                                                                                                             |
            | wsrep_rollbacker_thread_count | 1                                                                                                                                              |
            | wsrep_thread_count            | 5                                                                                                                                              |
            +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
            70 rows in set (0,00 sec)
            

            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            I rechecked the logs of this failure. This happened on node 2.
            It appears that after the assertion, the asserting thread took a VERY long time to dump the stack, and a core file also had to be generated after that.

            See below for the sequence; you can clearly see that while the asserting thread is dumping the stack, wsrep is still talking to the other nodes.
            Hope this will help. Maybe you should enable core-file and see if that makes a difference.
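            For reference, the "Resource Limits" section further down in this log shows a soft core file size limit of 0, i.e. core dumps were disabled on this node. A minimal sketch of how core dumps could be enabled for mariadbd on a systemd-managed Linux host follows; the file paths and the core pattern below are only illustrative and are not taken from this installation:

                # option file (illustrative path): ask mariadbd to write a core on fatal signals
                # /etc/my.cnf.d/core.cnf
                [mariadbd]
                core-file

                # systemd drop-in (illustrative path): lift the core size limit for the service
                # /etc/systemd/system/mariadb.service.d/core.conf
                [Service]
                LimitCORE=infinity

                # kernel side: choose where core files are written (example pattern only)
                sysctl -w kernel.core_pattern=/var/crash/core.%e.%p

            With the core limit raised, the "Writing a core file..." step at the end of the crash handler can actually produce a dump, which would make it possible to check whether core writing changes the timing described above.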

            : NO)
            2023-07-07 15:50:19 11985573 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-07-07 15:50:19 11985573 [ERROR] mariadbd: Index for table 'MAJ_EVENEMENTS_RAPPROCHEMENT' is corrupt; try to repair it
            2023-07-07 15:50:20 0x7f17f7766700  InnoDB: Assertion failure in file /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/page/page0zip.cc line 4213
            InnoDB: Failing assertion: slot_rec
            InnoDB: We intentionally generate a memory trap.
            InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
            InnoDB: If you get repeated assertion failures or crashes, even
            InnoDB: immediately after the mariadbd startup, there may be
            InnoDB: corruption in the InnoDB tablespace. Please refer to
            InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
            InnoDB: about forcing recovery.
            230707 15:50:20 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed, 
            something is definitely wrong and this may fail.
             
            Server version: 10.6.12-7-MariaDB-enterprise-log source revision: 8e2b75dad28995ab5f6e6acd436135420f7031c9
            key_buffer_size=268435456
            read_buffer_size=131072
            max_used_connections=2243
            max_threads=6002
            thread_count=1565
            It is possible that mysqld could use up to 
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13479553 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x7ef735ab51c8
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x7f17f7765cb8 thread_stack 0x49000
            Can't start addr2line
            /usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5608f6116c7e]
            /usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5608f5bc33a5]
            /lib64/libpthread.so.0(+0xf630)[0x7f2011c15630]
            /lib64/libc.so.6(gsignal+0x37)[0x7f2011060387]
            /lib64/libc.so.6(abort+0x148)[0x7f2011061a78]
            /usr/sbin/mariadbd(+0x694d97)[0x5608f5834d97]
            /usr/sbin/mariadbd(+0xdbfb05)[0x5608f5f5fb05]
            /usr/sbin/mariadbd(+0xdaf516)[0x5608f5f4f516]
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 766 rttvar: 579 rto: 201000 lost: 0 last_data_recv: 2567 cwnd: 10 last_queued_since: 8776161264 last_delivered_since: 11959172679 send_queue_length: 9 send_queue_bytes: 720 segment: 0 messages: 9
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.103:4567
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 6407 rttvar: 10277 rto: 207000 lost: 0 last_data_recv: 5900 cwnd: 10 last_queued_since: 307819 last_delivered_since: 8781038225 send_queue_length: 10 send_queue_bytes: 1080 segment: 0 messages: 10
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.104:42156
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 5591 rttvar: 10094 rto: 206000 lost: 0 last_data_recv: 5924 cwnd: 10 last_queued_since: 10916 last_delivered_since: 8781705783 send_queue_length: 11 send_queue_bytes: 1292 segment: 0 messages: 11
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.101:33510
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.1.101:4567 tcp://10.10.1.103:4567 tcp://10.10.1.104:4567 
            2023-07-07 15:50:32 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT12.0655S), skipping check
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7f17e8905e58
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7ef49da77b98
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
            /usr/sbin/mariadbd(+0xe62a65)[0x5608f6002a65]
            2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 96c49f4b-8727 (tcp://10.10.1.101:4567), attempt 0
            2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 844de70f-8aaf (tcp://10.10.1.103:4567), attempt 0
            2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 5dda822d-b4c3 (tcp://10.10.1.104:4567), attempt 0
            2023-07-07 15:50:33 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.73771S), skipping check
            /usr/sbin/mariadbd(+0xe4e639)[0x5608f5fee639]
            /usr/sbin/mariadbd(+0xe5063b)[0x5608f5ff063b]
            /usr/sbin/mariadbd(+0xe62e98)[0x5608f6002e98]
            /usr/sbin/mariadbd(+0xde0227)[0x5608f5f80227]
            /usr/sbin/mariadbd(+0xde2da4)[0x5608f5f82da4]
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 1000 rttvar: 500 rto: 201000 lost: 0 last_data_recv: 125408244 cwnd: 10 last_queued_since: 4421911460 last_delivered_since: 8715642835357354 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 359 rttvar: 179 rto: 201000 lost: 0 last_data_recv: 7528 cwnd: 10 last_queued_since: 120133 last_delivered_since: 120133 send_queue_length: 0 send_queue_bytes: 0
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 437 rttvar: 218 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 4131087 last_delivered_since: 4131087 send_queue_length: 0 send_queue_bytes: 0
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 181 rttvar: 90 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 8715642839939685 last_delivered_since: 8715642839939685 send_queue_length: 0 send_queue_bytes: 0
            2023-07-07 15:50:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.53388S), skipping check
            /usr/sbin/mariadbd(+0xe151ab)[0x5608f5fb51ab]
            2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer handle_wait Success for 0x7f17ebc5b168
            2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer destruct
            2023-07-07 15:50:43 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.61329S), skipping check
            /usr/sbin/mariadbd(+0xe15869)[0x5608f5fb5869]
            2023-07-07 15:50:44 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.63124S), skipping check
            2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 5dda822d-b4c3
            2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 844de70f-8aaf
            2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 96c49f4b-8727
            /usr/sbin/mariadbd(+0xdf28b2)[0x5608f5f928b2]
            /usr/sbin/mariadbd(+0xd43ca8)[0x5608f5ee3ca8]
            /usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x232)[0x5608f5bd12b2]
            /usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1a63)[0x5608f5a5cf33]
            /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x263e)[0x5608f597d38e]
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
            /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
            /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125424590 cwnd: 10 last_queued_since: 11598743653 last_delivered_since: 8715659181935888 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 185 rttvar: 82 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11598929590 last_delivered_since: 11598962474 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 264 rttvar: 105 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11599134047 last_delivered_since: 11599140957 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 384 rttvar: 151 rto: 201000 lost: 0 last_data_recv: 5678 cwnd: 10 last_queued_since: 11599486516 last_delivered_since: 11599494477 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            /usr/sbin/mariadbd(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x66a)[0x5608f58d093a]
            /usr/sbin/mariadbd(+0x7cfc17)[0x5608f596fc17]
            /usr/sbin/mariadbd(+0x7d3a68)[0x5608f5973a68]
            /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x101a)[0x5608f597bd6a]
            2023-07-07 15:51:04 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125430994 cwnd: 10 last_queued_since: 221024 last_delivered_since: 8715665585967344 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:51:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT18.0029S), skipping check
            2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,5dda822d-b4c3,50) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            	5dda822d-b4c3,0
            	844de70f-8aaf,0
            	96c49f4b-8727,0
            })
            2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,51) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            	5dda822d-b4c3,0
            	844de70f-8aaf,0
            	96c49f4b-8727,0
            })
            2023-07-07 15:51:04 11994604 [Warning] WSREP: Send action {(nil), 139603616989752, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            2023-07-07 15:51:04 0 [Note] WSREP: Flow-control interval: [240, 300]
            2023-07-07 15:51:04 0 [Note] WSREP: Received NON-PRIMARY.
            2023-07-07 15:51:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6204240577)
            2023-07-07 15:51:04 11955214 [Warning] WSREP: Send action {(nil), 139599322023456, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11985855 [Warning] WSREP: Send action {(nil), 139603616990584, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11987046 [Warning] WSREP: Send action {(nil), 139599322023328, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11985820 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 46 [Note] WSREP: ================================================
            View:
              id: c3a51458-b6fd-11eb-8a80-eb35c100e72c:6204240577
              status: non-primary
              protocol_version: 4
              capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
              final: no
              own_index: 0
              members(1):
            	0: 6c357751-ce4f-11ed-8d5f-136e7094748b, PIXID-MDB-MASTER2
            =================================================
            2023-07-07 15:51:04 46 [Note] WSREP: Non-primary view
            2023-07-07 15:51:04 46 [Note] WSREP: Server status change synced -> connected
            2023-07-07 15:51:04 46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            2023-07-07 15:51:04 11997537 [Warning] WSREP: Send action {(nil), 139603616989760, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11996647 [Warning] WSREP: Send action {(nil), 139573552218680, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11986459 [Warning] WSREP: Send action {(nil), 139736760976944, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11997986 [Warning] WSREP: Send action {(nil), 139599322023552, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11985505 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11988311 [Warning] WSREP: Send action {(nil), 139607911957872, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11917365 [Warning] WSREP: Send action {(nil), 139586437121400, WRITESET} returned -107 (Transport endpoint is not connected)
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
            2023-07-07 15:51:06 11985895 [Warning] WSREP: Send action {(nil), 139590732088096, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            2023-07-07 15:51:06 0 [Note] WSREP: Flow-control interval: [240, 300]
            2023-07-07 15:51:06 0 [Note] WSREP: Received NON-PRIMARY.
            2023-07-07 15:51:06 11978506 [Warning] WSREP: Send action {(nil), 139599322023472, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11997530 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988302 [Warning] WSREP: Send action {(nil), 139599322023760, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988512 [Warning] WSREP: Send action {(nil), 139736760977344, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988290 [Warning] WSREP: Send action {(nil), 139595027055344, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11998006 [Warning] WSREP: Send action {(nil), 139599322023888, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988294 [Warning] WSREP: Send action {(nil), 139603616990632, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11991973 [Warning] WSREP: Send action {(nil), 139599322023712, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988288 [Warning] WSREP: Send action {(nil), 139595027057080, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11985860 [Warning] WSREP: Send action {(nil), 139577847186808, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11997914 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11987069 [Warning] WSREP: Send action {(nil), 139599322023280, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988354 [Warning] WSREP: Send action {(nil), 139736760976760, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11986489 [Warning] WSREP: Send action {(nil), 139564962285120, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11986139 [Warning] WSREP: Send action {(nil), 139582142155592, WRITESET} returned -107 (Transport endpoint is not connected)
            /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
            2023-07-07 15:51:07 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.43448S), skipping check
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
            /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
            2023-07-07 15:51:10 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.93087S), skipping check
            /usr/sbin/mariadbd(_ZN7sp_head15execute_triggerEP3THDPK25st_mysql_const_lex_stringS4_P13st_grant_info+0x1df)[0x5608f58d008f]
            2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 5dda822d-b4c3 tcp://10.10.1.104:4567
            2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 844de70f-8aaf tcp://10.10.1.103:4567
            /usr/sbin/mariadbd(_ZN19Table_triggers_list16process_triggersEP3THD14trg_event_type20trg_action_time_typeb+0x104)[0x5608f5a40ec4]
            2023-07-07 15:51:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.87478S), skipping check
            /usr/sbin/mariadbd(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xd99)[0x5608f5d33da9]
            /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x24c8)[0x5608f597d218]
            2023-07-07 15:51:16 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.90666S), skipping check
            /usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a]
            /usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531]
            2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567
            2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check
            /usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31]
            2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired
            evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) {
            current_view=view(view_id(REG,6c357751-8d5f,51) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            }),
            input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} },
            fifo_seq=1874639154,
            last_sent=2,
            known:
            5dda822d-b4c3 at tcp://10.10.1.104:4567
            {o=1,s=0,i=0,fs=834192942,jm=
            {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=(
            	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
            	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            )
            },
            }
            6c357751-8d5f at 
            {o=1,s=0,i=0,fs=-1,jm=
            {v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=(
            	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
            	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],}
            )
            },
            }
            844de70f-8aaf at tcp://10.10.1.103:4567
            {o=1,s=0,i=0,fs=1475544355,jm=
            {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=(
            	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
            	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            )
            },
            }
            96c49f4b-8727 at tcp://10.10.1.101:4567
            {o=0,s=0,i=0,fs=101154494,}
             }
            2023-07-07 15:51:20 0 [Note] WSREP: no install message received
            2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            	5dda822d-b4c3,0
            	844de70f-8aaf,0
            	96c49f4b-8727,0
            })
            2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300]
            2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY.
            /usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942]
            /usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7]
            2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check
            /usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d]
            2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected)
            /usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2]
            /lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5]
            /lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]
             
            Trying to get some variables.
            Some pointers may be invalid and cause the dump to abort.
            Query (0x7ef692470578): UPDATE DWHTmp.MAJ_EVENEMENTS_RAPPROCHEMENT
            			SET TYPE_EVE =  NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE =  NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL  WHERE EVT_ID =  NAME_CONST('V_EVT_ID',2051583478)
             
            Connection ID (thread ID): 11985593
            Status: NOT_KILLED
             
            Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
             
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
            information that should help you find out what is causing the crash.
            Writing a core file...
            Working directory at /data/mysql
            Resource Limits:
            Limit                     Soft Limit           Hard Limit           Units     
            Max cpu time              unlimited            unlimited            seconds   
            Max file size             unlimited            unlimited            bytes     
            Max data size             unlimited            unlimited            bytes     
            Max stack size            8388608              unlimited            bytes     
            Max core file size        0                    unlimited            bytes     
            Max resident set          unlimited            unlimited            bytes     
            Max processes             805978               805978               processes 
            Max open files            1048576              1048576              files     
            Max locked memory         65536                65536                bytes     
            Max address space         unlimited            unlimited            bytes     
            Max file locks            unlimited            unlimited            locks     
            Max pending signals       805978               805978               signals   
            Max msgqueue size         819200               819200               bytes     
            Max nice priority         0                    0                    
            Max realtime priority     0                    0                    
            Max realtime timeout      unlimited            unlimited            us        
            Core pattern: core
             
            Kernel version: Linux version 3.10.0-1160.88.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Mar 7 15:41:52 UTC 2023
             
            2023-07-07 17:08:23 0 [Note] Starting MariaDB 10.6.12-7-MariaDB-enterprise-log source revision 8e2b75dad28995ab5f6e6acd436135420f7031c9 as process 1083
            

PT1.5S ago (PT2.90666S), skipping check /usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a] /usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531] 2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567 2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check /usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31] 2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) { current_view=view(view_id(REG,6c357751-8d5f,51) memb { 6c357751-8d5f,0 } joined { } left { } partitioned { }), input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} }, fifo_seq=1874639154, last_sent=2, known: 5dda822d-b4c3 at tcp://10.10.1.104:4567 {o=1,s=0,i=0,fs=834192942,jm= {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=( 5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],} 844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} ) }, } 6c357751-8d5f at {o=1,s=0,i=0,fs=-1,jm= {v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=( 5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],} 844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],} ) }, } 844de70f-8aaf at tcp://10.10.1.103:4567 {o=1,s=0,i=0,fs=1475544355,jm= {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=( 5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],} 844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} ) }, } 96c49f4b-8727 at tcp://10.10.1.101:4567 {o=0,s=0,i=0,fs=101154494,} } 2023-07-07 15:51:20 0 [Note] WSREP: no install message received 2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb { 6c357751-8d5f,0 } joined { } left { } partitioned { 5dda822d-b4c3,0 844de70f-8aaf,0 96c49f4b-8727,0 }) 2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300] 2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY. 
/usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942] /usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7] 2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check /usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d] 2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected) /usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2] /lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5] /lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]   Trying to get some variables. Some pointers may be invalid and cause the dump to abort. Query (0x7ef692470578): UPDATE DWHTmp.MAJ_EVENEMENTS_RAPPROCHEMENT SET TYPE_EVE = NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE = NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL WHERE EVT_ID = NAME_CONST('V_EVT_ID',2051583478)   Connection ID (thread ID): 11985593 Status: NOT_KILLED   Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off   The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains information that should help you find out what is causing the crash. Writing a core file... Working directory at /data/mysql Resource Limits: Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 8388608 unlimited bytes Max core file size 0 unlimited bytes Max resident set unlimited unlimited bytes Max processes 805978 805978 processes Max open files 1048576 1048576 files Max locked memory 65536 65536 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 805978 805978 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 Max realtime timeout unlimited unlimited us Core pattern: core   Kernel version: Linux version 3.10.0-1160.88.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Mar 7 15:41:52 UTC 2023   2023-07-07 17:08:23 0 [Note] Starting MariaDB 10.6.12-7-MariaDB-enterprise-log source revision 8e2b75dad28995ab5f6e6acd436135420f7031c9 as process 1083

            janlindstrom Jan Lindström added a comment:

            rpizzi Based on these error logs, the remaining nodes did not agree on the state of the absent node, so they decided to exclude each other from the group. See here:

            	6c357751-8d5f, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
            vs.
            	6c357751-8d5f, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
            vs
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
            

            where o = operational and s = suspected. The first node thinks both are 0 (false), the second thinks the node is suspected, and the last thinks it is still operational. This depends on input and timing.

            The inconsistency issue found by ramesh is a bug (MDEV-32122), but it is not related to the problem here.

            rpizzi Rick Pizzi (Inactive) added a comment (edited):

            Well, this confirms the issue, doesn't it...
            I believe the disagreement comes from the fact that the WSREP layer did not die when the server did, and hence responded to requests from other nodes in an inconsistent manner.


            janlindstrom Jan Lindström added a comment:

            rpizzi Yes. serg Signals that are raised or sent to the process (instead of a specific thread) will still be handled by a random thread (among those that do not block them). So is there any way to make those wsrep threads die faster? On the other nodes there is not much we can do, as their knowledge of the node's state depends on when state information was last received and/or requested, but it would help if the crashing node's threads that talk to the other nodes went down as soon as possible.
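
            (For background, this is standard POSIX behaviour: a signal sent to the process is delivered to an arbitrary thread that has not blocked it. A thread opts out only by masking the signal, roughly as in this generic sketch, which is an illustration and not server code.)

                #include <csignal>
                #include <pthread.h>

                // Generic POSIX illustration, not MariaDB server code: a thread that
                // blocks SIGABRT will never be chosen to run the handler for a
                // process-directed abort; delivery falls to some other thread that
                // leaves the signal unblocked.
                static void block_sigabrt_in_this_thread()
                {
                  sigset_t set;
                  sigemptyset(&set);
                  sigaddset(&set, SIGABRT);
                  pthread_sigmask(SIG_BLOCK, &set, nullptr);
                }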


            monty Michael Widenius added a comment:

            When the server crashes, there is usually not much to do except print a stack trace and call exit().
            In theory we could, on crash, send some kind of signal to the WSREP threads (if they have a THD, then we can mark it as killed).
            Would marking the THD as killed help?
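
            (A minimal sketch of what "mark the THD as killed" could look like, assuming a hypothetical accessor for the wsrep applier THDs; THD::awake() with KILL_CONNECTION is the server's usual kill mechanism, but whether calling it this late in a crash is safe is exactly the open question in this discussion.)

                // Hypothetical sketch only; wsrep_applier_thds() is an illustrative
                // accessor, not an existing server function. THD and KILL_CONNECTION
                // are the server's own types.
                static void wsrep_mark_appliers_killed()
                {
                  for (THD *thd : wsrep_applier_thds())
                  {
                    // Ask the applier to terminate; it notices the killed flag at its
                    // next check point, which may still be too late during a crash.
                    thd->awake(KILL_CONNECTION);
                  }
                }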


            janlindstrom Jan Lindström added a comment:

            Marking the THD as killed would help only for the appliers, not for the thread(s) used inside the Galera library for connections to other nodes.


            rpizzi Rick Pizzi (Inactive) added a comment:

            monty the issue here is that "printing a stack trace" takes, like, 5 minutes.
            In the meantime the WSREP threads are alive and sending inconsistent information to other nodes.


            janlindstrom Jan Lindström added a comment:

            marko serg Is there anything we can do about this, given that we are talking about assertions in release builds? Whatever we do needs to happen before we enter the signal handler and start producing a core dump (if we produce one), as that can take a long time. Currently, maybe we could set the THD to the killed state for the wsrep applier threads (not sure if this is enough). But that still does not mean the node is unreachable from Galera's point of view. There is currently no way to disconnect all incoming/outgoing connections inside Galera from the server code; that would partly solve the issue, but again it depends on timing, i.e. when the other nodes ask about or discover the crashing node's status, and because this is asynchronous there is still a risk that the remaining nodes do not agree on the state of the crashing node (i.e. whether it is operational or suspected).


            serg Sergei Golubchik added a comment:

            janlindstrom, why do the remaining nodes not agree on the state of the crashing node?


            janlindstrom Jan Lindström added a comment:

            serg See my comment of 2023-09-07 10:47: there, one node thinks the node is down, the second thinks it is suspected, and the last thinks it is still operational. This is because nodes decide that a node is down, or suspected to be down, based on the information they have received, and when they receive it is timing dependent. As for why one node thinks the crashing node is still operational: it must be because it received something from that node before (or during) the crash, and the information that the node is down has not yet arrived, or some connection timeout has not yet been reached.

            My question was not how to improve this agreement on node states; it was how to make the crashing node more unreachable, e.g. by killing the appliers and closing all incoming and outgoing connections earlier.


            serg Sergei Golubchik added a comment:

            If the crashing node needs to send three messages, sequentially, to three different nodes, then there will always be a race condition. You kill them earlier, you kill them later; whatever you do, the node won't die instantly as a whole. Galera must be able to cope with it, otherwise any node crash can break the cluster.

            But I don't understand why Galera cannot cope with it. Nodes send messages to peers in a specific order. So if node A with a lower number thinks that some node X is up and node B with a higher number thinks that node X is down, it means that node X crashed after sending a message to A but before sending a message to B. This is easy to detect.


            janlindstrom Jan Lindström added a comment:

            teemu.ollakka Can you explain why the remaining nodes fail to agree on the crashing node's state and start to self-leave?


            marko Marko Mäkelä added a comment:

            The reported assertion failure here occurs when a record is being inserted into a corrupted ROW_FORMAT=COMPRESSED InnoDB page. This crash was not removed in MDEV-13542. An obvious workaround for this particular case would be to avoid using ROW_FORMAT=COMPRESSED tables. Some design mistakes are not easy to fix; see MDEV-30882 and MDEV-31574.


            janlindstrom Jan Lindström added a comment:

            rpizzi I tried to reproduce this issue with the latest 10.6 and Galera library 26.4.17, using the attached test case. After several hours of testing, I still could not reproduce the issue.


            marko Marko Mäkelä added a comment:

            janlindstrom, I see that galera_crash_node.test uses debug injection for crashing a node at a specific point of execution. I think that a more realistic test scenario would be to run CMAKE_BUILD_TYPE=RelWithDebInfo executables and randomly kill one of the cluster nodes externally (by kill -KILL).

            janlindstrom Jan Lindström added a comment (edited):

            I tested again for hours with the following setup:

            • 10.6 commit bde552ae RelWithDebInfo build
            • Galera library 26.4.17 release build
            • 3 node cluster
            • sysbench load to node_1 and node_3 (oltp_read_write)
            • kill -9 node_3 after a while and restart node_3 + sysbench load

            Result: the remaining nodes stayed up and running as expected, i.e. I could not reproduce the issue.


            rpizzi Rick Pizzi (Inactive) added a comment:

            I don't think that killing the node with kill -9 will ever reproduce it.
            As explained, it has to be a code assertion.

            Rick


            rpizzi Rick Pizzi (Inactive) added a comment:

            The whole point of this ticket is that the WSREP layer remains active after the assertion generates the trap.
            Killing the process with SIGKILL will not allow the code to do anything, including executing the trap code.


            marko Marko Mäkelä added a comment:

            janlindstrom, did you test with kill -ABRT as well? I think that it should trigger our built-in stack trace reporter, which, depending on the circumstances, could hang or cause unexpected behaviour.


            janlindstrom Jan Lindström added a comment:

            rpizzi marko Both cases were tested (with several test rounds) and the remaining nodes stayed in the cluster normally, i.e. I could not reproduce a case where all nodes leave the cluster.


            janlindstrom Jan Lindström added a comment:

            teemu.ollakka Can you try to explain why nodes could disagree on the state of the crashing node, and what we could do about wsrep connections and threads in the signal handler?


            marko Marko Mäkelä added a comment:

            It just occurred to me that, according to the error log excerpt in the Description, the Galera node was stuck for more than 60 seconds trying to produce a stack trace of the crashing thread, which in my experience is mostly useless for anything that involves InnoDB or Galera, because it covers only one thread and typically resolves many stack traces incorrectly. Besides, some of the invoked functions would seem to be unsafe according to man 7 signal-safety.

            Would this problem be alleviated by configuring the nodes with --skip-stack-trace so that they fail faster in the event of a fatal error? For post-mortem analysis, core dumps could still be generated independently of this option.
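
            (For example, a node could be configured roughly like this; both settings are standard mariadbd options, shown here only to illustrate the suggestion:)

                [mysqld]
                skip_stack_trace    # fail fast on fatal signals, no in-process backtrace
                core_file           # still write a core dump for post-mortem analysis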


            marko Marko Mäkelä added a comment:

            MDEV-21010 appears to be a very similar bug report about the built-in stack trace reporter.


            janlindstrom Jan Lindström added a comment:

            Fixed in Galera library 26.4.19.


            marko Marko Mäkelä added a comment:

            janlindstrom, can you comment on my observation about the time it took to attempt to produce stack traces in the Description, which apparently ran for 64 seconds between 15:50:20 and 15:51:24? The stack trace output is interleaved with other messages.

            I would think that when a process is killed, all connections are torn down and any peer processes are notified. If an assertion failure causes the process to stop serving requests, while the connection sockets are held open (in a stuck state) for as long as the built-in stack trace reporter is running (and I have seen it actually hang in other cases), then the peer processes could remain blocked for a long time.

            What exactly do you think would be fixed in the Galera library? Would there be some kind of inactivity timeout in the peer process?


            janlindstrom Jan Lindström added a comment:

            marko The Galera library has a method that can be called so that all connections are closed, i.e. the node is isolated from the rest of the cluster. However, the actual server code part is still missing (I did not notice that at first). This works as follows: from the signal handler inside the server code we call this node isolation function to isolate the node from the rest of the cluster.

            I have not been able to reproduce the problem locally; I have tried several times with different methods.
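
            (Conceptually, the server-side part could look like the sketch below. The handler signature is simplified and wsrep_isolate_node_from_cluster() is an illustrative name, not the actual symbol; the real change is in the pull request referenced in the next comment.)

                // Illustrative sketch, not the actual patch: before the crash handler
                // spends time on stack traces and core dumps, ask the Galera provider
                // to drop all group communication, so that peers see the node leave at
                // once instead of talking to a half-dead process.
                extern bool wsrep_enabled;                   // stand-in for the real WSREP_ON check
                void wsrep_isolate_node_from_cluster();      // hypothetical wrapper around the
                                                             // provider's node-isolation call

                extern "C" void handle_fatal_signal_sketch(int sig)
                {
                  if (wsrep_enabled)
                    wsrep_isolate_node_from_cluster();       // isolate as early as possible
                  /* ... then the existing stack trace / core dump handling ... */
                  (void) sig;
                }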

            janlindstrom Jan Lindström added a comment:

            https://github.com/MariaDB/server/pull/3437
            sysprg Julius Goryavsky added a comment:

            The fix has been merged into the head revision: https://github.com/MariaDB/server/commit/54a10a429334a9579558a5d284c510d6f8b5bc97

            People

              Assignee: sysprg Julius Goryavsky
              Reporter: rpizzi Rick Pizzi (Inactive)