  MariaDB Server / MDEV-32363

When InnoDB gets an assertion failure, the WSREP layer is not handled gracefully

Details

    Description

      If InnoDB hits an assertion failure, the WSREP layer is not immediately notified, and as a consequence all nodes eventually lose primary status.

      You can see below that WSREP messages are interleaved with the assertion stack trace, and that WSREP keeps trying to reconnect to its peers even though the node has effectively crashed and will die shortly.

      When this happens, the entire cluster goes non-primary and a cluster bootstrap is needed to recover. This is not the expected behaviour: the crashed node should simply be evicted and the cluster should continue normally.
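
      For reference, the "cluster bootstrap" above is the standard Galera procedure of forcing a new primary component on one of the surviving nodes; roughly (a sketch — the node to bootstrap should be the one with the most advanced state):

      SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
      -- then verify on each node that the cluster regained primary status
      SHOW STATUS LIKE 'wsrep_cluster_status';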

      2023-07-07 15:50:19 11985573 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
      2023-07-07 15:50:19 11985573 [ERROR] mariadbd: Index for table 'failed_table' is corrupt; try to repair it
      2023-07-07 15:50:20 0x7f17f7766700  InnoDB: Assertion failure in file /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/page/page0zip.cc line 4213
      InnoDB: Failing assertion: slot_rec
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mariadbd startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      230707 15:50:20 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.6.12-7-MariaDB-enterprise-log source revision: 8e2b75dad28995ab5f6e6acd436135420f7031c9
      key_buffer_size=268435456
      read_buffer_size=131072
      max_used_connections=2243
      max_threads=6002
      thread_count=1565
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13479553 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x7ef735ab51c8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f17f7765cb8 thread_stack 0x49000
      Can't start addr2line
      /usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5608f6116c7e]
      /usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5608f5bc33a5]
      /lib64/libpthread.so.0(+0xf630)[0x7f2011c15630]
      /lib64/libc.so.6(gsignal+0x37)[0x7f2011060387]
      /lib64/libc.so.6(abort+0x148)[0x7f2011061a78]
      /usr/sbin/mariadbd(+0x694d97)[0x5608f5834d97]
      /usr/sbin/mariadbd(+0xdbfb05)[0x5608f5f5fb05]
      /usr/sbin/mariadbd(+0xdaf516)[0x5608f5f4f516]
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 766 rttvar: 579 rto: 201000 lost: 0 last_data_recv: 2567 cwnd: 10 last_queued_since: 8776161264 last_delivered_since: 11959172679 send_queue_length: 9 send_queue_bytes: 720 segment: 0 messages: 9
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.103:4567
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 6407 rttvar: 10277 rto: 207000 lost: 0 last_data_recv: 5900 cwnd: 10 last_queued_since: 307819 last_delivered_since: 8781038225 send_queue_length: 10 send_queue_bytes: 1080 segment: 0 messages: 10
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.104:42156
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 5591 rttvar: 10094 rto: 206000 lost: 0 last_data_recv: 5924 cwnd: 10 last_queued_since: 10916 last_delivered_since: 8781705783 send_queue_length: 11 send_queue_bytes: 1292 segment: 0 messages: 11
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.101:33510
      2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.1.101:4567 tcp://10.10.1.103:4567 tcp://10.10.1.104:4567 
      2023-07-07 15:50:32 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT12.0655S), skipping check
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7f17e8905e58
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7ef49da77b98
      2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
      /usr/sbin/mariadbd(+0xe62a65)[0x5608f6002a65]
      2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 96c49f4b-8727 (tcp://10.10.1.101:4567), attempt 0
      2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 844de70f-8aaf (tcp://10.10.1.103:4567), attempt 0
      2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 5dda822d-b4c3 (tcp://10.10.1.104:4567), attempt 0
      2023-07-07 15:50:33 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.73771S), skipping check
      /usr/sbin/mariadbd(+0xe4e639)[0x5608f5fee639]
      /usr/sbin/mariadbd(+0xe5063b)[0x5608f5ff063b]
      /usr/sbin/mariadbd(+0xe62e98)[0x5608f6002e98]
      /usr/sbin/mariadbd(+0xde0227)[0x5608f5f80227]
      /usr/sbin/mariadbd(+0xde2da4)[0x5608f5f82da4]
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 1000 rttvar: 500 rto: 201000 lost: 0 last_data_recv: 125408244 cwnd: 10 last_queued_since: 4421911460 last_delivered_since: 8715642835357354 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 359 rttvar: 179 rto: 201000 lost: 0 last_data_recv: 7528 cwnd: 10 last_queued_since: 120133 last_delivered_since: 120133 send_queue_length: 0 send_queue_bytes: 0
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 437 rttvar: 218 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 4131087 last_delivered_since: 4131087 send_queue_length: 0 send_queue_bytes: 0
      2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 181 rttvar: 90 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 8715642839939685 last_delivered_since: 8715642839939685 send_queue_length: 0 send_queue_bytes: 0
      2023-07-07 15:50:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.53388S), skipping check
      /usr/sbin/mariadbd(+0xe151ab)[0x5608f5fb51ab]
      2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer handle_wait Success for 0x7f17ebc5b168
      2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer destruct
      2023-07-07 15:50:43 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.61329S), skipping check
      /usr/sbin/mariadbd(+0xe15869)[0x5608f5fb5869]
      2023-07-07 15:50:44 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.63124S), skipping check
      2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 5dda822d-b4c3
      2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 844de70f-8aaf
      2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 96c49f4b-8727
      /usr/sbin/mariadbd(+0xdf28b2)[0x5608f5f928b2]
      /usr/sbin/mariadbd(+0xd43ca8)[0x5608f5ee3ca8]
      /usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x232)[0x5608f5bd12b2]
      /usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1a63)[0x5608f5a5cf33]
      /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x263e)[0x5608f597d38e]
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
      /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
      /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125424590 cwnd: 10 last_queued_since: 11598743653 last_delivered_since: 8715659181935888 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 185 rttvar: 82 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11598929590 last_delivered_since: 11598962474 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 264 rttvar: 105 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11599134047 last_delivered_since: 11599140957 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 384 rttvar: 151 rto: 201000 lost: 0 last_data_recv: 5678 cwnd: 10 last_queued_since: 11599486516 last_delivered_since: 11599494477 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      /usr/sbin/mariadbd(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x66a)[0x5608f58d093a]
      /usr/sbin/mariadbd(+0x7cfc17)[0x5608f596fc17]
      /usr/sbin/mariadbd(+0x7d3a68)[0x5608f5973a68]
      /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x101a)[0x5608f597bd6a]
      2023-07-07 15:51:04 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125430994 cwnd: 10 last_queued_since: 221024 last_delivered_since: 8715665585967344 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
      2023-07-07 15:51:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT18.0029S), skipping check
      2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,5dda822d-b4c3,50) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      	5dda822d-b4c3,0
      	844de70f-8aaf,0
      	96c49f4b-8727,0
      })
      2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,51) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      	5dda822d-b4c3,0
      	844de70f-8aaf,0
      	96c49f4b-8727,0
      })
      2023-07-07 15:51:04 11994604 [Warning] WSREP: Send action {(nil), 139603616989752, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2023-07-07 15:51:04 0 [Note] WSREP: Flow-control interval: [240, 300]
      2023-07-07 15:51:04 0 [Note] WSREP: Received NON-PRIMARY.
      2023-07-07 15:51:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6204240577)
      2023-07-07 15:51:04 11955214 [Warning] WSREP: Send action {(nil), 139599322023456, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11985855 [Warning] WSREP: Send action {(nil), 139603616990584, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11987046 [Warning] WSREP: Send action {(nil), 139599322023328, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11985820 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 46 [Note] WSREP: ================================================
      View:
        id: c3a51458-b6fd-11eb-8a80-eb35c100e72c:6204240577
        status: non-primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(1):
      	0: 6c357751-ce4f-11ed-8d5f-136e7094748b, PIXID-MDB-MASTER2
      =================================================
      2023-07-07 15:51:04 46 [Note] WSREP: Non-primary view
      2023-07-07 15:51:04 46 [Note] WSREP: Server status change synced -> connected
      2023-07-07 15:51:04 46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2023-07-07 15:51:04 11997537 [Warning] WSREP: Send action {(nil), 139603616989760, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11996647 [Warning] WSREP: Send action {(nil), 139573552218680, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11986459 [Warning] WSREP: Send action {(nil), 139736760976944, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11997986 [Warning] WSREP: Send action {(nil), 139599322023552, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11985505 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:04 11988311 [Warning] WSREP: Send action {(nil), 139607911957872, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11917365 [Warning] WSREP: Send action {(nil), 139586437121400, WRITESET} returned -107 (Transport endpoint is not connected)
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
      2023-07-07 15:51:06 11985895 [Warning] WSREP: Send action {(nil), 139590732088096, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2023-07-07 15:51:06 0 [Note] WSREP: Flow-control interval: [240, 300]
      2023-07-07 15:51:06 0 [Note] WSREP: Received NON-PRIMARY.
      2023-07-07 15:51:06 11978506 [Warning] WSREP: Send action {(nil), 139599322023472, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11997530 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988302 [Warning] WSREP: Send action {(nil), 139599322023760, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988512 [Warning] WSREP: Send action {(nil), 139736760977344, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988290 [Warning] WSREP: Send action {(nil), 139595027055344, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11998006 [Warning] WSREP: Send action {(nil), 139599322023888, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988294 [Warning] WSREP: Send action {(nil), 139603616990632, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11991973 [Warning] WSREP: Send action {(nil), 139599322023712, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988288 [Warning] WSREP: Send action {(nil), 139595027057080, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11985860 [Warning] WSREP: Send action {(nil), 139577847186808, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11997914 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11987069 [Warning] WSREP: Send action {(nil), 139599322023280, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11988354 [Warning] WSREP: Send action {(nil), 139736760976760, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11986489 [Warning] WSREP: Send action {(nil), 139564962285120, WRITESET} returned -107 (Transport endpoint is not connected)
      2023-07-07 15:51:06 11986139 [Warning] WSREP: Send action {(nil), 139582142155592, WRITESET} returned -107 (Transport endpoint is not connected)
      /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
      2023-07-07 15:51:07 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.43448S), skipping check
      /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
      /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
      2023-07-07 15:51:10 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.93087S), skipping check
      /usr/sbin/mariadbd(_ZN7sp_head15execute_triggerEP3THDPK25st_mysql_const_lex_stringS4_P13st_grant_info+0x1df)[0x5608f58d008f]
      2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 5dda822d-b4c3 tcp://10.10.1.104:4567
      2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 844de70f-8aaf tcp://10.10.1.103:4567
      /usr/sbin/mariadbd(_ZN19Table_triggers_list16process_triggersEP3THD14trg_event_type20trg_action_time_typeb+0x104)[0x5608f5a40ec4]
      2023-07-07 15:51:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.87478S), skipping check
      /usr/sbin/mariadbd(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xd99)[0x5608f5d33da9]
      /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x24c8)[0x5608f597d218]
      2023-07-07 15:51:16 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.90666S), skipping check
      /usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a]
      /usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531]
      2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567
      2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check
      /usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31]
      2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired
      evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) {
      current_view=view(view_id(REG,6c357751-8d5f,51) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      }),
      input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} },
      fifo_seq=1874639154,
      last_sent=2,
      known:
      5dda822d-b4c3 at tcp://10.10.1.104:4567
      {o=1,s=0,i=0,fs=834192942,jm=
      {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=(
      	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
      	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      )
      },
      }
      6c357751-8d5f at 
      {o=1,s=0,i=0,fs=-1,jm=
      {v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=(
      	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
      	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],}
      )
      },
      }
      844de70f-8aaf at tcp://10.10.1.103:4567
      {o=1,s=0,i=0,fs=1475544355,jm=
      {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=(
      	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
      	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
      )
      },
      }
      96c49f4b-8727 at tcp://10.10.1.101:4567
      {o=0,s=0,i=0,fs=101154494,}
       }
      2023-07-07 15:51:20 0 [Note] WSREP: no install message received
      2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb {
      	6c357751-8d5f,0
      } joined {
      } left {
      } partitioned {
      	5dda822d-b4c3,0
      	844de70f-8aaf,0
      	96c49f4b-8727,0
      })
      2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300]
      2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY.
      /usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942]
      /usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7]
      2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check
      /usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d]
      2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected)
      /usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2]
      /lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5]
      /lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]
       
      Trying to get some variables.
      Some pointers may be invalid and cause the dump to abort.
      Query (0x7ef692470578): UPDATE failed_schema.failed_table
      			SET TYPE_EVE =  NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE =  NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL  WHERE EVT_ID =  NAME_CONST('V_EVT_ID',2051583478)
       
      Connection ID (thread ID): 11985593
      Status: NOT_KILLED
       
      Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
       
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /data/mysql
      Resource Limits:
      Limit                     Soft Limit           Hard Limit           Units     
      Max cpu time              unlimited            unlimited            seconds   
      Max file size             unlimited            unlimited            bytes     
      Max data size             unlimited            unlimited            bytes     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        0                    unlimited            bytes     
      Max resident set          unlimited            unlimited            bytes     
      Max processes             805978               805978               processes 
      Max open files            1048576              1048576              files     
      Max locked memory         65536                65536                bytes     
      Max address space         unlimited            unlimited            bytes     
      Max file locks            unlimited            unlimited            locks     
      Max pending signals       805978               805978               signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Max realtime timeout      unlimited            unlimited            us        
      Core pattern: core
      

          Activity

            janlindstrom Jan Lindström added a comment -

            rpizzi Can we have the error logs from the other nodes, to determine why they could not continue as a cluster? For the node that is crashing there is nothing we can do.
            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            Can't you just immediately signal the WSREP threads?

            janlindstrom Jan Lindström added a comment -

            rpizzi Very well, but the other nodes should not drop from the cluster even if one of the nodes crashes. Do you have the error logs from the other nodes? I am looking for the reason why they dropped from the cluster.

            rpizzi Rick Pizzi (Inactive) added a comment -

            As I already explained, this was on a production system and the logs are long gone.
            I guess we can only wait for another occurrence of the issue.
            ramesh Ramesh Sivaraman added a comment - - edited

            janlindstrom Reproduced the cluster inconsistency using an RQG data load. The active nodes become unstable when one of the nodes in the cluster is forcefully killed while the RQG data load is running.
            The cluster became inconsistent, but the server did not crash as described in the issue description.
            Test case:
            1) Started a 3-node cluster
            2) Initiated an RQG run on node1 and node2
            3) Forcefully killed node2

            Node1 is disconnected from the cluster and Node3 loses its primary status. Error logs from the cluster are in logs.tar.gz.

            Node1

            node1:root@localhost> show status like '%wsrep%st%';
            +------------------------------+--------------------------------------+
            | Variable_name                | Value                                |
            +------------------------------+--------------------------------------+
            | wsrep_local_state_uuid       | 00000000-0000-0000-0000-000000000000 |
            | wsrep_last_committed         | -1                                   |
            | wsrep_flow_control_requested | false                                |
            | wsrep_cert_deps_distance     | 41.6348                              |
            | wsrep_local_state            | 5                                    |
            | wsrep_local_state_comment    | Inconsistent                         |
            | wsrep_cluster_capabilities   |                                      |
            | wsrep_cluster_conf_id        | 18446744073709551615                 |
            | wsrep_cluster_size           | 0                                    |
            | wsrep_cluster_state_uuid     | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
            | wsrep_cluster_status         | Disconnected                         |
            +------------------------------+--------------------------------------+
            11 rows in set (0.001 sec)
            

            Node3

            node3:root@localhost> show status like '%wsrep%st%';
            +------------------------------+--------------------------------------+
            | Variable_name                | Value                                |
            +------------------------------+--------------------------------------+
            | wsrep_local_state_uuid       | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
            | wsrep_last_committed         | 19996                                |
            | wsrep_flow_control_requested | false                                |
            | wsrep_cert_deps_distance     | 26.4544                              |
            | wsrep_local_state            | 0                                    |
            | wsrep_local_state_comment    | Initialized                          |
            | wsrep_cluster_weight         | 0                                    |
            | wsrep_evs_evict_list         |                                      |
            | wsrep_evs_state              | OPERATIONAL                          |
            | wsrep_gmcast_segment         | 0                                    |
            | wsrep_cluster_capabilities   |                                      |
            | wsrep_cluster_conf_id        | 18446744073709551615                 |
            | wsrep_cluster_size           | 1                                    |
            | wsrep_cluster_state_uuid     | e8298e61-400d-11ee-bed8-e3ccd61d69c8 |
            | wsrep_cluster_status         | non-Primary                          |
            +------------------------------+--------------------------------------+
            15 rows in set (0.001 sec)
             
            node3:root@localhost> 
            

            janlindstrom Jan Lindström added a comment - - edited

            Looked at the error logs: node_1 drops from the cluster because the applier gets an error and tries to do error voting:

            2023-08-21 13:32:24 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.table30_int_autoinc; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 242, Internal MariaDB error code: 1213
            2023-08-21 13:32:24 0 [Note] WSREP: Member 0(galapq) initiates vote on e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939,89ae2f2481c15ba0:  Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213;
            2023-08-21 13:32:24 8 [Note] WSREP: wsrep_before_commit: 1, 4949
            2023-08-21 13:32:24 6 [Note] WSREP: wsrep_commit_empty for 6 client_state exec client_mode high priority trans_state executing sql NULL
            2023-08-21 13:32:24 7 [Note] WSREP: wsrep_before_commit: 1, 4947
            2023-08-21 13:32:24 0 [Note] WSREP: Votes over e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939:
               0000000000000000:   2/3
               89ae2f2481c15ba0:   1/3
            Winner: 0000000000000000
            2023-08-21 13:32:24 9 [Note] WSREP: assigned new next trx id: 15048
            2023-08-21 13:32:24 6 [Note] WSREP: assigned new next trx id: 15049
            2023-08-21 13:32:24 2 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on e8298e61-400d-11ee-bed8-e3ccd61d69c8:4939
            	 at /test/galera_4x_opt/galera/src/replicator_smm.cpp:process_apply_error():1357
            

            The last node leaves the cluster because its weight is not big enough.


            serg Sergei Golubchik added a comment -

            Why can this cause nodes to no longer be consistent?

            serg Sergei Golubchik added a comment -

            Why would the other nodes be out of sync with each other? They both received the write set, they certify it and apply it; where is the inconsistency here?

            serg Sergei Golubchik added a comment -

            janlindstrom, but this means that a crash of one node can make the whole cluster unusable; where is the HA in that?

            Maybe node_2 shouldn't apply a write set until all nodes have received it. Maybe node_3 can get it from node_2. But it has to be fixed somehow, otherwise I don't know how one can claim that Galera Cluster provides HA.

            janlindstrom Jan Lindström added a comment -

            serg I think I need to dig more, because you are correct that the other nodes should be able to continue normally.

            janlindstrom Jan Lindström added a comment -

            rpizzi I tried to reproduce this with 10.6 using a 3-node cluster, a simple database with 100k rows, and then 2 connections doing inserts and 2 connections doing updates. From another connection I then triggered a crash inside ha_innobase::write_row() on node_2. The remaining nodes node_1 and node_3 stayed in primary state. Is there something special about the node configuration that I should know?

            Crash instrumentation:

             jan@jan-HP-ZBook-15u-G5:~/work/mariadb/10.6$ git diff
            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index b440613c13f..e6b90f02279 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -7844,6 +7844,12 @@ ha_innobase::write_row(
             
                    trx_t*          trx = thd_to_trx(m_user_thd);
             
            +#ifdef WITH_WSREP
            +        DBUG_EXECUTE_IF("wsrep_force_assert",
            +                       assert(0);
            +       );
            +#endif
            +
                    /* Validation checks before we commence write_row operation. */
                    if (is_read_only()) {
                            DBUG_RETURN(HA_ERR_TABLE_READONLY);
            

            How to enable it:

            SET debug_dbug = '+d,wsrep_force_assert'; call insert_t1(2000);
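
            For completeness, insert_t1 above is just a simple load procedure; a minimal hypothetical sketch (table and column names are assumed, not taken from the actual test) would be:

            -- hypothetical helper: insert n rows so that ha_innobase::write_row() is exercised
            CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(200)) ENGINE=InnoDB;
            DELIMITER //
            CREATE PROCEDURE insert_t1(IN n INT)
            BEGIN
              DECLARE i INT DEFAULT 0;
              WHILE i < n DO
                INSERT INTO t1 (b) VALUES (REPEAT('x', 100));
                SET i = i + 1;
              END WHILE;
            END//
            DELIMITER ;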
            

            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            The only thing that comes to mind is that this is a 4-node cluster with node 4 having pc.weight=0. Not sure this makes any difference.
            Also, when testing, you should actually simulate what happened in production, i.e. have an InnoDB assertion failure due to a corrupted index.
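
            For reference, that weight override is just the Galera provider option on node 4, shown here as the equivalent runtime statement (a sketch; it can equally be set via wsrep_provider_options in the node's configuration file):

            -- on node 4 only: exclude this node from quorum calculations
            SET GLOBAL wsrep_provider_options='pc.weight=0';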


            janlindstrom Jan Lindström added a comment -

            rpizzi In my understanding pc.weight=0 is not a good choice here, because it means that if one node goes down, the rest of the nodes in the cluster will lose Primary status. See https://galeracluster.com/library/documentation/weighted-quorum.html

            The InnoDB index corruption is most likely not caused by Galera and requires additional investigation. The stack trace is quite limited for this, but it is out of scope here anyway.

            rpizzi Rick Pizzi (Inactive) added a comment -

            It is the opposite: weight=0 means the node does not participate in quorum, and its online/offline status does not impact the quorum calculation.
            This ticket is not about finding the source of the index corruption. We need to find out why all nodes went non-primary when this happened.

            Thanks,
            Rick
            janlindstrom Jan Lindström added a comment - - edited

            rpizzi I tested this on 10.6 with a 4-node cluster where I set pc.weight=0 on all nodes. Then I used mysqladmin to shut down one of the nodes. All of the remaining nodes went non-Primary, as the documentation hints. It appears that a weight of 0 contributes no weight to the primary component, so such a cluster ends up in split-brain after one node drops out.
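
            That matches the weighted-quorum rule in the documentation linked above, as I read it: a component stays Primary only if its summed weight strictly exceeds half the weight of the previous Primary Component (minus nodes that left gracefully). With pc.weight=0 on every node that check can never pass:

            previous PC weight:          0 + 0 + 0 + 0 = 0
            surviving component weight:  0 + 0 + 0     = 0
            quorum check:                0 > 0.5 * 0   =>  false  =>  non-Primary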

            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            You cannot set pc.weight to 0 on all nodes.
            As I said, only node 4 had 0, so that the quorum calculation would ignore that node.
            Please try and test accordingly.

            Thanks
            Rick


            janlindstrom Jan Lindström added a comment -

            rpizzi Thanks for pointing that out. I tried again with a 4-node cluster on 10.6 where only node_4 has pc.weight=0, and with the assertion in the same place as reported. However, I could not reproduce the problem of the other nodes dropping from Primary state.

            Node_2 fails in exactly the same place:

            mysys/stacktrace.c:215(my_print_stacktrace)[0x562347b46973]
            sql/signal_handler.cc:241(handle_fatal_signal)[0x5623471ee0cb]
            libc_sigaction.c:0(__restore_rt)[0x7f94d6e3c4b0]
            nptl/pthread_kill.c:44(__pthread_kill_implementation)[0x7f94d6e90ffb]
            posix/raise.c:27(__GI_raise)[0x7f94d6e3c406]
            stdlib/abort.c:81(__GI_abort)[0x7f94d6e2287c]
            intl/loadmsgcat.c:1177(_nl_load_domain)[0x7f94d6e2279b]
            /lib/x86_64-linux-gnu/libc.so.6(+0x33b86)[0x7f94d6e33b86]
            page/page0zip.cc:4216(page_zip_dir_insert(page_cur_t*, unsigned short, unsigned char*, mtr_t*))[0x5623477bba08]
            page/page0cur.cc:2143(page_cur_insert_rec_zip(page_cur_t*, unsigned char const*, unsigned short*, mtr_t*))[0x56234779459a]
            include/page0cur.inl:195(page_cur_tuple_insert(page_cur_t*, dtuple_t const*, unsigned short**, mem_block_info_t**, unsigned long, mtr_t*))[0x56234793010c]
            btr/btr0cur.cc:2491(btr_cur_optimistic_insert(unsigned long, btr_cur_t*, unsigned short**, mem_block_info_t**, dtuple_t*, unsigned char**, big_rec_t**, unsigned long, que_thr_t*, mtr_t*))[0x56234793ba1f]
            row/row0ins.cc:2852(row_ins_clust_index_entry_low(unsigned long, btr_latch_mode, dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*))[0x562347812880]
            row/row0ins.cc:3242(row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, unsigned long))[0x562347813dd2]
            row/row0ins.cc:3368(row_ins_index_entry(dict_index_t*, dtuple_t*, que_thr_t*))[0x56234781436a]
            row/row0ins.cc:3536(row_ins_index_entry_step(ins_node_t*, que_thr_t*))[0x562347814cc2]
            row/row0ins.cc:3661(row_ins(ins_node_t*, que_thr_t*))[0x5623478151f4]
            row/row0ins.cc:3790(row_ins_step(que_thr_t*))[0x5623478159e9]
            row/row0mysql.cc:1317(row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t))[0x56234783803d]
            handler/ha_innodb.cc:7907(ha_innobase::write_row(unsigned char const*))[0x5623476550d1]
            sql/handler.cc:7639(handler::ha_write_row(unsigned char const*))[0x562347208bde]
            sql/sql_insert.cc:2166(write_record(THD*, TABLE*, st_copy_info*, select_result*))[0x562346dd4a08]
            sql/sql_insert.cc:1131(mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*))[0x562346dd1467]
            sql/sql_parse.cc:4580(mysql_execute_command(THD*, bool))[0x562346e28fbe]
            sql/sp_head.cc:3843(sp_instr_stmt::exec_core(THD*, unsigned int*))[0x562346d1e3cb]
            sql/sp_head.cc:3568(sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*))[0x562346d1d69d]
            sql/sp_head.cc:3749(sp_instr_stmt::execute(THD*, unsigned int*))[0x562346d1df53]
            sql/sp_head.cc:1442(sp_head::execute(THD*, bool))[0x562346d17047]
            sql/sp_head.cc:2485(sp_head::execute_procedure(THD*, List<Item>*))[0x562346d19fac]
            sql/sql_parse.cc:3036(do_execute_sp(THD*, sp_head*))[0x562346e23bf7]
            sql/sql_parse.cc:3282(Sql_cmd_call::execute(THD*))[0x562346e2488a]
            sql/sql_parse.cc:6024(mysql_execute_command(THD*, bool))[0x562346e2ebfa]
            sql/sql_parse.cc:8048(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x562346e34f78]
            sql/sql_parse.cc:7871(wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x562346e3462a]
            sql/sql_parse.cc:1883(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x562346e20776]
            sql/sql_parse.cc:1409(do_command(THD*, bool))[0x562346e1f1b2]
            sql/sql_connect.cc:1416(do_handle_one_connection(CONNECT*, bool))[0x562346ff7cfe]
            sql/sql_connect.cc:1320(handle_one_connection)[0x562346ff7a67]
            perfschema/pfs.cc:2203(pfs_spawn_thread)[0x5623475662a2]
            nptl/pthread_create.c:444(start_thread)[0x7f94d6e8f18a]
            x86_64/clone3.S:83(clone3)[0x7f94d6f1dbd0]
            

            From node_1

            mysql> show status like 'wsrep%';
            --------------
            show status like 'wsrep%'
            --------------
             
            +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
            | Variable_name                 | Value                                                                                                                                          |
            +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
            | wsrep_local_state_uuid        | 5f8f1ef8-4d59-11ee-b03b-1edf34a26753                                                                                                           |
            | wsrep_protocol_version        | 10                                                                                                                                             |
            | wsrep_last_committed          | 115233                                                                                                                                         |
            | wsrep_replicated              | 110496                                                                                                                                         |
            | wsrep_replicated_bytes        | 35343800                                                                                                                                       |
            | wsrep_repl_keys               | 331483                                                                                                                                         |
            | wsrep_repl_keys_bytes         | 5303768                                                                                                                                        |
            | wsrep_repl_data_bytes         | 22387370                                                                                                                                       |
            | wsrep_repl_other_bytes        | 0                                                                                                                                              |
            | wsrep_received                | 7977                                                                                                                                           |
            | wsrep_received_bytes          | 1563770                                                                                                                                        |
            | wsrep_local_commits           | 110493                                                                                                                                         |
            | wsrep_local_cert_failures     | 0                                                                                                                                              |
            | wsrep_local_replays           | 0                                                                                                                                              |
            | wsrep_local_send_queue        | 0                                                                                                                                              |
            | wsrep_local_send_queue_max    | 2                                                                                                                                              |
            | wsrep_local_send_queue_min    | 0                                                                                                                                              |
            | wsrep_local_send_queue_avg    | 1.79535e-05                                                                                                                                    |
            | wsrep_local_recv_queue        | 0                                                                                                                                              |
            | wsrep_local_recv_queue_max    | 7                                                                                                                                              |
            | wsrep_local_recv_queue_min    | 0                                                                                                                                              |
            | wsrep_local_recv_queue_avg    | 0.0208098                                                                                                                                      |
            | wsrep_local_cached_downto     | 84794                                                                                                                                          |
            | wsrep_flow_control_paused_ns  | 11834047189                                                                                                                                    |
            | wsrep_flow_control_paused     | 0.0199474                                                                                                                                      |
            | wsrep_flow_control_sent       | 0                                                                                                                                              |
            | wsrep_flow_control_recv       | 1                                                                                                                                              |
            | wsrep_flow_control_active     | false                                                                                                                                          |
            | wsrep_flow_control_requested  | false                                                                                                                                          |
            | wsrep_cert_deps_distance      | 93.4516                                                                                                                                        |
            | wsrep_apply_oooe              | 0.069626                                                                                                                                       |
            | wsrep_apply_oool              | 0.00321965                                                                                                                                     |
            | wsrep_apply_window            | 1.09509                                                                                                                                        |
            | wsrep_apply_waits             | 0                                                                                                                                              |
            | wsrep_commit_oooe             | 0                                                                                                                                              |
            | wsrep_commit_oool             | 0                                                                                                                                              |
            | wsrep_commit_window           | 1.00368                                                                                                                                        |
            | wsrep_local_state             | 4                                                                                                                                              |
            | wsrep_local_state_comment     | Synced                                                                                                                                         |
            | wsrep_cert_index_size         | 93                                                                                                                                             |
            | wsrep_causal_reads            | 11                                                                                                                                             |
            | wsrep_cert_interval           | 0.10328                                                                                                                                        |
            | wsrep_open_transactions       | 2                                                                                                                                              |
            | wsrep_open_connections        | 0                                                                                                                                              |
            | wsrep_incoming_addresses      | 127.0.0.1:16020,127.0.0.1:16022,127.0.0.1:16023                                                                                                |
            | wsrep_cluster_weight          | 2                                                                                                                                              |
            | wsrep_debug_sync_waiters      |                                                                                                                                                |
            | wsrep_desync_count            | 0                                                                                                                                              |
            | wsrep_evs_delayed             |                                                                                                                                                |
            | wsrep_evs_evict_list          |                                                                                                                                                |
            | wsrep_evs_repl_latency        | 0.000234768/0.000425666/0.0120175/0.000450524/681                                                                                              |
            | wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
            | wsrep_gcomm_uuid              | 5f8e47b5-4d59-11ee-82e6-43df7302848a                                                                                                           |
            | wsrep_gmcast_segment          | 0                                                                                                                                              |
            | wsrep_applier_thread_count    | 4                                                                                                                                              |
            | wsrep_cluster_capabilities    |                                                                                                                                                |
            | wsrep_cluster_conf_id         | 3                                                                                                                                              |
            | wsrep_cluster_size            | 3                                                                                                                                              |
            | wsrep_cluster_state_uuid      | 5f8f1ef8-4d59-11ee-b03b-1edf34a26753                                                                                                           |
            | wsrep_cluster_status          | Primary                                                                                                                                        |
            | wsrep_connected               | ON                                                                                                                                             |
            | wsrep_local_bf_aborts         | 0                                                                                                                                              |
            | wsrep_local_index             | 0                                                                                                                                              |
            | wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
            | wsrep_provider_name           | Galera                                                                                                                                         |
            | wsrep_provider_vendor         | Codership Oy <info@codership.com>                                                                                                              |
            | wsrep_provider_version        | 26.4.14(r75464733)                                                                                                                             |
            | wsrep_ready                   | ON                                                                                                                                             |
            | wsrep_rollbacker_thread_count | 1                                                                                                                                              |
            | wsrep_thread_count            | 5                                                                                                                                              |
            +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
            70 rows in set (0,00 sec)
            

            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            I rechecked the logs of this failure. This happened on node 2.
            It appears that after the assertion, the asserting thread took a VERY long time to dump the stack, and a core file also had to be generated after that.

            See below for the sequence; you can clearly see that while the asserting thread is dumping the stack, wsrep is still talking to the other nodes.
            Hope this will help. Maybe you should enable core-file and see if that makes a difference.
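            For reference, the "Resource Limits" section further down in this log shows a soft core file size limit of 0, i.e. core dumps were disabled on this node. A minimal sketch of how core dumps could be enabled for mariadbd on a systemd-managed Linux host follows; the file paths and the core pattern below are only illustrative and are not taken from this installation:

                # option file (illustrative path): ask mariadbd to write a core on fatal signals
                # /etc/my.cnf.d/core.cnf
                [mariadbd]
                core-file

                # systemd drop-in (illustrative path): lift the core size limit for the service
                # /etc/systemd/system/mariadb.service.d/core.conf
                [Service]
                LimitCORE=infinity

                # kernel side: choose where core files are written (example pattern only)
                sysctl -w kernel.core_pattern=/var/crash/core.%e.%p

            With the core limit raised, the "Writing a core file..." step at the end of the crash handler can actually produce a dump, which would make it possible to check whether core writing changes the timing described above.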

            : NO)
            2023-07-07 15:50:19 11985573 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-07-07 15:50:19 11985573 [ERROR] mariadbd: Index for table 'MAJ_EVENEMENTS_RAPPROCHEMENT' is corrupt; try to repair it
            2023-07-07 15:50:20 0x7f17f7766700  InnoDB: Assertion failure in file /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/page/page0zip.cc line 4213
            InnoDB: Failing assertion: slot_rec
            InnoDB: We intentionally generate a memory trap.
            InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
            InnoDB: If you get repeated assertion failures or crashes, even
            InnoDB: immediately after the mariadbd startup, there may be
            InnoDB: corruption in the InnoDB tablespace. Please refer to
            InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
            InnoDB: about forcing recovery.
            230707 15:50:20 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed, 
            something is definitely wrong and this may fail.
             
            Server version: 10.6.12-7-MariaDB-enterprise-log source revision: 8e2b75dad28995ab5f6e6acd436135420f7031c9
            key_buffer_size=268435456
            read_buffer_size=131072
            max_used_connections=2243
            max_threads=6002
            thread_count=1565
            It is possible that mysqld could use up to 
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 13479553 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x7ef735ab51c8
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x7f17f7765cb8 thread_stack 0x49000
            Can't start addr2line
            /usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5608f6116c7e]
            /usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5608f5bc33a5]
            /lib64/libpthread.so.0(+0xf630)[0x7f2011c15630]
            /lib64/libc.so.6(gsignal+0x37)[0x7f2011060387]
            /lib64/libc.so.6(abort+0x148)[0x7f2011061a78]
            /usr/sbin/mariadbd(+0x694d97)[0x5608f5834d97]
            /usr/sbin/mariadbd(+0xdbfb05)[0x5608f5f5fb05]
            /usr/sbin/mariadbd(+0xdaf516)[0x5608f5f4f516]
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 766 rttvar: 579 rto: 201000 lost: 0 last_data_recv: 2567 cwnd: 10 last_queued_since: 8776161264 last_delivered_since: 11959172679 send_queue_length: 9 send_queue_bytes: 720 segment: 0 messages: 9
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.103:4567
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 6407 rttvar: 10277 rto: 207000 lost: 0 last_data_recv: 5900 cwnd: 10 last_queued_since: 307819 last_delivered_since: 8781038225 send_queue_length: 10 send_queue_bytes: 1080 segment: 0 messages: 10
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.104:42156
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 5591 rttvar: 10094 rto: 206000 lost: 0 last_data_recv: 5924 cwnd: 10 last_queued_since: 10916 last_delivered_since: 8781705783 send_queue_length: 11 send_queue_bytes: 1292 segment: 0 messages: 11
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://10.10.1.101:33510
            2023-07-07 15:50:32 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.1.101:4567 tcp://10.10.1.103:4567 tcp://10.10.1.104:4567 
            2023-07-07 15:50:32 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT12.0655S), skipping check
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7f17e8905e58
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7ef49da77b98
            2023-07-07 15:50:32 0 [Note] WSREP: Deferred close timer destruct
            /usr/sbin/mariadbd(+0xe62a65)[0x5608f6002a65]
            2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 96c49f4b-8727 (tcp://10.10.1.101:4567), attempt 0
            2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 844de70f-8aaf (tcp://10.10.1.103:4567), attempt 0
            2023-07-07 15:50:33 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') reconnecting to 5dda822d-b4c3 (tcp://10.10.1.104:4567), attempt 0
            2023-07-07 15:50:33 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.73771S), skipping check
            /usr/sbin/mariadbd(+0xe4e639)[0x5608f5fee639]
            /usr/sbin/mariadbd(+0xe5063b)[0x5608f5ff063b]
            /usr/sbin/mariadbd(+0xe62e98)[0x5608f6002e98]
            /usr/sbin/mariadbd(+0xde0227)[0x5608f5f80227]
            /usr/sbin/mariadbd(+0xde2da4)[0x5608f5f82da4]
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 1000 rttvar: 500 rto: 201000 lost: 0 last_data_recv: 125408244 cwnd: 10 last_queued_since: 4421911460 last_delivered_since: 8715642835357354 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 359 rttvar: 179 rto: 201000 lost: 0 last_data_recv: 7528 cwnd: 10 last_queued_since: 120133 last_delivered_since: 120133 send_queue_length: 0 send_queue_bytes: 0
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 437 rttvar: 218 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 4131087 last_delivered_since: 4131087 send_queue_length: 0 send_queue_bytes: 0
            2023-07-07 15:50:41 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 181 rttvar: 90 rto: 201000 lost: 0 last_data_recv: 7532 cwnd: 10 last_queued_since: 8715642839939685 last_delivered_since: 8715642839939685 send_queue_length: 0 send_queue_bytes: 0
            2023-07-07 15:50:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.53388S), skipping check
            /usr/sbin/mariadbd(+0xe151ab)[0x5608f5fb51ab]
            2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer handle_wait Success for 0x7f17ebc5b168
            2023-07-07 15:50:43 0 [Note] WSREP: Deferred close timer destruct
            2023-07-07 15:50:43 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.61329S), skipping check
            /usr/sbin/mariadbd(+0xe15869)[0x5608f5fb5869]
            2023-07-07 15:50:44 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.63124S), skipping check
            2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 5dda822d-b4c3
            2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 844de70f-8aaf
            2023-07-07 15:50:46 0 [Note] WSREP: evs::proto(6c357751-8d5f, OPERATIONAL, view_id(REG,5dda822d-b4c3,50)) detected inactive node: 96c49f4b-8727
            /usr/sbin/mariadbd(+0xdf28b2)[0x5608f5f928b2]
            /usr/sbin/mariadbd(+0xd43ca8)[0x5608f5ee3ca8]
            /usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x232)[0x5608f5bd12b2]
            /usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1a63)[0x5608f5a5cf33]
            /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x263e)[0x5608f597d38e]
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
            /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
            /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125424590 cwnd: 10 last_queued_since: 11598743653 last_delivered_since: 8715659181935888 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 5dda822d-b4c3 with addr tcp://10.10.1.104:4567 timed out, no messages seen in PT6S, socket stats: rtt: 185 rttvar: 82 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11598929590 last_delivered_since: 11598962474 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 844de70f-8aaf with addr tcp://10.10.1.103:4567 timed out, no messages seen in PT6S, socket stats: rtt: 264 rttvar: 105 rto: 201000 lost: 0 last_data_recv: 5508 cwnd: 10 last_queued_since: 11599134047 last_delivered_since: 11599140957 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:50:57 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 96c49f4b-8727 with addr tcp://10.10.1.101:4567 timed out, no messages seen in PT6S, socket stats: rtt: 384 rttvar: 151 rto: 201000 lost: 0 last_data_recv: 5678 cwnd: 10 last_queued_since: 11599486516 last_delivered_since: 11599494477 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            /usr/sbin/mariadbd(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x66a)[0x5608f58d093a]
            /usr/sbin/mariadbd(+0x7cfc17)[0x5608f596fc17]
            /usr/sbin/mariadbd(+0x7d3a68)[0x5608f5973a68]
            /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x101a)[0x5608f597bd6a]
            2023-07-07 15:51:04 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT6S, socket stats: rtt: 0 rttvar: 0 rto: 200000 lost: 0 last_data_recv: 125430994 cwnd: 10 last_queued_since: 221024 last_delivered_since: 8715665585967344 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
            2023-07-07 15:51:04 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT18.0029S), skipping check
            2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,5dda822d-b4c3,50) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            	5dda822d-b4c3,0
            	844de70f-8aaf,0
            	96c49f4b-8727,0
            })
            2023-07-07 15:51:04 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,51) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            	5dda822d-b4c3,0
            	844de70f-8aaf,0
            	96c49f4b-8727,0
            })
            2023-07-07 15:51:04 11994604 [Warning] WSREP: Send action {(nil), 139603616989752, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            2023-07-07 15:51:04 0 [Note] WSREP: Flow-control interval: [240, 300]
            2023-07-07 15:51:04 0 [Note] WSREP: Received NON-PRIMARY.
            2023-07-07 15:51:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6204240577)
            2023-07-07 15:51:04 11955214 [Warning] WSREP: Send action {(nil), 139599322023456, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11985855 [Warning] WSREP: Send action {(nil), 139603616990584, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11987046 [Warning] WSREP: Send action {(nil), 139599322023328, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11985820 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 46 [Note] WSREP: ================================================
            View:
              id: c3a51458-b6fd-11eb-8a80-eb35c100e72c:6204240577
              status: non-primary
              protocol_version: 4
              capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
              final: no
              own_index: 0
              members(1):
            	0: 6c357751-ce4f-11ed-8d5f-136e7094748b, PIXID-MDB-MASTER2
            =================================================
            2023-07-07 15:51:04 46 [Note] WSREP: Non-primary view
            2023-07-07 15:51:04 46 [Note] WSREP: Server status change synced -> connected
            2023-07-07 15:51:04 46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            2023-07-07 15:51:04 11997537 [Warning] WSREP: Send action {(nil), 139603616989760, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11996647 [Warning] WSREP: Send action {(nil), 139573552218680, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11986459 [Warning] WSREP: Send action {(nil), 139736760976944, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11997986 [Warning] WSREP: Send action {(nil), 139599322023552, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11985505 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:04 11988311 [Warning] WSREP: Send action {(nil), 139607911957872, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11917365 [Warning] WSREP: Send action {(nil), 139586437121400, WRITESET} returned -107 (Transport endpoint is not connected)
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x38)[0x5608f58cb718]
            2023-07-07 15:51:06 11985895 [Warning] WSREP: Send action {(nil), 139590732088096, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            2023-07-07 15:51:06 0 [Note] WSREP: Flow-control interval: [240, 300]
            2023-07-07 15:51:06 0 [Note] WSREP: Received NON-PRIMARY.
            2023-07-07 15:51:06 11978506 [Warning] WSREP: Send action {(nil), 139599322023472, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11997530 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988302 [Warning] WSREP: Send action {(nil), 139599322023760, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988512 [Warning] WSREP: Send action {(nil), 139736760977344, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988290 [Warning] WSREP: Send action {(nil), 139595027055344, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11998006 [Warning] WSREP: Send action {(nil), 139599322023888, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988294 [Warning] WSREP: Send action {(nil), 139603616990632, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11991973 [Warning] WSREP: Send action {(nil), 139599322023712, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988288 [Warning] WSREP: Send action {(nil), 139595027057080, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11985860 [Warning] WSREP: Send action {(nil), 139577847186808, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11997914 [Warning] WSREP: Send action {(nil), 139599322023336, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11987069 [Warning] WSREP: Send action {(nil), 139599322023280, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11988354 [Warning] WSREP: Send action {(nil), 139736760976760, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11986489 [Warning] WSREP: Send action {(nil), 139564962285120, WRITESET} returned -107 (Transport endpoint is not connected)
            2023-07-07 15:51:06 11986139 [Warning] WSREP: Send action {(nil), 139582142155592, WRITESET} returned -107 (Transport endpoint is not connected)
            /usr/sbin/mariadbd(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x176)[0x5608f58d48b6]
            2023-07-07 15:51:07 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.43448S), skipping check
            /usr/sbin/mariadbd(_ZN13sp_instr_stmt7executeEP3THDPj+0x5bc)[0x5608f58d529c]
            /usr/sbin/mariadbd(_ZN7sp_head7executeEP3THDb+0xa0c)[0x5608f58ceeac]
            2023-07-07 15:51:10 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.93087S), skipping check
            /usr/sbin/mariadbd(_ZN7sp_head15execute_triggerEP3THDPK25st_mysql_const_lex_stringS4_P13st_grant_info+0x1df)[0x5608f58d008f]
            2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 5dda822d-b4c3 tcp://10.10.1.104:4567
            2023-07-07 15:51:11 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 844de70f-8aaf tcp://10.10.1.103:4567
            /usr/sbin/mariadbd(_ZN19Table_triggers_list16process_triggersEP3THD14trg_event_type20trg_action_time_typeb+0x104)[0x5608f5a40ec4]
            2023-07-07 15:51:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.87478S), skipping check
            /usr/sbin/mariadbd(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xd99)[0x5608f5d33da9]
            /usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x24c8)[0x5608f597d218]
            2023-07-07 15:51:16 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.90666S), skipping check
            /usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a]
            /usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531]
            2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567
            2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check
            /usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31]
            2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired
            evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) {
            current_view=view(view_id(REG,6c357751-8d5f,51) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            }),
            input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} },
            fifo_seq=1874639154,
            last_sent=2,
            known:
            5dda822d-b4c3 at tcp://10.10.1.104:4567
            {o=1,s=0,i=0,fs=834192942,jm=
            {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=(
            	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
            	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            )
            },
            }
            6c357751-8d5f at 
            {o=1,s=0,i=0,fs=-1,jm=
            {v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=(
            	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
            	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],}
            )
            },
            }
            844de70f-8aaf at tcp://10.10.1.103:4567
            {o=1,s=0,i=0,fs=1475544355,jm=
            {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=(
            	5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],}
            	844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            	96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],}
            )
            },
            }
            96c49f4b-8727 at tcp://10.10.1.101:4567
            {o=0,s=0,i=0,fs=101154494,}
             }
            2023-07-07 15:51:20 0 [Note] WSREP: no install message received
            2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb {
            	6c357751-8d5f,0
            } joined {
            } left {
            } partitioned {
            	5dda822d-b4c3,0
            	844de70f-8aaf,0
            	96c49f4b-8727,0
            })
            2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300]
            2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY.
            /usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942]
            /usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7]
            2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check
            /usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d]
            2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected)
            /usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2]
            /lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5]
            /lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]
             
            Trying to get some variables.
            Some pointers may be invalid and cause the dump to abort.
            Query (0x7ef692470578): UPDATE DWHTmp.MAJ_EVENEMENTS_RAPPROCHEMENT
            			SET TYPE_EVE =  NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE =  NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL  WHERE EVT_ID =  NAME_CONST('V_EVT_ID',2051583478)
             
            Connection ID (thread ID): 11985593
            Status: NOT_KILLED
             
            Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
             
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
            information that should help you find out what is causing the crash.
            Writing a core file...
            Working directory at /data/mysql
            Resource Limits:
            Limit                     Soft Limit           Hard Limit           Units     
            Max cpu time              unlimited            unlimited            seconds   
            Max file size             unlimited            unlimited            bytes     
            Max data size             unlimited            unlimited            bytes     
            Max stack size            8388608              unlimited            bytes     
            Max core file size        0                    unlimited            bytes     
            Max resident set          unlimited            unlimited            bytes     
            Max processes             805978               805978               processes 
            Max open files            1048576              1048576              files     
            Max locked memory         65536                65536                bytes     
            Max address space         unlimited            unlimited            bytes     
            Max file locks            unlimited            unlimited            locks     
            Max pending signals       805978               805978               signals   
            Max msgqueue size         819200               819200               bytes     
            Max nice priority         0                    0                    
            Max realtime priority     0                    0                    
            Max realtime timeout      unlimited            unlimited            us        
            Core pattern: core
             
            Kernel version: Linux version 3.10.0-1160.88.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Mar 7 15:41:52 UTC 2023
             
            2023-07-07 17:08:23 0 [Note] Starting MariaDB 10.6.12-7-MariaDB-enterprise-log source revision 8e2b75dad28995ab5f6e6acd436135420f7031c9 as process 1083
            

PT1.5S ago (PT2.90666S), skipping check /usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x20a)[0x5608f5980c9a] /usr/sbin/mariadbd(+0x7e1531)[0x5608f5981531] 2023-07-07 15:51:19 0 [Note] WSREP: (6c357751-8d5f, 'tcp://0.0.0.0:4567') connection established to 96c49f4b-8727 tcp://10.10.1.101:4567 2023-07-07 15:51:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.83971S), skipping check /usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x29e1)[0x5608f5984c31] 2023-07-07 15:51:20 0 [Warning] WSREP: evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)) install timer expired evs::proto(evs::proto(6c357751-8d5f, GATHER, view_id(REG,6c357751-8d5f,51)), GATHER) { current_view=view(view_id(REG,6c357751-8d5f,51) memb { 6c357751-8d5f,0 } joined { } left { } partitioned { }), input_map=evs::input_map: {aru_seq=2,safe_seq=2,node_index=node: {idx=0,range=[3,2],safe_seq=2} }, fifo_seq=1874639154, last_sent=2, known: 5dda822d-b4c3 at tcp://10.10.1.104:4567 {o=1,s=0,i=0,fs=834192942,jm= {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=5dda822d-b4c3,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=834192942,nl=( 5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],} 844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} ) }, } 6c357751-8d5f at {o=1,s=0,i=0,fs=-1,jm= {v=1,t=4,ut=255,o=1,s=2,sr=-1,as=2,f=0,src=6c357751-8d5f,srcvid=view_id(REG,6c357751-8d5f,51),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1874639154,nl=( 5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],} 844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 96c49f4b-8727, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,00000000-0000,0),ss=-1,ir=[-1,-1],} ) }, } 844de70f-8aaf at tcp://10.10.1.103:4567 {o=1,s=0,i=0,fs=1475544355,jm= {v=1,t=4,ut=255,o=1,s=122,sr=-1,as=122,f=4,src=844de70f-8aaf,srcvid=view_id(REG,5dda822d-b4c3,52),insvid=view_id(UNKNOWN,00000000-0000,0),ru=00000000-0000,r=[-1,-1],fs=1475544355,nl=( 5dda822d-b4c3, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,6c357751-8d5f,51),ss=2,ir=[3,2],} 844de70f-8aaf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} 96c49f4b-8727, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,52),ss=122,ir=[123,122],} ) }, } 96c49f4b-8727 at tcp://10.10.1.101:4567 {o=0,s=0,i=0,fs=101154494,} } 2023-07-07 15:51:20 0 [Note] WSREP: no install message received 2023-07-07 15:51:20 0 [Note] WSREP: view(view_id(NON_PRIM,6c357751-8d5f,52) memb { 6c357751-8d5f,0 } joined { } left { } partitioned { 5dda822d-b4c3,0 844de70f-8aaf,0 96c49f4b-8727,0 }) 2023-07-07 15:51:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 2023-07-07 15:51:20 0 [Note] WSREP: Flow-control interval: [240, 300] 2023-07-07 15:51:20 0 [Note] WSREP: Received NON-PRIMARY. 
/usr/sbin/mariadbd(_Z10do_commandP3THDb+0x132)[0x5608f5985942] /usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3b7)[0x5608f5aa2dd7] 2023-07-07 15:51:23 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.34509S), skipping check /usr/sbin/mariadbd(handle_one_connection+0x5d)[0x5608f5aa311d] 2023-07-07 15:51:24 11986441 [Warning] WSREP: Send action {(nil), 139599322023528, WRITESET} returned -107 (Transport endpoint is not connected) /usr/sbin/mariadbd(+0xc839d2)[0x5608f5e239d2] /lib64/libpthread.so.0(+0x7ea5)[0x7f2011c0dea5] /lib64/libc.so.6(clone+0x6d)[0x7f2011128b0d]   Trying to get some variables. Some pointers may be invalid and cause the dump to abort. Query (0x7ef692470578): UPDATE DWHTmp.MAJ_EVENEMENTS_RAPPROCHEMENT SET TYPE_EVE = NAME_CONST('V_TYPE_EVE',_utf8mb3'D' COLLATE 'utf8mb3_general_ci') , DATE_EFFECTIVE = NAME_CONST('V_DATE_EFFECTIVE',TIMESTAMP'2023-07-07 15:50:20'), STATUT_DATAMART = NULL WHERE EVT_ID = NAME_CONST('V_EVT_ID',2051583478)   Connection ID (thread ID): 11985593 Status: NOT_KILLED   Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off   The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains information that should help you find out what is causing the crash. Writing a core file... Working directory at /data/mysql Resource Limits: Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 8388608 unlimited bytes Max core file size 0 unlimited bytes Max resident set unlimited unlimited bytes Max processes 805978 805978 processes Max open files 1048576 1048576 files Max locked memory 65536 65536 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 805978 805978 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 Max realtime timeout unlimited unlimited us Core pattern: core   Kernel version: Linux version 3.10.0-1160.88.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Mar 7 15:41:52 UTC 2023   2023-07-07 17:08:23 0 [Note] Starting MariaDB 10.6.12-7-MariaDB-enterprise-log source revision 8e2b75dad28995ab5f6e6acd436135420f7031c9 as process 1083

            janlindstrom Jan Lindström added a comment:

            rpizzi Based on these error logs, the remaining nodes did not agree on the state of the absent node, so they decided to exclude each other from the group. See here:

            	6c357751-8d5f, {o=0,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
            vs.
            	6c357751-8d5f, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
            vs
            	6c357751-8d5f, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,5dda822d-b4c3,50),ss=50608928,ir=[50608930,50608929],}
            

            where o = operational and s = suspected. The first node thinks both are 0 (false), the second thinks the node is suspected, and the last thinks it is still operational. This depends on input and timing.

            The inconsistency issue found by ramesh is a bug (MDEV-32122), but it is not related to the problem here.

            rpizzi Rick Pizzi (Inactive) added a comment (edited):

            Well, this confirms the issue, doesn't it...
            I believe the disagreement comes from the fact that the WSREP layer did not die when the server did, and hence responded to requests from other nodes in an inconsistent manner.


            janlindstrom Jan Lindström added a comment:

            rpizzi Yes. serg Signals that are raised or sent to the process (instead of a specific thread) will still be handled by a random thread (among those that do not block them). So is there any way to make those wsrep threads die faster? On the other nodes there is not much we can do, as their knowledge of the node's state depends on when state information was last received and/or requested, but it would help if the crashing node's threads that talk to the other nodes went down as soon as possible.
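
            (For background, this is standard POSIX behaviour: a signal sent to the process is delivered to an arbitrary thread that has not blocked it. A thread opts out only by masking the signal, roughly as in this generic sketch, which is an illustration and not server code.)

                #include <csignal>
                #include <pthread.h>

                // Generic POSIX illustration, not MariaDB server code: a thread that
                // blocks SIGABRT will never be chosen to run the handler for a
                // process-directed abort; delivery falls to some other thread that
                // leaves the signal unblocked.
                static void block_sigabrt_in_this_thread()
                {
                  sigset_t set;
                  sigemptyset(&set);
                  sigaddset(&set, SIGABRT);
                  pthread_sigmask(SIG_BLOCK, &set, nullptr);
                }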


            monty Michael Widenius added a comment:

            When the server crashes, there is usually not much to do except print a stack trace and call exit().
            In theory we could, on crash, send some kind of signal to the WSREP threads (if they have a THD, then we can mark it as killed).
            Would marking the THD as killed help?
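
            (A minimal sketch of what "mark the THD as killed" could look like, assuming a hypothetical accessor for the wsrep applier THDs; THD::awake() with KILL_CONNECTION is the server's usual kill mechanism, but whether calling it this late in a crash is safe is exactly the open question in this discussion.)

                // Hypothetical sketch only; wsrep_applier_thds() is an illustrative
                // accessor, not an existing server function. THD and KILL_CONNECTION
                // are the server's own types.
                static void wsrep_mark_appliers_killed()
                {
                  for (THD *thd : wsrep_applier_thds())
                  {
                    // Ask the applier to terminate; it notices the killed flag at its
                    // next check point, which may still be too late during a crash.
                    thd->awake(KILL_CONNECTION);
                  }
                }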


            janlindstrom Jan Lindström added a comment:

            Marking the THD as killed would help only for the appliers, not for the thread(s) used inside the Galera library for connections to other nodes.


            rpizzi Rick Pizzi (Inactive) added a comment:

            monty the issue here is that "printing a stack trace" takes, like, 5 minutes.
            In the meantime the WSREP threads are alive and sending inconsistent information to other nodes.


            janlindstrom Jan Lindström added a comment:

            marko serg Is there anything we can do about this, given that we are talking about assertions in release builds? Whatever we do needs to happen before we enter the signal handler and start producing a core dump (if we produce one), as that can take a long time. Currently, maybe we could set the THD to the killed state for the wsrep applier threads (not sure if this is enough). But that still does not mean the node is unreachable from Galera's point of view. There is currently no way to disconnect all incoming/outgoing connections inside Galera from the server code; that would partly solve the issue, but again it depends on timing, i.e. when the other nodes ask about or discover the crashing node's status, and because this is asynchronous there is still a risk that the remaining nodes do not agree on the state of the crashing node (i.e. whether it is operational or suspected).


            serg Sergei Golubchik added a comment:

            janlindstrom, why do the remaining nodes not agree on the state of the crashing node?


            janlindstrom Jan Lindström added a comment:

            serg See my comment of 2023-09-07 10:47: there, one node thinks the node is down, the second thinks it is suspected, and the last thinks it is still operational. This is because nodes decide that a node is down, or suspected to be down, based on the information they have received, and when they receive it is timing dependent. As for why one node thinks the crashing node is still operational: it must be because it received something from that node before (or during) the crash, and the information that the node is down has not yet arrived, or some connection timeout has not yet been reached.

            My question was not how to improve this agreement on node states; it was how to make the crashing node more unreachable, e.g. by killing the appliers and closing all incoming and outgoing connections earlier.


            serg Sergei Golubchik added a comment:

            If the crashing node needs to send three messages, sequentially, to three different nodes, then there will always be a race condition. You kill them earlier, you kill them later; whatever you do, the node won't die instantly as a whole. Galera must be able to cope with it, otherwise any node crash can break the cluster.

            But I don't understand why Galera cannot cope with it. Nodes send messages to peers in a specific order. So if node A with a lower number thinks that some node X is up and node B with a higher number thinks that node X is down, it means that node X crashed after sending a message to A but before sending a message to B. This is easy to detect.


            janlindstrom Jan Lindström added a comment:

            teemu.ollakka Can you explain why the remaining nodes fail to agree on the crashing node's state and start to self-leave?


            marko Marko Mäkelä added a comment:

            The reported assertion failure here occurs when a record is being inserted into a corrupted ROW_FORMAT=COMPRESSED InnoDB page. This crash was not removed in MDEV-13542. An obvious workaround for this particular case would be to avoid using ROW_FORMAT=COMPRESSED tables. Some design mistakes are not easy to fix; see MDEV-30882 and MDEV-31574.


            janlindstrom Jan Lindström added a comment:

            rpizzi I tried to reproduce this issue with the latest 10.6 and Galera library 26.4.17, using the attached test case. After several hours of testing, I still could not reproduce the issue.


            marko Marko Mäkelä added a comment:

            janlindstrom, I see that galera_crash_node.test uses debug injection for crashing a node at a specific point of execution. I think that a more realistic test scenario would be to run CMAKE_BUILD_TYPE=RelWithDebInfo executables and randomly kill one of the cluster nodes externally (by kill -KILL).

            janlindstrom Jan Lindström added a comment (edited):

            I tested again for hours with the following setup:

            • 10.6 commit bde552ae RelWithDebInfo build
            • Galera library 26.4.17 release build
            • 3 node cluster
            • sysbench load to node_1 and node_3 (oltp_read_write)
            • kill -9 node_3 after a while and restart node_3 + sysbench load

            Result: the remaining nodes stayed up and running as expected, i.e. I could not reproduce the issue.


            rpizzi Rick Pizzi (Inactive) added a comment:

            I don't think that killing the node with kill -9 will ever reproduce it.
            As explained, it has to be a code assertion.

            Rick


            rpizzi Rick Pizzi (Inactive) added a comment:

            The whole point of this ticket is that the WSREP layer remains active after the assertion generates the trap.
            Killing the process with SIGKILL will not allow the code to do anything, including executing the trap code.


            marko Marko Mäkelä added a comment:

            janlindstrom, did you test with kill -ABRT as well? I think that it should trigger our built-in stack trace reporter, which, depending on the circumstances, could hang or cause unexpected behaviour.


            janlindstrom Jan Lindström added a comment:

            rpizzi marko Both cases were tested (with several test rounds) and the remaining nodes stayed in the cluster normally, i.e. I could not reproduce a case where all nodes leave the cluster.


            janlindstrom Jan Lindström added a comment:

            teemu.ollakka Can you try to explain why nodes could disagree on the state of the crashing node, and what we could do about wsrep connections and threads in the signal handler?


            marko Marko Mäkelä added a comment:

            It just occurred to me that, according to the error log excerpt in the Description, the Galera node was stuck for more than 60 seconds trying to produce a stack trace of the crashing thread, which in my experience is mostly useless for anything that involves InnoDB or Galera, because it covers only one thread and typically resolves many stack traces incorrectly. Besides, some of the invoked functions would seem to be unsafe according to man 7 signal-safety.

            Would this problem be alleviated by configuring the nodes with --skip-stack-trace so that they fail faster in the event of a fatal error? For post-mortem analysis, core dumps could still be generated independently of this option.
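
            (For example, a node could be configured roughly like this; both settings are standard mariadbd options, shown here only to illustrate the suggestion:)

                [mysqld]
                skip_stack_trace    # fail fast on fatal signals, no in-process backtrace
                core_file           # still write a core dump for post-mortem analysis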


            marko Marko Mäkelä added a comment:

            MDEV-21010 appears to be a very similar bug report about the built-in stack trace reporter.


            janlindstrom Jan Lindström added a comment:

            Fixed in Galera library 26.4.19.


            marko Marko Mäkelä added a comment:

            janlindstrom, can you comment on my observation about the time it took to attempt to produce stack traces in the Description, which apparently ran for 64 seconds between 15:50:20 and 15:51:24? The stack trace output is interleaved with other messages.

            I would think that when a process is killed, all connections are torn down and any peer processes are notified. If an assertion failure causes the process to stop serving requests, while the connection sockets are held open (in a stuck state) for as long as the built-in stack trace reporter is running (and I have seen it actually hang in other cases), then the peer processes could remain blocked for a long time.

            What exactly do you think would be fixed in the Galera library? Would there be some kind of inactivity timeout in the peer process?


            janlindstrom Jan Lindström added a comment:

            marko The Galera library has a method that can be called so that all connections are closed, i.e. the node is isolated from the rest of the cluster. However, the actual server code part is still missing (I did not notice that at first). This works as follows: from the signal handler inside the server code we call this node isolation function to isolate the node from the rest of the cluster.

            I have not been able to reproduce the problem locally; I have tried several times with different methods.
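
            (Conceptually, the server-side part could look like the sketch below. The handler signature is simplified and wsrep_isolate_node_from_cluster() is an illustrative name, not the actual symbol; the real change is in the pull request referenced in the next comment.)

                // Illustrative sketch, not the actual patch: before the crash handler
                // spends time on stack traces and core dumps, ask the Galera provider
                // to drop all group communication, so that peers see the node leave at
                // once instead of talking to a half-dead process.
                extern bool wsrep_enabled;                   // stand-in for the real WSREP_ON check
                void wsrep_isolate_node_from_cluster();      // hypothetical wrapper around the
                                                             // provider's node-isolation call

                extern "C" void handle_fatal_signal_sketch(int sig)
                {
                  if (wsrep_enabled)
                    wsrep_isolate_node_from_cluster();       // isolate as early as possible
                  /* ... then the existing stack trace / core dump handling ... */
                  (void) sig;
                }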

            janlindstrom Jan Lindström added a comment:

            https://github.com/MariaDB/server/pull/3437
            sysprg Julius Goryavsky added a comment:

            The fix has been merged into the head revision: https://github.com/MariaDB/server/commit/54a10a429334a9579558a5d284c510d6f8b5bc97

            People

              Assignee: sysprg Julius Goryavsky
              Reporter: rpizzi Rick Pizzi (Inactive)