MariaDB Server / MDEV-33349

Crash when a new node attempts to join the Galera cluster

Details

    Description

      The Kubernetes operator for MariaDB provisions MariaDB clusters by creating containers one by one, waiting until the `wsrep_ready` status variable reports ON before proceeding. This ensures that only one node attempts to join the cluster at a given time.
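A readiness check along these lines can be sketched in shell. This is an illustrative sketch, not the operator's actual code; the `is_wsrep_ready` helper and the `mariadb` invocation shown in the comments are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a wsrep_ready readiness check. The helper parses
# the tab-separated output of:
#   mariadb -N -B -e "SHOW STATUS LIKE 'wsrep_ready'"
# and succeeds only when the node reports wsrep_ready = ON.
is_wsrep_ready() {
    awk '$1 == "wsrep_ready" { exit ($2 == "ON") ? 0 : 1 }'
}

# In a real probe this would be wired to the client, e.g.:
#   mariadb -u root -p"$MARIADB_ROOT_PASSWORD" -N -B \
#       -e "SHOW STATUS LIKE 'wsrep_ready'" | is_wsrep_ready
```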

      When a new node joins the cluster, Galera performs an SST, choosing an existing node as donor and transferring its state to the new node in order to initialize it. We've seen this process fail sometimes when bootstrapping the cluster or when a node goes down: the unhealthy container gets restarted by Kubernetes and a new one is created, which means the SST is retried. This repeats until the container reaches a healthy state and the node becomes part of the cluster.

      Here are the configuration files for each node:

      mariadb-galera-0

      [mariadb]
      bind-address=0.0.0.0
      default_storage_engine=InnoDB
      binlog_format=row
      innodb_autoinc_lock_mode=2
       
      # Cluster configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_cluster_name=mariadb-operator
      wsrep_slave_threads=1
       
      # Node configuration
      wsrep_node_address="mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_node_name="mariadb-galera-0"
      wsrep_sst_method="mariabackup"
      wsrep_sst_auth="<user>:<password>"
      

      mariadb-galera-1

      [mariadb]
      bind-address=0.0.0.0
      default_storage_engine=InnoDB
      binlog_format=row
      innodb_autoinc_lock_mode=2
       
      # Cluster configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_cluster_name=mariadb-operator
      wsrep_slave_threads=1
       
      # Node configuration
      wsrep_node_address="mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_node_name="mariadb-galera-1"
      wsrep_sst_method="mariabackup"
      wsrep_sst_auth="<user>:<password>"
      

      mariadb-galera-2

      [mariadb]
      bind-address=0.0.0.0
      default_storage_engine=InnoDB
      binlog_format=row
      innodb_autoinc_lock_mode=2
       
      # Cluster configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_cluster_name=mariadb-operator
      wsrep_slave_threads=1
       
      # Node configuration
      wsrep_node_address="mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_node_name="mariadb-galera-2"
      wsrep_sst_method="mariabackup"
      wsrep_sst_auth="<user>:<password>"
      

      It is important to note that we are using DNS names in the cluster address, which are resolved to the containers' IPs; however, every time a container is restarted by Kubernetes, it gets a new IP assigned.

      Some log files are attached to this issue showing the crash that happened after a node went down and kept attempting to rejoin the cluster for a while. After ~30 minutes or so, it finally managed to join.

      I have also tried to use rsync instead of mariabackup, but it didn't help.

      Attachments

        Activity

          I was a bit curious about the InnoDB part of this and initially thought that there was something strange. But that part looks fine after all.

          In mariadb-galera-2-recovered.log we can see that wsrep_sst_method=mariabackup did invoke mariadb-backup --prepare, because InnoDB starts up with a dummy ib_logfile0:

          mariadb-11.2.2

          2024-01-31 10:29:35 0 [Note] InnoDB: End of log at LSN=1623012
          2024-01-31 10:29:35 0 [Note] InnoDB: Resizing redo log from 12.016KiB to 96.000MiB; LSN=1623012
          

          Starting with MDEV-14425, the normal log file size should be a multiple of 4096 bytes. Here, the file size seems to be 12288+16=12304 bytes. Starting with MDEV-27199, a log file is mandatory. Such dummy 12304-byte files are created at the end of mariadb-backup --prepare in order to pass a start LSN to the server.
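The size arithmetic above can be sanity-checked with a trivial sketch; the helper names below are made up for illustration and are not part of any MariaDB tool.

```shell
#!/bin/sh
# Illustrative helpers for the sizes discussed above. A normal ib_logfile0
# must be a multiple of 4096 bytes (MDEV-14425), while mariadb-backup
# --prepare leaves a 12288+16 = 12304-byte dummy file to pass a start LSN.
is_multiple_of_4096() {
    [ $(( $1 % 4096 )) -eq 0 ]
}
is_dummy_logfile_size() {
    [ "$1" -eq $(( 12288 + 16 )) ]
}
```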

          marko Marko Mäkelä added a comment -
          martin.montes Martin Montes added a comment -

          Thanks for taking a look, Marko. This might be an issue related to Galera rather than InnoDB; let me know if you need more debug info, in case you think it is worth digging deeper on the InnoDB side.

          This morning I've managed to reproduce the problem in MariaDB 10.6. I have attached the donor and joiner logs.

          mariadb-galera-2-joiner.log
          mariadb-galera-1-donor.log

          martin.montes Martin Montes added a comment - edited

          I've tried with a different test scenario:

          • Provision a Galera cluster in Kubernetes without liveness/readiness probes, so that Kubernetes does not interfere with the containers (no restarts and no traffic cut off when a container is considered unhealthy). The only way for a container to stop is returning a non-zero exit code.
          • After provisioning, create some data with sysbench to warm up the database.
          • Stop the sysbench tests.
          • Delete the node marked as primary by MaxScale (mariadb-galera-2).
          • Failover is effectively performed to another node (mariadb-galera-0).
          • mariadb-galera-2 comes back online and requests an SST from mariadb-galera-1.
          • The crash starts happening again:

          WSREP_SST: [ERROR] previous SST script still running. (20240205 09:57:30.706)
          2024-02-05  9:57:30 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address 'mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local' --datadir '/var/lib/mysql/' --parent 1 --progress 0
          	Read: '(null)'
          2024-02-05  9:57:30 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address 'mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local' --datadir '/var/lib/mysql/' --parent 1 --progress 0: 114 (Operation already in progress)
          2024-02-05  9:57:30 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
          2024-02-05  9:57:30 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
          2024-02-05  9:57:30 2 [Note] WSREP: ReplicatorSMM::abort()
          2024-02-05  9:57:30 2 [Note] WSREP: Closing send monitor...
          2024-02-05  9:57:30 2 [Note] WSREP: Closed send monitor.
          2024-02-05  9:57:30 2 [Note] WSREP: gcomm: terminating thread
          2024-02-05  9:57:30 2 [Note] WSREP: gcomm: joining thread
          2024-02-05  9:57:30 2 [Note] WSREP: gcomm: closing backend
          2024-02-05  9:57:30 2 [Note] WSREP: view(view_id(NON_PRIM,ae347157-9c57,13) memb {
          	f3b5e06d-a33c,0
          } joined {
          } left {
          } partitioned {
          	ae347157-9c57,0
          	b1ca18b3-8a25,0
          })
          2024-02-05  9:57:30 2 [Note] WSREP: PC protocol downgrade 1 -> 0
          2024-02-05  9:57:30 2 [Note] WSREP: view((empty))
          2024-02-05  9:57:30 2 [Note] WSREP: gcomm: closed
          2024-02-05  9:57:30 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
          2024-02-05  9:57:30 0 [Note] WSREP: Flow-control interval: [16, 16]
          2024-02-05  9:57:30 0 [Note] WSREP: Received NON-PRIMARY.
          2024-02-05  9:57:30 0 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 122675)
          2024-02-05  9:57:30 0 [Note] WSREP: New SELF-LEAVE.
          2024-02-05  9:57:30 0 [Note] WSREP: Flow-control interval: [0, 0]
          2024-02-05  9:57:30 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
          2024-02-05  9:57:30 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 122675)
          2024-02-05  9:57:30 0 [Note] WSREP: RECV thread exiting 0: Success
          2024-02-05  9:57:30 2 [Note] WSREP: recv_thread() joined.
          2024-02-05  9:57:30 2 [Note] WSREP: Closing replication queue.
          2024-02-05  9:57:30 2 [Note] WSREP: Closing slave action queue.
          2024-02-05  9:57:30 2 [Note] WSREP: mariadbd: Terminated.
          240205  9:57:30 [ERROR] mysqld got signal 11 ;
          Sorry, we probably made a mistake, and this is a bug.
           
          Your assistance in bug reporting will enable us to fix this for the next release.
          To report this bug, see https://mariadb.com/kb/en/reporting-bugs
           
          We will try our best to scrape up some info that will hopefully help
          diagnose the problem, but since we have already crashed, 
          something is definitely wrong and this may fail.
           
          Server version: 11.2.2-MariaDB-1:11.2.2+maria~ubu2204 source revision: 929532a9426d085111c24c63de9c23cc54382259
          key_buffer_size=0
          read_buffer_size=131072
          max_used_connections=0
          max_threads=153
          thread_count=3
          It is possible that mysqld could use up to 
          key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 337017 K  bytes of memory
          Hope that's ok; if not, decrease some variables in the equation.
           
          Thread pointer: 0x7f8758000c68
          Attempting backtrace. You can use the following information to find out
          where mysqld died. If you see no messages after this, something went
          terribly wrong...
          stack_bottom = 0x7f877c25ac68 thread_stack 0x49000
          Printing to addr2line failed
          mariadbd(my_print_stacktrace+0x32)[0x55d207c2e032]
          mariadbd(handle_fatal_signal+0x478)[0x55d207702158]
          /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f877ea24520]
          /lib/x86_64-linux-gnu/libc.so.6(abort+0x178)[0x7f877ea0a898]
          /usr/lib/galera/libgalera_smm.so(+0x157602)[0x7f877e4a6602]
          /usr/lib/galera/libgalera_smm.so(+0x700e1)[0x7f877e3bf0e1]
          /usr/lib/galera/libgalera_smm.so(+0x6cc94)[0x7f877e3bbc94]
          /usr/lib/galera/libgalera_smm.so(+0x8b311)[0x7f877e3da311]
          /usr/lib/galera/libgalera_smm.so(+0x604a0)[0x7f877e3af4a0]
          /usr/lib/galera/libgalera_smm.so(+0x48261)[0x7f877e397261]
          mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x12)[0x55d207cecb02]
          mariadbd(+0xd7ff31)[0x55d2079bff31]
          mariadbd(_Z15start_wsrep_THDPv+0x26b)[0x55d2079adcfb]
          mariadbd(+0xcf24c6)[0x55d2079324c6]
          /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f877ea76ac3]
          /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f877eb08850]
           
          Trying to get some variables.
          Some pointers may be invalid and cause the dump to abort.
          Query (0x0): (null)
          Connection ID (thread ID): 2
          Status: NOT_KILLED
           
          Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=on,cset_narrowing=off
           
          The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
          information that should help you find out what is causing the crash.
           
          We think the query pointer is invalid, but we will try to print it anyway. 
          Query: 
           
          Writing a core file...
          Working directory at /var/lib/mysql
          Resource Limits:
          Limit                     Soft Limit           Hard Limit           Units     
          Max cpu time              unlimited            unlimited            seconds   
          Max file size             unlimited            unlimited            bytes     
          Max data size             unlimited            unlimited            bytes     
          Max stack size            8388608              unlimited            bytes     
          Max core file size        0                    0                    bytes     
          Max resident set          unlimited            unlimited            bytes     
          Max processes             unlimited            unlimited            processes 
          Max open files            1048576              1048576              files     
          Max locked memory         8388608              8388608              bytes     
          Max address space         unlimited            unlimited            bytes     
          Max file locks            unlimited            unlimited            locks     
          Max pending signals       126293               126293               signals   
          Max msgqueue size         819200               819200               bytes     
          Max nice priority         0                    0                    
          Max realtime priority     0                    0                    
          Max realtime timeout      unlimited            unlimited            us        
          Core pattern: |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
           
          Kernel version: Linux version 6.5.0-15-generic (buildd@bos03-amd64-040) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2
          

          See node status in MaxScale after the crash:

          See the full logs here:
          mariadb-galera-0-no-k8s-probes.log mariadb-galera-1-no-k8s-probes.log mariadb-galera-2-no-k8s-probes.log

          martin.montes Martin Montes added a comment -

          I've tested the same scenario, but instead of providing DNS names in the Galera config file, I resolved the DNS names in the operator and provided the IPs:

          mariadb-galera-0

          [mariadb]
          bind-address=0.0.0.0
          default_storage_engine=InnoDB
          binlog_format=row
          innodb_autoinc_lock_mode=2
           
          # Cluster configuration
          wsrep_on=ON
          wsrep_provider=/usr/lib/galera/libgalera_smm.so
          wsrep_cluster_address="gcomm://10.244.0.157,10.244.0.48,10.244.0.153"
          wsrep_cluster_name=mariadb-operator
          wsrep_slave_threads=1
           
          # Node configuration
          wsrep_node_address="10.244.0.157"
          wsrep_node_name="mariadb-galera-0"
          wsrep_sst_method="mariabackup"
          wsrep_sst_auth="<user>:<password>"
          

          mariadb-galera-1

          [mariadb]
          bind-address=0.0.0.0
          default_storage_engine=InnoDB
          binlog_format=row
          innodb_autoinc_lock_mode=2
           
          # Cluster configuration
          wsrep_on=ON
          wsrep_provider=/usr/lib/galera/libgalera_smm.so
          wsrep_cluster_address="gcomm://10.244.0.36,10.244.0.48,10.244.0.31"
          wsrep_cluster_name=mariadb-operator
          wsrep_slave_threads=1
           
          # Node configuration
          wsrep_node_address="10.244.0.48"
          wsrep_node_name="mariadb-galera-1"
          wsrep_sst_method="mariabackup"
          wsrep_sst_auth="<user>:<password>"
          

          mariadb-galera-2

          [mariadb]
          bind-address=0.0.0.0
          default_storage_engine=InnoDB
          binlog_format=row
          innodb_autoinc_lock_mode=2
           
          # Cluster configuration
          wsrep_on=ON
          wsrep_provider=/usr/lib/galera/libgalera_smm.so
          wsrep_cluster_address="gcomm://10.244.0.157,10.244.0.48,10.244.0.159"
          wsrep_cluster_name=mariadb-operator
          wsrep_slave_threads=1
           
          # Node configuration
          wsrep_node_address="10.244.0.159"
          wsrep_node_name="mariadb-galera-2"
          wsrep_sst_method="mariabackup"
          wsrep_sst_auth="<user>:<password>"
          

          After killing the primary multiple times, even with sysbench tests running, I didn't manage to reproduce the issue. So the issue seems related to DNS resolution, either inside the MariaDB container or in the Galera library.

          martin.montes Martin Montes added a comment -

          Actually, specifying an IP only in wsrep_node_address also prevents the crash. It seems that we can still use DNS names in wsrep_cluster_address.
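For reference, the working combination on mariadb-galera-2 would look like the fragment below: DNS names retained in the cluster address, but the node address pinned to the pod IP. The IP shown is illustrative, taken from the earlier IP-based test.

```
# Cluster configuration: DNS names still work here
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"

# Node configuration: pinning the pod IP here avoids the crash
wsrep_node_address="10.244.0.159"
```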

          danblack Daniel Black added a comment -

          Galera 2: /var/lib/mysql/sst_in_progress is present at startup in the logs. There would not be any joiner still in progress from the earlier state, so the sst_in_progress marker is indeed stale.

          The locking mechanism here, based on the mere existence of the file, has no reliable cleanup.

          I would recommend something tied to process lifetime, like flock (an additional dependency on util-linux-core on RPM distros; important for RHEL 9, which is used in UBI container images).
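A process-tied lock of that kind could be sketched as follows. This is a hypothetical sketch, not the actual wsrep_sst script; the lock-file path, messages, and exit handling are illustrative.

```shell
#!/bin/sh
# Hypothetical sketch: guard the SST with flock(2) instead of the bare
# existence of /var/lib/mysql/sst_in_progress. The kernel drops the lock
# automatically when the holding process dies, so a crashed or SIGKILLed
# SST script cannot leave a stale lock behind.
LOCKFILE="${LOCKFILE:-/tmp/sst_in_progress.lock}"

# Open the lock file on fd 9 and try to take an exclusive lock without
# blocking; fail fast if another SST script still holds it.
exec 9>"$LOCKFILE"
if ! flock -n 9; then
    echo "previous SST script still running" >&2
    exit 114   # mirrors the server's "Operation already in progress"
fi

echo "lock acquired, SST can proceed"
# ... the actual state transfer would run here; the lock is released
# automatically when this process exits, cleanly or not.
```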


          People

            martin.montes Martin Montes
