  1. MariaDB Server
  2. MDEV-33349

Crash when a new node attempts to join the Galera cluster

Details

    Description

      The Kubernetes operator for MariaDB provisions MariaDB clusters by creating containers one at a time, waiting until the `wsrep_ready` status variable reports `ON` before creating the next. This ensures that only one node attempts to join the cluster at any given time.
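To make the provisioning order concrete, here is a minimal Python sketch of the "start one node, wait for `wsrep_ready`, then start the next" loop. It is not the operator's actual code. The names `wait_until_ready`, `provision_sequentially`, `start_node` and `get_status_for` are hypothetical; in a real setup the status would presumably come from `SHOW GLOBAL STATUS LIKE 'wsrep_ready'` on the node in question.

```python
import time
from typing import Callable, Dict, Iterable, List


def wait_until_ready(
    get_status: Callable[[], Dict[str, str]],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a node's status until wsrep_ready is ON or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        # get_status is a hypothetical callable returning status variables
        # as a dict, e.g. {"wsrep_ready": "ON"}.
        if get_status().get("wsrep_ready", "").upper() == "ON":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def provision_sequentially(
    nodes: Iterable[str],
    start_node: Callable[[str], None],
    get_status_for: Callable[[str], Dict[str, str]],
    **wait_kwargs,
) -> List[str]:
    """Start nodes one by one so that only one node is joining at a time."""
    joined: List[str] = []
    for node in nodes:
        start_node(node)
        if not wait_until_ready(lambda: get_status_for(node), **wait_kwargs):
            raise TimeoutError(f"{node} did not report wsrep_ready=ON in time")
        joined.append(node)
    return joined
```

Injecting `sleep` and `clock` keeps the loop easy to exercise without a running cluster.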

      When a new node joins the cluster, Galera performs a State Snapshot Transfer (SST): it chooses an existing node as donor and transfers its state to the new node in order to initialize it. We have seen this process fail intermittently, both when bootstrapping the cluster and when a node goes down. When it fails, Kubernetes restarts the unhealthy container and a new one is created, which means the SST is retried. This cycle repeats until the container reaches a healthy state and the node is part of the cluster.

      Here are the configuration files for each node:

      mariadb-galera-0

      [mariadb]
      bind-address=0.0.0.0
      default_storage_engine=InnoDB
      binlog_format=row
      innodb_autoinc_lock_mode=2
       
      # Cluster configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_cluster_name=mariadb-operator
      wsrep_slave_threads=1
       
      # Node configuration
      wsrep_node_address="mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_node_name="mariadb-galera-0"
      wsrep_sst_method="mariabackup"
      wsrep_sst_auth="<user>:<password>"
      

      mariadb-galera-1

      [mariadb]
      bind-address=0.0.0.0
      default_storage_engine=InnoDB
      binlog_format=row
      innodb_autoinc_lock_mode=2
       
      # Cluster configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_cluster_name=mariadb-operator
      wsrep_slave_threads=1
       
      # Node configuration
      wsrep_node_address="mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_node_name="mariadb-galera-1"
      wsrep_sst_method="mariabackup"
      wsrep_sst_auth="<user>:<password>"
      

      mariadb-galera-2

      [mariadb]
      bind-address=0.0.0.0
      default_storage_engine=InnoDB
      binlog_format=row
      innodb_autoinc_lock_mode=2
       
      # Cluster configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_cluster_name=mariadb-operator
      wsrep_slave_threads=1
       
      # Node configuration
      wsrep_node_address="mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
      wsrep_node_name="mariadb-galera-2"
      wsrep_sst_method="mariabackup"
      wsrep_sst_auth="<user>:<password>"
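The three files above differ only in `wsrep_node_address` and `wsrep_node_name`. As an illustration, the Python sketch below renders them from a single template. The helper names `peer_fqdn` and `render_config` are hypothetical, and the defaults (StatefulSet-style pod names, the `mariadb-galera-internal` service, the `default` namespace) are read off the configuration shown here, not taken from the operator's source.

```python
# Settings shared by all nodes, copied from the configuration in this report.
TEMPLATE = """[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2

# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="{cluster_address}"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1

# Node configuration
wsrep_node_address="{node_address}"
wsrep_node_name="{node_name}"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"
"""


def peer_fqdn(index: int, name: str = "mariadb-galera",
              service: str = "mariadb-galera-internal",
              namespace: str = "default") -> str:
    """Build the DNS name of pod `index` (assumed naming scheme)."""
    return f"{name}-{index}.{service}.{namespace}.svc.cluster.local"


def render_config(index: int, replicas: int = 3, name: str = "mariadb-galera") -> str:
    """Render the config for one node; only the node-specific lines vary."""
    peers = ",".join(peer_fqdn(i, name) for i in range(replicas))
    return TEMPLATE.format(
        cluster_address=f"gcomm://{peers}",
        node_address=peer_fqdn(index, name),
        node_name=f"{name}-{index}",
    )
```

Calling `render_config(0)`, `render_config(1)` and `render_config(2)` should yield the same settings as the three files listed above.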
      

      Note that the cluster address uses DNS names, which resolve to the IP addresses of the containers. Every time Kubernetes restarts a container, it is assigned a new IP.
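One way to observe the IP churn while reproducing the problem is to resolve the peer names periodically and compare the results with the previous ones. The Python sketch below does that using the standard `socket.getaddrinfo` call. It is a diagnostic aid under the assumption that the names resolve to IPv4 addresses; it does not describe how Galera or the operator handle address changes. The function names are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable, Set


def resolve_ipv4(hostname: str, getaddrinfo=socket.getaddrinfo) -> Set[str]:
    """Return the IPv4 addresses a name currently resolves to (empty on failure)."""
    try:
        infos = getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name may be briefly unresolvable while a pod is restarting.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def changed_peers(
    previous: Dict[str, Set[str]],
    peers: Iterable[str],
    resolve: Callable[[str], Set[str]] = resolve_ipv4,
) -> Dict[str, Set[str]]:
    """Return the peers whose resolved addresses differ from `previous`."""
    changes: Dict[str, Set[str]] = {}
    for host in peers:
        current = resolve(host)
        if current != previous.get(host, set()):
            changes[host] = current
    return changes
```

Logging the output of `changed_peers` next to the server logs could help correlate failed join attempts with a peer's address changing.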

      The log files attached to this issue show the crash occurring after a node went down and kept attempting to rejoin the cluster. After roughly 30 minutes, it finally managed to join.

      I also tried rsync instead of mariabackup as the SST method, but it did not help.

      Attachments

        1. mariadb-galera-0-no-k8s-probes.log
          71 kB
          Martin Montes
        2. mariadb-galera-1-donor.log
          2 kB
          Martin Montes
        3. mariadb-galera-1-no-k8s-probes.log
          72 kB
          Martin Montes
        4. mariadb-galera-2.log
          18 kB
          Martin Montes
        5. mariadb-galera-2-joiner.log
          0.8 kB
          Martin Montes
        6. mariadb-galera-2-no-k8s-probes.log
          18 kB
          Martin Montes
        7. mariadb-galera-2-recovered.log
          22 kB
          Martin Montes
        8. Screenshot from 2024-02-05 11-07-25.png
          103 kB
          Martin Montes

          People

            martin.montes Martin Montes
            martin.montes Martin Montes