MDEV-30888: Node restarting causes cluster to crash


    Description

      NOTE: We are filing this to fix our issue and to understand it better, or to find out whether we are doing something wrong. Thanks for the help, and apologies if this issue is badly written; it is our first time on Jira.

      This seems to be similar to, if not the same as, the following issues (but with a bigger cluster):
      https://github.com/codership/galera/issues/623
      https://github.com/codership/galera/issues/410

      The issue seems to recur only after a non-clean shutdown (i.e. the VM being shut down by killing the process, losing power, etc.).

      Recently we had a couple of problems with our Galera cluster. We added a 3rd region with 3 more nodes (we used to have 3 nodes in 2 regions, and 1 garbd in one of those regions).

      A few days ago the compute host that one of the VMs was on crashed. When the node came back up, it crashed the cluster with SST problems, which left the cluster down and read-only and needing to be bootstrapped.
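      For context, when this happens we have to bootstrap the cluster again; a rough sketch of the standard procedure we fall back to (paths are the defaults and may differ, and the most advanced node has to be picked by hand):
      ```
      # On every node, look for the most advanced one (highest seqno / safe_to_bootstrap).
      cat /var/lib/mysql/grastate.dat

      # On the chosen node only, mark it safe to bootstrap and start a new cluster.
      sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
      galera_new_cluster

      # Then start MariaDB normally on the remaining nodes, one at a time.
      systemctl start mariadb
      ```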

      We are using:
      Galera 26.4.4
      MariaDB 10.4.13

      The configuration is as follows and is the same on all nodes (only the ist.recv_bind IP and wsrep_node_address differ per node; an illustrative example of those differences is sketched after the block).

      my.cnf:
      ```
      [galera]
      wsrep_on=ON
      wsrep_cluster_name="powerdns"
      binlog_format=ROW
      default_storage_engine=InnoDB
      innodb_autoinc_lock_mode=2
      innodb_doublewrite=1
      query_cache_size=0
      wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
      wsrep_cluster_address=gcomm://<9 ips of nodes>
      wsrep_notify_cmd=/usr/bin/get-status.sh

      wsrep_provider_options="gmcast.segment=<segment>; ist.recv_bind=<ip>; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
      wsrep_dirty_reads=ON
      wsrep-sync-wait=0
      wsrep_node_address="<node_ip>"

      [mysqld]
      ssl-ca = /etc/ssl/mysql/ca-cert.pem
      ssl-key = /etc/ssl/mysql/server-key.pem
      ssl-cert = /etc/ssl/mysql/server-cert.pem

      [client]
      ssl-ca = /etc/ssl/mysql/ca-cert.pem
      ssl-key = /etc/ssl/mysql/client-key.pem
      ssl-cert = /etc/ssl/mysql/client-cert.pem
      ```
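      For illustration only, the per-node differences look roughly like this (the address and segment number below are placeholders, not our real values):
      ```
      # e.g. on a node in region 2:
      wsrep_node_address="10.2.0.11"
      wsrep_provider_options="gmcast.segment=2; ist.recv_bind=10.2.0.11; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
      ```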

      The logs we see on the nodes that cause the crashes (the JOINER nodes):
      ```
      WSREP: Member 7.1 (db-<region-1>1) request state transfer from 'any'. Selected 6.1 (db<region-1>-2)(SYNCED) as donor.
      WSREP: Shifting PRIMARY -> JOINER (TO: 59319)
      WSREP: Requesting state transfer: success, donor: 6
      WSREP: forgetting f46bc950-abe6 (ssl://<ip>:4567)
      version= 6,
      component = PRIMARY,
      conf_id = 75
      members = 6/7 (joined/total),
      act_id = 59324
      last_appl. = 59214
      protocols = 2/10/4 (gcs/repl/appl),
      [Warning] WSREP: Donor f46bc950-9d7f-11ed-abe6-57fe7b2de322 is no longer in the group. State transfer cannot be completed, need to abort. Aborting
      WSREP: /usr/bin/mysql: Terminated
      systemd: mariadb.service: main process exited, code=killed, status=6/ABRT
      mysqld: Terminated
      WSREP_SST: [INFO] Joiner cleanup. rsync PID: 4389
      rsyncd[4389]: sent 0 bytes received 0 bytes total size 0
      mysql: WSREP_SST: [INFO] Joiner cleanup done.
      Failed to start MariaDB 10.4.13
      ```

      The logs we see on the DONOR node:
      ```
      WSREP: Member 7.1 (db-<region-1>1) request state transfer from 'any'. Selected 6.1 (db<region-1>-2)(SYNCED) as donor.
      Shifting SYNCED -> DONOR/DESYNCED (TO: 59319)
      WSREP: Detected STR version: 1, req_len: 120, req: STRv1
      Cert index preload: 59215 -> 59319
      IST sender using ssl
      [ERROR] WSREP: Failed to process action STATE_REQUEST, g:59319, l:5187, ptr:0x7f6322974e78, size: 120: IST sender, failed to connect 'ssl://<server_ip>:4568': connect: No route to host: 113 (No route to host)
      ```
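      If it helps with the diagnosis: the "No route to host" above is the donor's IST sender failing to reach the joiner on port 4568. A quick connectivity check we can run from the donor towards the joiner (a sketch; 4567 is group communication, 4568 is IST, 4444 is SST, and all of them need to be reachable between every pair of nodes):
      ```
      # Run from the donor; <joiner_ip> is a placeholder.
      nc -zv <joiner_ip> 4567
      nc -zv <joiner_ip> 4568
      nc -zv <joiner_ip> 4444
      ```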

      After that, the joining node went through the "line" of DONORS one by one, crashing each of them, until it reached one that it did not crash (the one we bootstrapped from).

      The second time (after the joining node restarts) we see normal logs up until the following line:
      `[Warning] WSREP: Donor <id> is no longer in the group. State transfer cannot be completed, need to abort. Aborting...`
      This seems to be because the connecting node caused that donor to crash; we then see the same log for every other node it crashes.
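      (In case it is useful for diagnosis, the cluster state can be checked on each node that is still up with the standard wsrep status variables, nothing custom:)
      ```
      mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
                SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
                SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
      ```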

      This has already happened to us twice and causes a lot of problems and downtime. What is the cause of this, and why does it only happen sometimes?

      Why does the node sometimes succeed and manage to sync, and other times go through the nodes one by one and crash them?
      Thank you.
