Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34561

MariaDB Galera Cluster State Transfer Fails on Specific Hardware Configurations on AlmaLinux 9.4

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.11.6, 10.11.7, 10.11.8
    • None
    • Configuration, Galera SST

    Description

      Issue:
      Setting up a MariaDB Galera cluster with 3+ nodes. The first node starts successfully with the new_cluster option and is operational. However, when attempting to add a second node, the process fails consistently with the following errors:

      Error Logs:

      WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20240708 20:03:10.560)
      WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20240708 20:03:10.568)
      WSREP_SST: [ERROR] Removing /tmp/tmp.9nWF3ySGO6/xtrabackup_galera_info file due to signal (20240708 20:03:10.577)
      WSREP_SST: [INFO] Removing the sst_in_progress file (20240708 20:03:10.585)
      WSREP_SST: [INFO] Cleaning up temporary directories (20240708 20:03:10.597)
      2024-07-08 20:04:39 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/' 
               --parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657
                 Read: '(null)'
      2024-07-08 20:04:39 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/' 
               --parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657: 32 (Broken pipe)
      2024-07-08 20:04:39 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
      2024-07-08 20:04:39 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
      

      Configuration Information:

      The configurations for all nodes are identical, except for node-specific addresses and names.

      Configuration File Snippet:

      [galera]
      #wasrep settings
      wsrep_on = ON
      wsrep_provider=/usr/lib64/galera/libgalera_smm.so
      wsrep_cluster_address=gcomm://10.1.11.240
      wsrep_cluster_name=Dobby
      wsrep_node_address=10.1.11.240
      wsrep_node_name=node-240
      wsrep_sst_method=mariabackup
      wsrep_sst_auth=xxxx:xxxxx
      wsrep_slave_threads=4
      default_storage_engine=InnoDB
      innodb_autoinc_lock_mode=2
      innodb_flush_log_at_trx_commit=0
      bind-address=0.0.0.0
      innodb_doublewrite=1
      wsrep_provider_options="...socket.ssl_key=/Certs/mariadb/server-key.pem;socket.ssl_cert=/Certs/mariadb/server-cert.pem;
           socket.ssl_ca=/Certs/mariadb/ca-cert.pem;evs.keepalive_period=PT1S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;
      	 evs.join_retrans_period=PT1S;evs.delay_margin=PT2S;gcache.size=2G;gcache.page_size=8M;gcache.keep_pages_size=33M;gcs.fc_factor=0.8;evs.send_window=256;evs.user_send_window=128"
      #wsrep_log_conflicts=ON
      wsrep_retry_autocommit=4
      #wsrep_debug=ON
      wsrep_notify_cmd=/scripts/nodeStatusChange.sh
      wsrep_auto_increment_control=ON
       
      [sst]
      encrypt = 3
      tca = /Certs/mariadb/ca-cert.pem
      tkey = /Certs/mariadb/server-key.pem
      tcert = /Certs/mariadb/server-cert.pem
      

      Observations:
      1. The trigger to start the backup on the donor node is missing in the non-working environment.

      It seems either mariadbd or mariabackup is supposed to send this message from the joiner node:

      2024-07-09 19:25:30 0 [Note] WSREP: Member 1.0 (us-10-152) requested state transfer from 'any'. Selected 0.0 (us-10-170)(SYNCED) as donor.

      2. The environment is on VMware VMs hosted on Dell 7xx and 4xx servers (ESXi). The 7xx series-based hardware/VMs can act as both Donor and Joiner. The 4xx series can act as Donor but as a Joiner, they are not sending the trigger for WSREP.

      Donor Log Snippet:

      2024-07-08 19:58:08 0 [Note] WSREP: (05b79a6d-b75b, 'ssl://0.0.0.0:4567') connection established to 65e318e6-b781 ssl://10.1.11.242:4567
      2024-07-08 19:58:08 0 [Note] WSREP: (05b79a6d-b75b, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
      2024-07-08 19:58:09 0 [Note] WSREP: declaring 65e318e6-b781 at ssl://10.1.11.242:4567 stable
      2024-07-08 19:58:09 0 [Note] WSREP: Node 05b79a6d-b75b state prim
      2024-07-08 19:58:09 0 [Note] WSREP: view(view_id(PRIM,05b79a6d-b75b,4) memb {
              05b79a6d-b75b,0
              65e318e6-b781,0
      } joined {
      } left {
      } partitioned {
      })
      2024-07-08 19:58:09 0 [Note] WSREP: save pc into disk
      2024-07-08 19:58:09 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
      2024-07-08 19:58:09 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c
      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c
      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 0 (node-240)
      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 1 (node-242)
      2024-07-08 19:58:09 0 [Note] WSREP: Quorum results:
              version    = 6,
              component  = PRIMARY,
              conf_id    = 3,
              members    = 1/2 (joined/total),
              act_id     = 210,
              last_appl. = 130,
              protocols  = 2/10/4 (gcs/repl/appl),
              vote policy= 0,
              group UUID = 05b85e89-3d60-11ef-a7dc-7fe580428470
      2024-07-08 19:58:09 0 [Note] WSREP: Flow-control interval: [28, 35]
      2024-07-08 19:58:09 10 [Note] WSREP: ####### processing CC 211, local, ordered
      2024-07-08 19:58:09 10 [Note] WSREP: ####### My UUID: 05b79a6d-3d60-11ef-b75b-83b55130a66a
      2024-07-08 19:58:09 10 [Note] WSREP: Skipping cert index reset
      2024-07-08 19:58:09 10 [Note] WSREP: REPL Protocols: 10 (5)
      2024-07-08 19:58:09 10 [Note] WSREP: ####### Adjusting cert position: 210 -> 211
      2024-07-08 19:58:09 0 [Note] WSREP: Service thread queue flushed.
      2024-07-08 19:58:09 10 [Note] WSREP: ================================================
      View:
        id: 05b85e89-3d60-11ef-a7dc-7fe580428470:211
        status: primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(2):
              0: 05b79a6d-3d60-11ef-b75b-83b55130a66a, node-240
              1: 65e318e6-3d64-11ef-b781-3b1e31d3c126, node-242
      =================================================
      2024-07-08 19:58:09 10 [Note] WSREP: Lowest cert index boundary for CC from group: 131
      2024-07-08 19:58:09 10 [Note] WSREP: Min available from gcache for CC from group: 1
      2024-07-08 19:58:11 0 [Note] WSREP: (05b79a6d-b75b, 'ssl://0.0.0.0:4567') turning message relay requesting off
      2024-07-08 19:59:27 33 [Note] WSREP: Starting applier thread 33
      2024-07-08 19:59:27 34 [Note] WSREP: Starting applier thread 34
      2024-07-08 19:59:27 32 [Note] WSREP: Starting applier thread 32
      2024-07-08 19:59:27 20 [Note] WSREP: Flow-control interval: [46, 57]
      2024-07-08 20:04:40 0 [Note] WSREP: forgetting 65e318e6-b781 (ssl://10.1.11.242:4567)
      2024-07-08 20:04:40 0 [Note] WSREP: Node 05b79a6d-b75b state prim
      2024-07-08 20:04:40 0 [Note] WSREP: view(view_id(PRIM,05b79a6d-b75b,5) memb {
              05b79a6d-b75b,0
      } joined {
      } left {
      } partitioned {
              65e318e6-b781,0
      })
      2024-07-08 20:04:40 0 [Note] WSREP: save pc into disk
      2024-07-08 20:04:40 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
      2024-07-08 20:04:40 0 [Note] WSREP: forgetting 65e318e6-b781 (ssl://10.1.11.242:4567)
      2024-07-08 20:04:40 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4f537133-3d65-11ef-9fc2-b654173a47ac
      2024-07-08 20:04:40 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 4f537133-3d65-11ef-9fc2-b654173a47ac
      2024-07-08 20:04:40 0 [Note] WSREP: STATE EXCHANGE: got state msg: 4f537133-3d65-11ef-9fc2-b654173a47ac from 0 (node-240)
      2024-07-08 20:04:40 0 [Note] WSREP: Quorum results:
              version    = 6,
              component  = PRIMARY,
              conf_id    = 4,
              members    = 1/1 (joined/total),
              act_id     = 271,
              last_appl. = 259,
              protocols  = 2/10/4 (gcs/repl/appl),
              vote policy= 0,
              group UUID = 05b85e89-3d60-11ef-a7dc-7fe580428470
      2024-07-08 20:04:40 0 [Note] WSREP: Flow-control interval: [32, 40]
      2024-07-08 20:04:40 2 [Note] WSREP: ####### processing CC 272, local, ordered
      2024-07-08 20:04:40 2 [Note] WSREP: ####### My UUID: 05b79a6d-3d60-11ef-b75b-83b55130a66a
      2024-07-08 20:04:40 2 [Note] WSREP: Skipping cert index reset
      2024-07-08 20:04:40 2 [Note] WSREP: REPL Protocols: 10 (5)
      2024-07-08 20:04:40 2 [Note] WSREP: ####### Adjusting cert position: 271 -> 272
      2024-07-08 20:04:40 0 [Note] WSREP: Service thread queue flushed.
      2024-07-08 20:04:40 2 [Note] WSREP: ================================================
      View:
        id: 05b85e89-3d60-11ef-a7dc-7fe580428470:272
        status: primary
        protocol_version: 4
        capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
        final: no
        own_index: 0
        members(1):
              0: 05b79a6d-3d60-11ef-b75b-83b55130a66a, node-240
      =================================================
      2024-07-08 20:04:40 2 [Note] WSREP: Lowest cert index boundary for CC from group: 260
      2024-07-08 20:04:40 2 [Note] WSREP: Min available from gcache for CC from group: 1
      2024-07-08 20:04:45 0 [Note] WSREP:  cleaning up 65e318e6-b781 (ssl://10.1.11.242:4567)
      

      Joiner Log Snippet:

      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c
      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 0 (node-240)
      2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 1 (node-242)
      2024-07-08 19:58:09 0 [Note] WSREP: Quorum results:
              version    = 6,
              component  = PRIMARY,
              conf_id    = 3,
              members    = 1/2 (joined/total),
              act_id     = 210,
              last_appl. = 130,
              protocols  = 2/10/4 (gcs/repl/appl),
              vote policy= 0,
              group UUID = 05b85e89-3d60-11ef-a7dc-7fe580428470
      2024-07-08 19:58:09 0 [Note] WSREP: Flow-control interval: [18, 23]
      2024-07-08 19:58:09 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 211)
      2024-07-08 19:58:09 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 211)
      2024-07-08 19:58:09 2 [Note] WSREP: ####### processing CC 211, local, ordered
      2024-07-08 19:58:09 2 [Note] WSREP: Process first view: 05b85e89-3d60-11ef-a7dc-7fe580428470 my uuid: 65e318e6-3d64-11ef-b781-3b1e31d3c126
      2024-07-08 19:58:09 2 [Note] WSREP: Server node-242 connected to cluster at position 05b85e89-3d60-11ef-a7dc-7fe580428470:211 with ID 65e318e6-3d64-11ef-b781-3b1e31d3c126
      2024-07-08 19:58:09 2 [Note] WSREP: Server status change disconnected -> connected
      2024-07-08 19:58:09 2 [Note] WSREP: ####### My UUID: 65e318e6-3d64-11ef-b781-3b1e31d3c126
      2024-07-08 19:58:09 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
      2024-07-08 19:58:09 0 [Note] WSREP: Service thread queue flushed.
      2024-07-08 19:58:09 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
      2024-07-08 19:58:09 2 [Note] WSREP: State transfer required:
              Group state: 05b85e89-3d60-11ef-a7dc-7fe580428470:211
              Local state: 00000000-0000-0000-0000-000000000000:-1
      2024-07-08 19:58:09 2 [Note] WSREP: Server status change connected -> joiner
      2024-07-08 19:58:09 0 [Note] WSREP: Joiner monitor thread started to monitor
      2024-07-08 19:58:09 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/' --parent 18883 --progress 0 
      	--binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657'
      WSREP_SST: [INFO] mariabackup SST started on joiner (20240708 19:58:09.781)
      WSREP_SST: [INFO] SSL configuration: CA='/certs/mariadb/ca-cert.pem', CAPATH='', CERT='/Certs/mariadb/server-cert.pem', KEY='/Certs/mariadb/server-key.pem', MODE='DISABLED', 
      			encrypt='3' (20240708 19:58:09.995)
      WSREP_SST: [INFO] Progress reporting tool pv not found in path: /usr//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/bin:/sbin:/bin (20240708 19:58:10.246)
      WSREP_SST: [INFO] Disabling all progress/rate-limiting (20240708 19:58:10.251)
      WSREP_SST: [INFO] Moving '/var/lib/mysql/mariabackup.backup.log' to '//mariadb/mariabackup.backup.log.2024.07.08-19.58.10.282426237' (20240708 19:58:10.289)
      WSREP_SST: [INFO] Streaming with mbstream (20240708 19:58:10.314)
      WSREP_SST: [INFO] Using socat as streamer (20240708 19:58:10.318)
      WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20240708 19:58:10.341)
      WSREP_SST: [INFO] Decrypting with cert='/Certs/mariadb/server-cert.pem', key='/Certs/mariadb/server-key.pem', ca='/Certs/mariadb/ca-cert.pem', capath='' (20240708 19:58:10.473)
      WSREP_SST: [INFO] Evaluating timeout -k 310 300 socat -u openssl-listen:4444,reuseaddr,cert='/Certs/mariadb/server-cert.pem',key='/Certs/mariadb/server-key.pem',
      			cafile='/Certs/mariadb/ca-cert.pem',commonname='' stdio | '/usr//bin/mbstream' -x; RC=( ${PIPESTATUS[@]} ) (20240708 19:58:10.547)
      2024-07-08 19:58:12 0 [Note] WSREP: (65e318e6-b781, 'ssl://0.0.0.0:4567') turning message relay requesting off
      WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20240708 20:03:10.560)
      WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20240708 20:03:10.568)
      WSREP_SST: [ERROR] Removing /tmp/tmp.9nWF3ySGO6/xtrabackup_galera_info file due to signal (20240708 20:03:10.577)
      WSREP_SST: [INFO] Removing the sst_in_progress file (20240708 20:03:10.585)
      WSREP_SST: [INFO] Cleaning up temporary directories (20240708 20:03:10.597)
      2024-07-08 20:04:39 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/' 
      		--parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657
              Read: '(null)'
      2024-07-08 20:04:39 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/' 
      		--parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657: 32 (Broken pipe)
      2024-07-08 20:04:39 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
      2024-07-08 20:04:39 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
      2024-07-08 20:04:39 2 [Note] WSREP: ReplicatorSMM::abort()
      2024-07-08 20:04:39 2 [Note] WSREP: Closing send monitor...
      2024-07-08 20:04:39 2 [Note] WSREP: Closed send monitor.
      2024-07-08 20:04:39 2 [Note] WSREP: gcomm: terminating thread
      2024-07-08 20:04:39 2 [Note] WSREP: gcomm: joining thread
      2024-07-08 20:04:39 2 [Note] WSREP: gcomm: closing backend
      2024-07-08 20:04:40 2 [Note] WSREP: view(view_id(NON_PRIM,05b79a6d-b75b,4) memb {
              65e318e6-b781,0
      } joined {
      } left {
      } partitioned {
              05b79a6d-b75b,0
      })
      2024-07-08 20:04:40 2 [Note] WSREP: PC protocol downgrade 1 -> 0
      2024-07-08 20:04:40 2 [Note] WSREP: view((empty))
      2024-07-08 20:04:40 2 [Note] WSREP: gcomm: closed
      2024-07-08 20:04:40 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2024-07-08 20:04:40 0 [Note] WSREP: Flow-control interval: [13, 16]
      2024-07-08 20:04:40 0 [Note] WSREP: Received NON-PRIMARY.
      2024-07-08 20:04:40 0 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 271)
      2024-07-08 20:04:40 0 [Note] WSREP: New SELF-LEAVE.
      2024-07-08 20:04:40 0 [Note] WSREP: Flow-control interval: [0, 0]
      2024-07-08 20:04:40 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
      2024-07-08 20:04:40 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 271)
      2024-07-08 20:04:40 0 [Note] WSREP: RECV thread exiting 0: Success
      2024-07-08 20:04:40 2 [Note] WSREP: recv_thread() joined.
      2024-07-08 20:04:40 2 [Note] WSREP: Closing replication queue.
      2024-07-08 20:04:40 2 [Note] WSREP: Closing slave action queue.
      2024-07-08 20:04:40 2 [Note] WSREP: /usr/libexec/mariadbd: Terminated.
      

      I am also attaching the CPU flag differences between the hardware, as I observed that recent MariaDB builds are compiled with CPU optimizations. Specifically, it crashes if the fma flag is missing. This information might be relevant to this issue.

      Request:
      I am seeking assistance to determine the root cause of this issue. Any suggestions on additional logs or configurations to check would be greatly appreciated.

      Attachments

        1. CPU_Flag_Differences_between_hardwares.png
          CPU_Flag_Differences_between_hardwares.png
          84 kB
        2. Config.txt
          1 kB
        3. DonorLog.txt
          6 kB
        4. JoinerLog.txt
          7 kB
        5. Problem.txt
          2 kB

        Activity

          People

            Unassigned Unassigned
            vire vijay r
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.