Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.11.6, 10.11.7, 10.11.8
-
None
Description
Issue:
Setting up a MariaDB Galera cluster with 3+ nodes. The first node starts successfully with the new_cluster option and is operational. However, when attempting to add a second node, the process fails consistently with the following errors:
Error Logs:
WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20240708 20:03:10.560)
|
WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20240708 20:03:10.568)
|
WSREP_SST: [ERROR] Removing /tmp/tmp.9nWF3ySGO6/xtrabackup_galera_info file due to signal (20240708 20:03:10.577)
|
WSREP_SST: [INFO] Removing the sst_in_progress file (20240708 20:03:10.585)
|
WSREP_SST: [INFO] Cleaning up temporary directories (20240708 20:03:10.597)
|
2024-07-08 20:04:39 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/'
|
--parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657
|
Read: '(null)'
|
2024-07-08 20:04:39 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/'
|
--parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657: 32 (Broken pipe)
|
2024-07-08 20:04:39 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
|
2024-07-08 20:04:39 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
|
Configuration Information:
The configurations for all nodes are identical, except for node-specific addresses and names.
Configuration File Snippet:
[galera]
|
#wasrep settings
|
wsrep_on = ON
|
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
|
wsrep_cluster_address=gcomm://10.1.11.240
|
wsrep_cluster_name=Dobby
|
wsrep_node_address=10.1.11.240
|
wsrep_node_name=node-240
|
wsrep_sst_method=mariabackup
|
wsrep_sst_auth=xxxx:xxxxx
|
wsrep_slave_threads=4
|
default_storage_engine=InnoDB
|
innodb_autoinc_lock_mode=2
|
innodb_flush_log_at_trx_commit=0
|
bind-address=0.0.0.0
|
innodb_doublewrite=1
|
wsrep_provider_options="...socket.ssl_key=/Certs/mariadb/server-key.pem;socket.ssl_cert=/Certs/mariadb/server-cert.pem;
|
socket.ssl_ca=/Certs/mariadb/ca-cert.pem;evs.keepalive_period=PT1S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;
|
evs.join_retrans_period=PT1S;evs.delay_margin=PT2S;gcache.size=2G;gcache.page_size=8M;gcache.keep_pages_size=33M;gcs.fc_factor=0.8;evs.send_window=256;evs.user_send_window=128"
|
#wsrep_log_conflicts=ON
|
wsrep_retry_autocommit=4
|
#wsrep_debug=ON
|
wsrep_notify_cmd=/scripts/nodeStatusChange.sh
|
wsrep_auto_increment_control=ON
|
 |
[sst]
|
encrypt = 3
|
tca = /Certs/mariadb/ca-cert.pem
|
tkey = /Certs/mariadb/server-key.pem
|
tcert = /Certs/mariadb/server-cert.pem
|
Observations:
1. The trigger to start the backup on the donor node is missing in the non-working environment.
It seems either mariadbd or mariabackup is supposed to send this message from the joiner node:
2024-07-09 19:25:30 0 [Note] WSREP: Member 1.0 (us-10-152) requested state transfer from 'any'. Selected 0.0 (us-10-170)(SYNCED) as donor.
2. The environment is on VMware VMs hosted on Dell 7xx and 4xx servers (ESXi). The 7xx series-based hardware/VMs can act as both Donor and Joiner. The 4xx series can act as Donor but as a Joiner, they are not sending the trigger for WSREP.
Donor Log Snippet:
2024-07-08 19:58:08 0 [Note] WSREP: (05b79a6d-b75b, 'ssl://0.0.0.0:4567') connection established to 65e318e6-b781 ssl://10.1.11.242:4567
|
2024-07-08 19:58:08 0 [Note] WSREP: (05b79a6d-b75b, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
|
2024-07-08 19:58:09 0 [Note] WSREP: declaring 65e318e6-b781 at ssl://10.1.11.242:4567 stable
|
2024-07-08 19:58:09 0 [Note] WSREP: Node 05b79a6d-b75b state prim
|
2024-07-08 19:58:09 0 [Note] WSREP: view(view_id(PRIM,05b79a6d-b75b,4) memb {
|
05b79a6d-b75b,0
|
65e318e6-b781,0
|
} joined {
|
} left {
|
} partitioned {
|
})
|
2024-07-08 19:58:09 0 [Note] WSREP: save pc into disk
|
2024-07-08 19:58:09 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 0 (node-240)
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 1 (node-242)
|
2024-07-08 19:58:09 0 [Note] WSREP: Quorum results:
|
version = 6,
|
component = PRIMARY,
|
conf_id = 3,
|
members = 1/2 (joined/total),
|
act_id = 210,
|
last_appl. = 130,
|
protocols = 2/10/4 (gcs/repl/appl),
|
vote policy= 0,
|
group UUID = 05b85e89-3d60-11ef-a7dc-7fe580428470
|
2024-07-08 19:58:09 0 [Note] WSREP: Flow-control interval: [28, 35]
|
2024-07-08 19:58:09 10 [Note] WSREP: ####### processing CC 211, local, ordered
|
2024-07-08 19:58:09 10 [Note] WSREP: ####### My UUID: 05b79a6d-3d60-11ef-b75b-83b55130a66a
|
2024-07-08 19:58:09 10 [Note] WSREP: Skipping cert index reset
|
2024-07-08 19:58:09 10 [Note] WSREP: REPL Protocols: 10 (5)
|
2024-07-08 19:58:09 10 [Note] WSREP: ####### Adjusting cert position: 210 -> 211
|
2024-07-08 19:58:09 0 [Note] WSREP: Service thread queue flushed.
|
2024-07-08 19:58:09 10 [Note] WSREP: ================================================
|
View:
|
id: 05b85e89-3d60-11ef-a7dc-7fe580428470:211
|
status: primary
|
protocol_version: 4
|
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
|
final: no
|
own_index: 0
|
members(2):
|
0: 05b79a6d-3d60-11ef-b75b-83b55130a66a, node-240
|
1: 65e318e6-3d64-11ef-b781-3b1e31d3c126, node-242
|
=================================================
|
2024-07-08 19:58:09 10 [Note] WSREP: Lowest cert index boundary for CC from group: 131
|
2024-07-08 19:58:09 10 [Note] WSREP: Min available from gcache for CC from group: 1
|
2024-07-08 19:58:11 0 [Note] WSREP: (05b79a6d-b75b, 'ssl://0.0.0.0:4567') turning message relay requesting off
|
2024-07-08 19:59:27 33 [Note] WSREP: Starting applier thread 33
|
2024-07-08 19:59:27 34 [Note] WSREP: Starting applier thread 34
|
2024-07-08 19:59:27 32 [Note] WSREP: Starting applier thread 32
|
2024-07-08 19:59:27 20 [Note] WSREP: Flow-control interval: [46, 57]
|
2024-07-08 20:04:40 0 [Note] WSREP: forgetting 65e318e6-b781 (ssl://10.1.11.242:4567)
|
2024-07-08 20:04:40 0 [Note] WSREP: Node 05b79a6d-b75b state prim
|
2024-07-08 20:04:40 0 [Note] WSREP: view(view_id(PRIM,05b79a6d-b75b,5) memb {
|
05b79a6d-b75b,0
|
} joined {
|
} left {
|
} partitioned {
|
65e318e6-b781,0
|
})
|
2024-07-08 20:04:40 0 [Note] WSREP: save pc into disk
|
2024-07-08 20:04:40 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
|
2024-07-08 20:04:40 0 [Note] WSREP: forgetting 65e318e6-b781 (ssl://10.1.11.242:4567)
|
2024-07-08 20:04:40 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4f537133-3d65-11ef-9fc2-b654173a47ac
|
2024-07-08 20:04:40 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 4f537133-3d65-11ef-9fc2-b654173a47ac
|
2024-07-08 20:04:40 0 [Note] WSREP: STATE EXCHANGE: got state msg: 4f537133-3d65-11ef-9fc2-b654173a47ac from 0 (node-240)
|
2024-07-08 20:04:40 0 [Note] WSREP: Quorum results:
|
version = 6,
|
component = PRIMARY,
|
conf_id = 4,
|
members = 1/1 (joined/total),
|
act_id = 271,
|
last_appl. = 259,
|
protocols = 2/10/4 (gcs/repl/appl),
|
vote policy= 0,
|
group UUID = 05b85e89-3d60-11ef-a7dc-7fe580428470
|
2024-07-08 20:04:40 0 [Note] WSREP: Flow-control interval: [32, 40]
|
2024-07-08 20:04:40 2 [Note] WSREP: ####### processing CC 272, local, ordered
|
2024-07-08 20:04:40 2 [Note] WSREP: ####### My UUID: 05b79a6d-3d60-11ef-b75b-83b55130a66a
|
2024-07-08 20:04:40 2 [Note] WSREP: Skipping cert index reset
|
2024-07-08 20:04:40 2 [Note] WSREP: REPL Protocols: 10 (5)
|
2024-07-08 20:04:40 2 [Note] WSREP: ####### Adjusting cert position: 271 -> 272
|
2024-07-08 20:04:40 0 [Note] WSREP: Service thread queue flushed.
|
2024-07-08 20:04:40 2 [Note] WSREP: ================================================
|
View:
|
id: 05b85e89-3d60-11ef-a7dc-7fe580428470:272
|
status: primary
|
protocol_version: 4
|
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
|
final: no
|
own_index: 0
|
members(1):
|
0: 05b79a6d-3d60-11ef-b75b-83b55130a66a, node-240
|
=================================================
|
2024-07-08 20:04:40 2 [Note] WSREP: Lowest cert index boundary for CC from group: 260
|
2024-07-08 20:04:40 2 [Note] WSREP: Min available from gcache for CC from group: 1
|
2024-07-08 20:04:45 0 [Note] WSREP: cleaning up 65e318e6-b781 (ssl://10.1.11.242:4567)
|
Joiner Log Snippet:
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 0 (node-240)
|
2024-07-08 19:58:09 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6631c9ef-3d64-11ef-ae5a-a73a5959da0c from 1 (node-242)
|
2024-07-08 19:58:09 0 [Note] WSREP: Quorum results:
|
version = 6,
|
component = PRIMARY,
|
conf_id = 3,
|
members = 1/2 (joined/total),
|
act_id = 210,
|
last_appl. = 130,
|
protocols = 2/10/4 (gcs/repl/appl),
|
vote policy= 0,
|
group UUID = 05b85e89-3d60-11ef-a7dc-7fe580428470
|
2024-07-08 19:58:09 0 [Note] WSREP: Flow-control interval: [18, 23]
|
2024-07-08 19:58:09 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 211)
|
2024-07-08 19:58:09 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 211)
|
2024-07-08 19:58:09 2 [Note] WSREP: ####### processing CC 211, local, ordered
|
2024-07-08 19:58:09 2 [Note] WSREP: Process first view: 05b85e89-3d60-11ef-a7dc-7fe580428470 my uuid: 65e318e6-3d64-11ef-b781-3b1e31d3c126
|
2024-07-08 19:58:09 2 [Note] WSREP: Server node-242 connected to cluster at position 05b85e89-3d60-11ef-a7dc-7fe580428470:211 with ID 65e318e6-3d64-11ef-b781-3b1e31d3c126
|
2024-07-08 19:58:09 2 [Note] WSREP: Server status change disconnected -> connected
|
2024-07-08 19:58:09 2 [Note] WSREP: ####### My UUID: 65e318e6-3d64-11ef-b781-3b1e31d3c126
|
2024-07-08 19:58:09 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
|
2024-07-08 19:58:09 0 [Note] WSREP: Service thread queue flushed.
|
2024-07-08 19:58:09 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
|
2024-07-08 19:58:09 2 [Note] WSREP: State transfer required:
|
Group state: 05b85e89-3d60-11ef-a7dc-7fe580428470:211
|
Local state: 00000000-0000-0000-0000-000000000000:-1
|
2024-07-08 19:58:09 2 [Note] WSREP: Server status change connected -> joiner
|
2024-07-08 19:58:09 0 [Note] WSREP: Joiner monitor thread started to monitor
|
2024-07-08 19:58:09 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/' --parent 18883 --progress 0
|
--binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657'
|
WSREP_SST: [INFO] mariabackup SST started on joiner (20240708 19:58:09.781)
|
WSREP_SST: [INFO] SSL configuration: CA='/certs/mariadb/ca-cert.pem', CAPATH='', CERT='/Certs/mariadb/server-cert.pem', KEY='/Certs/mariadb/server-key.pem', MODE='DISABLED',
|
encrypt='3' (20240708 19:58:09.995)
|
WSREP_SST: [INFO] Progress reporting tool pv not found in path: /usr//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/bin:/sbin:/bin (20240708 19:58:10.246)
|
WSREP_SST: [INFO] Disabling all progress/rate-limiting (20240708 19:58:10.251)
|
WSREP_SST: [INFO] Moving '/var/lib/mysql/mariabackup.backup.log' to '//mariadb/mariabackup.backup.log.2024.07.08-19.58.10.282426237' (20240708 19:58:10.289)
|
WSREP_SST: [INFO] Streaming with mbstream (20240708 19:58:10.314)
|
WSREP_SST: [INFO] Using socat as streamer (20240708 19:58:10.318)
|
WSREP_SST: [INFO] Using openssl based encryption with socat: with key and crt (20240708 19:58:10.341)
|
WSREP_SST: [INFO] Decrypting with cert='/Certs/mariadb/server-cert.pem', key='/Certs/mariadb/server-key.pem', ca='/Certs/mariadb/ca-cert.pem', capath='' (20240708 19:58:10.473)
|
WSREP_SST: [INFO] Evaluating timeout -k 310 300 socat -u openssl-listen:4444,reuseaddr,cert='/Certs/mariadb/server-cert.pem',key='/Certs/mariadb/server-key.pem',
|
cafile='/Certs/mariadb/ca-cert.pem',commonname='' stdio | '/usr//bin/mbstream' -x; RC=( ${PIPESTATUS[@]} ) (20240708 19:58:10.547)
|
2024-07-08 19:58:12 0 [Note] WSREP: (65e318e6-b781, 'ssl://0.0.0.0:4567') turning message relay requesting off
|
WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid stage: exit codes: 124 0 (20240708 20:03:10.560)
|
WSREP_SST: [ERROR] Cleanup after exit with status: 32 (20240708 20:03:10.568)
|
WSREP_SST: [ERROR] Removing /tmp/tmp.9nWF3ySGO6/xtrabackup_galera_info file due to signal (20240708 20:03:10.577)
|
WSREP_SST: [INFO] Removing the sst_in_progress file (20240708 20:03:10.585)
|
WSREP_SST: [INFO] Cleaning up temporary directories (20240708 20:03:10.597)
|
2024-07-08 20:04:39 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/'
|
--parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657
|
Read: '(null)'
|
2024-07-08 20:04:39 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.1.11.242' --datadir '/var/lib/mysql/'
|
--parent 18883 --progress 0 --binlog '/var/lib/mysql/mysql-bin' --mysqld-args --basedir=/usr --wsrep_start_position=53b03802-3d42-11ef-bdc9-9b83e6aba1c4:657: 32 (Broken pipe)
|
2024-07-08 20:04:39 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
|
2024-07-08 20:04:39 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
|
2024-07-08 20:04:39 2 [Note] WSREP: ReplicatorSMM::abort()
|
2024-07-08 20:04:39 2 [Note] WSREP: Closing send monitor...
|
2024-07-08 20:04:39 2 [Note] WSREP: Closed send monitor.
|
2024-07-08 20:04:39 2 [Note] WSREP: gcomm: terminating thread
|
2024-07-08 20:04:39 2 [Note] WSREP: gcomm: joining thread
|
2024-07-08 20:04:39 2 [Note] WSREP: gcomm: closing backend
|
2024-07-08 20:04:40 2 [Note] WSREP: view(view_id(NON_PRIM,05b79a6d-b75b,4) memb {
|
65e318e6-b781,0
|
} joined {
|
} left {
|
} partitioned {
|
05b79a6d-b75b,0
|
})
|
2024-07-08 20:04:40 2 [Note] WSREP: PC protocol downgrade 1 -> 0
|
2024-07-08 20:04:40 2 [Note] WSREP: view((empty))
|
2024-07-08 20:04:40 2 [Note] WSREP: gcomm: closed
|
2024-07-08 20:04:40 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
|
2024-07-08 20:04:40 0 [Note] WSREP: Flow-control interval: [13, 16]
|
2024-07-08 20:04:40 0 [Note] WSREP: Received NON-PRIMARY.
|
2024-07-08 20:04:40 0 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 271)
|
2024-07-08 20:04:40 0 [Note] WSREP: New SELF-LEAVE.
|
2024-07-08 20:04:40 0 [Note] WSREP: Flow-control interval: [0, 0]
|
2024-07-08 20:04:40 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
|
2024-07-08 20:04:40 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 271)
|
2024-07-08 20:04:40 0 [Note] WSREP: RECV thread exiting 0: Success
|
2024-07-08 20:04:40 2 [Note] WSREP: recv_thread() joined.
|
2024-07-08 20:04:40 2 [Note] WSREP: Closing replication queue.
|
2024-07-08 20:04:40 2 [Note] WSREP: Closing slave action queue.
|
2024-07-08 20:04:40 2 [Note] WSREP: /usr/libexec/mariadbd: Terminated.
|
I am also attaching the CPU flag differences between the hardware, as I observed that recent MariaDB builds are compiled with CPU optimizations. Specifically, it crashes if the fma flag is missing. This information might be relevant to this issue.
Request:
I am seeking assistance to determine the root cause of this issue. Any suggestions on additional logs or configurations to check would be greatly appreciated.