MariaDB Server / MDEV-38058

SST to a joiner with a very high SQL load leads to server termination after detected node inconsistency


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.3.32
    • Fix Version/s: None
    • Component/s: Galera SST
    • Labels: None

    Description

      Apologies for the rather long description; I'm posting it here for reference, in the hope that someone has already hit this issue and could help fix it.

      system: RHEL-8.4
      mariadb: 10.3.32
      galera: 25.3.34

      Observed on a 3-node galera cluster under a very heavy workload with unpredictable peaks. It seems that when bootstrapping a 3-node cluster under heavy load, the SST to a joiner might be impacted and yield an inconsistent state on the joiner, which prevents it from continuing to apply write-sets.

      The galera cluster mentioned is used by an Openstack workload. This is a fairly standard deployment for Openstack:

      • 3-node cluster ctrl1, ctrl2, ctrl3, bootstrapped and monitored by pacemaker.
      • mysql/galera listening on software-defined NICs, backed by OVN.
      • rsync method used for SST.
      • No gcache recovery used on restart.
      • HAProxy in front of the galera cluster to balance traffic to the available galera nodes.
      • galera is used as Active/Hot-standby, meaning only one galera node at a time is forwarded traffic by HAProxy.
      • HAProxy polls every galera node every 2 seconds, and is responsible for failing over traffic when the currently targeted node is no longer in the "Primary" component and "Joined" state (a minimal sketch of this health check is given after this list).
      • If the node receiving traffic switches to "Donor" state, HAProxy takes up to 2s to react and switch traffic to another available node (Primary/Joined).
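
      For reference, the per-node health check behind that failover logic essentially boils down to two wsrep status variables. This is only a minimal sketch, assuming a clustercheck-style probe; the exact script used in this deployment may differ:

      -- Sketch of a clustercheck-style health probe (assumption: run by the HAProxy
      -- check against each node); the node stays in rotation only while both values
      -- look healthy.
      SHOW STATUS LIKE 'wsrep_cluster_status';       -- expected: 'Primary'
      SHOW STATUS LIKE 'wsrep_local_state_comment';  -- expected: a joined/synced state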

      Due to an unrelated ongoing upgrade activity, this environment experienced issues with the network interfaces in use, which caused various disconnections of galera nodes in the cluster.

      The issue reported here seems to have started after an IST between ctrl1<->ctrl3 could not finish due to a network disconnection. The IST donor and joiner stopped, which left the 3rd node in a non-Primary component, which in turn caused pacemaker to stop it as well.

      Right when pacemaker re-bootstrapped the galera cluster from ctrl2, traffic got processed again, and ctrl3 then joined the cluster and requested an SST. When the SST finished, InnoDB initialized, the node started receiving write-sets for replication, and almost immediately failed to replicate because of data missing locally.

      2023-11-29 9:23:39 2 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table nova_api.instance_mappings; Can't find record in 'instance_mappings', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 546, Internal MariaDB>
      [...]
      2023-11-29 9:23:39 2 [Warning] WSREP: RBR event 3 Update_rows_v1 apply warning: 120, 879573306
      2023-11-29 9:23:39 2 [ERROR] WSREP: Failed to apply trx: source: lpctrl15001:d560a523-7b0f-11ee-8376-eafbe7079ce5 version: 4 local: 0 state: APPLYING flags: 1 conn_id: 141 trx_id: 1821280904 seqnos (l: 38, g: 879573306, s: 879573305, d: 879573286, ts: 24350914700820336)
      2023-11-29 9:23:39 2 [ERROR] WSREP: Failed to apply trx 879573306 4 times
      2023-11-29 9:23:39 2 [ERROR] WSREP: Node consistency compromised, aborting...

      What is surprising to me, though I can't tell whether this is normal or not, is that at the time the SST finished successfully and InnoDB restarted, some transactions in InnoDB seem to have been marked as non-prepared and were rolled back. Since rsync performs a full database sync, it looks to me like those were the missing transactions that prevented the next write-sets from being committed.

      I'm attaching the logs from the 3 controllers with obfuscated hostnames. I have also tried my best to annotate those logs, reconstructing the timeline of all the network disconnections and mapping hostnames to internal galera ids, in 3 separate log files, in case it can be of any use.

      I'm also attaching the mysql config used for this cluster. These are pretty much standard settings AFAICT, except for the following InnoDB and gcache settings, which may be out of the ordinary and were needed to sustain the unpredictable nature of the load in this environment:

      innodb_buffer_pool_size = 24576M
      innodb_log_file_size = 1024M
      wsrep_provider_options = gcache.size=5G;gcache.recover=no;gmcast.listen_addr=tcp://10.30.97.21:4567;
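
      For cross-checking the attached config, the effective values can also be read back at runtime from any SQL client on the node; this is just a verification query, nothing specific to this report:

      SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
      SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';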

      What is unclear to me is that, by design, there's a step in the SST process which forces the DB to flush its tables before rsync starts transferring data (a rough sketch of that donor-side sequence is shown below). So even if receiving unprepared transactions is to be expected when the Donor was serving traffic, there should always be an IST afterwards to catch up whatever data is not synced, but this isn't the case in those logs.
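
      For clarity, this is roughly how I understand the donor side of the rsync SST, expressed as SQL. It is only a sketch of the principle, not the actual wsrep_sst_rsync implementation, which drives this from a script:

      -- Sketch only (assumption): the donor quiesces the datadir, notes the replication
      -- position, lets rsync copy the files, then resumes; any writes committed on the
      -- donor after the copy should reach the joiner via a follow-up IST.
      FLUSH TABLES WITH READ LOCK;               -- block writes while the datadir is copied
      SHOW STATUS LIKE 'wsrep_last_committed';   -- seqno the joiner is expected to resume from
      -- ... rsync of the datadir to the joiner happens here, outside SQL ...
      UNLOCK TABLES;                             -- donor resumes serving traffic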

      I'm out of ideas as to what might have caused this node inconsistency:

      • pacemaker bootstrapped ctrl2, which was the last node to be stopped.
      • there was no network issue at the time of the SST.
      • HAProxy likely stopped forwarding incoming SQL traffic after 2s, since only one node was up and synced at the time the SST was initiated.

      It all looks like the heavy load on the donor node somehow had an influence on the SST and the joining process.

      I could not find anything similar reported upstream, and I have also witnessed this same behaviour once in another environment with a similar workload (thousands of Openstack orders for VM creation/deletion, resulting in unpredictable spikes of database traffic).

      Have you already seen this behaviour in Galera? Is there a chance that this specific issue was already fixed in a Jira issue that I could have missed? It seems to me that this is a perfectly valid use case. Assuming it is, could this issue be specific to the rsync SST, or is it more likely a race condition in the SST mechanism itself (or even in InnoDB, which seems less likely to me)?

      Attachments

        1. galera.cnf
          2 kB
        2. ctrl3-annotated.log
          233 kB
        3. ctrl3.log
          155 kB
        4. ctrl2-annotated.log
          220 kB
        5. ctrl2.log
          139 kB
        6. ctrl1-annotated.log
          176 kB
        7. ctrl1.log
          112 kB


        People

          Assignee: Unassigned
          Reporter: Damien Ciabrini (dciabrin)

