MariaDB Server / MDEV-38058

SST to a joiner with a very high SQL load leads to server termination after detected node inconsistency


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.3.32
    • Fix Version/s: None
    • Component/s: Galera SST
    • Labels: None

    Description

      Apologies for the rather long description; I'm posting it here for reference, in the hope that someone has already hit this issue and could help fix it.

      system: RHEL-8.4
      mariadb: 10.3.32
      galera: 25.3.34

      Observed on a 3-node galera cluster under a very heavy workload with unpredictable peaks. It seems that when bootstrapping a 3-node cluster under heavy load, the SST to a joiner might be impacted and yield an inconsistent state on the joiner, which prevents it from continuing to apply write-sets.

      The galera cluster mentioned is used by an Openstack workload. This is a fairly standard deployment for Openstack:

      • 3-node cluster ctrl1, ctrl2, ctrl3, bootstrapped and monitored by pacemaker.
      • mysql/galera listening on software-defined NICs, backed by OVN.
      • rsync method used for SST.
      • No gcache recovery used on restart.
      • HAProxy in front of the galera cluster to balance traffic to the available galera nodes.
      • galera is used as Active/Hot-standby, meaning only one galera node at a time is forwarded traffic by HAProxy.
      • HAProxy polls every galera node every 2 seconds, and is responsible for failing over traffic when the currently targeted node is no longer in the "Primary" component and "Joined" state (a minimal sketch of this health check is given after this list).
      • If the node receiving traffic switches to "Donor" state, HAProxy takes up to 2s to react and switch traffic to another available node (Primary/Joined).
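
      For reference, the per-node health check behind that failover logic essentially boils down to two wsrep status variables. This is only a minimal sketch, assuming a clustercheck-style probe; the exact script used in this deployment may differ:

      -- Sketch of a clustercheck-style health probe (assumption: run by the HAProxy
      -- check against each node); the node stays in rotation only while both values
      -- look healthy.
      SHOW STATUS LIKE 'wsrep_cluster_status';       -- expected: 'Primary'
      SHOW STATUS LIKE 'wsrep_local_state_comment';  -- expected: a joined/synced state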

      Due to an unrelated ongoing upgrade activity, this environment experienced issues with the network interfaces in use, which caused various disconnections of galera nodes in the cluster.

      The issue reported here seems to have started after an IST between ctrl1<->ctrl3 could not finish due to a network disconnection. The IST donor and joiner stopped, which left the 3rd node in a non-Primary component, which in turn caused pacemaker to stop it as well.

      Right when pacemaker re-bootstrapped the galera cluster from ctrl2, traffic got processed again, and ctrl3 then joined the cluster and requested an SST. When the SST finished, InnoDB initialized, the node started receiving write-sets for replication, and almost immediately failed to replicate because of data missing locally.

      2023-11-29 9:23:39 2 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table nova_api.instance_mappings; Can't find record in 'instance_mappings', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 546, Internal MariaDB>
      [...]
      2023-11-29 9:23:39 2 [Warning] WSREP: RBR event 3 Update_rows_v1 apply warning: 120, 879573306
      2023-11-29 9:23:39 2 [ERROR] WSREP: Failed to apply trx: source: lpctrl15001:d560a523-7b0f-11ee-8376-eafbe7079ce5 version: 4 local: 0 state: APPLYING flags: 1 conn_id: 141 trx_id: 1821280904 seqnos (l: 38, g: 879573306, s: 879573305, d: 879573286, ts: 24350914700820336)
      2023-11-29 9:23:39 2 [ERROR] WSREP: Failed to apply trx 879573306 4 times
      2023-11-29 9:23:39 2 [ERROR] WSREP: Node consistency compromised, aborting...

      What is surprising to me, though I can't tell whether this is normal or not, is that at the time the SST finished successfully and InnoDB restarted, some transactions in InnoDB seem to have been marked as non-prepared and were rolled back. Since rsync performs a full database sync, it looks to me like those were the missing transactions that prevented the next write-sets from being committed.

      I'm attaching the logs from the 3 controllers with obfuscated hostnames. I have also tried my best to annotate those logs, reconstructing the timeline of all the network disconnections and mapping hostnames to internal galera ids, in 3 separate log files, in case it can be of any use.

      I'm also attaching the mysql config used for this cluster. These are pretty much standard settings AFAICT, except for the following InnoDB and gcache settings, which may be out of the ordinary and were needed to sustain the unpredictable nature of the load in this environment:

      innodb_buffer_pool_size = 24576M
      innodb_log_file_size = 1024M
      wsrep_provider_options = gcache.size=5G;gcache.recover=no;gmcast.listen_addr=tcp://10.30.97.21:4567;
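
      For cross-checking the attached config, the effective values can also be read back at runtime from any SQL client on the node; this is just a verification query, nothing specific to this report:

      SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
      SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';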

      What is unclear to me is that, by design, there's a step in the SST process which forces the DB to flush its tables before rsync starts transferring data (a rough sketch of that donor-side sequence is shown below). So even if receiving unprepared transactions is to be expected when the Donor was serving traffic, there should always be an IST afterwards to catch up whatever data is not synced, but this isn't the case in those logs.
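
      For clarity, this is roughly how I understand the donor side of the rsync SST, expressed as SQL. It is only a sketch of the principle, not the actual wsrep_sst_rsync implementation, which drives this from a script:

      -- Sketch only (assumption): the donor quiesces the datadir, notes the replication
      -- position, lets rsync copy the files, then resumes; any writes committed on the
      -- donor after the copy should reach the joiner via a follow-up IST.
      FLUSH TABLES WITH READ LOCK;               -- block writes while the datadir is copied
      SHOW STATUS LIKE 'wsrep_last_committed';   -- seqno the joiner is expected to resume from
      -- ... rsync of the datadir to the joiner happens here, outside SQL ...
      UNLOCK TABLES;                             -- donor resumes serving traffic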

      I'm out of ideas as to what might have caused this node inconsistency:

      • pacemaker bootstrapped ctrl2, which was the last node to be stopped.
      • there was no network issue at the time of the SST.
      • HAProxy likely stopped forwarding incoming SQL traffic after 2s, since only one node was up and synced at the time the SST was initiated.

      It all looks like the heavy load on the donor node somehow had an influence on the SST and the joining process.

      I could not find anything similar reported upstream, and I have also witnessed this same behaviour once in another environment with a similar workload (thousands of Openstack orders for VM creation/deletion, resulting in unpredictable spikes of database traffic).

      Have you already seen this behaviour in Galera? Is there a chance that this specific issue was already fixed in a Jira issue that I could have missed? It seems to me that this is a perfectly valid use case. Assuming it is, could this issue be specific to the rsync SST, or is it more likely a race condition in the SST mechanism itself (or even in InnoDB, which seems less likely to me)?

      Attachments

        1. galera.cnf
          2 kB
        2. ctrl3-annotated.log
          233 kB
        3. ctrl3.log
          155 kB
        4. ctrl2-annotated.log
          220 kB
        5. ctrl2.log
          139 kB
        6. ctrl1-annotated.log
          176 kB
        7. ctrl1.log
          112 kB


        People

          Assignee: Unassigned
          Reporter: Damien Ciabrini (dciabrin)

