MariaDB Server / MDEV-5804

If same GTID is received on multiple master connections in multi-source replication, the event is double-executed causing corruption or replication failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0.9
    • Fix Version/s: 10.0.10
    • Component/s: None
    • Labels:
      None

      Description

      With multi-source replication, a slave can receive the same GTID twice if
      there are multiple paths through which events can arrive from a master. One
      example is a three-node multi-master setup in which each server replicates
      as a slave from the two others.

      With correctly configured GTID, it is easy to detect that an event has already
      been received once: compare its sequence number against the last GTID
      received within that domain, and ignore the event if it has already been
      applied. But currently the code does not handle this correctly; instead it
      applies the event twice, or in GTID strict mode stops with an error.
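
      The check described above can be sketched as follows. This is only an
      illustration in Python, not server code: the names Gtid, parse_gtid and
      is_duplicate are hypothetical, and the sketch assumes the usual
      domain-server-sequence text form of a GTID.

```python
from typing import Dict, NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a 'domain-server-sequence' string such as '0-1-100'."""
    domain, server, seq = (int(part) for part in text.split("-"))
    return Gtid(domain, server, seq)


def is_duplicate(gtid: Gtid, last_seq_by_domain: Dict[int, int]) -> bool:
    """True if this GTID is not newer than the last one seen in its domain."""
    last = last_seq_by_domain.get(gtid.domain_id)
    return last is not None and gtid.seq_no <= last
```

      Note that the comparison is per domain only; the server id plays no part,
      which is what lets the same event arrive from two different masters.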

      We cannot really make this the default behaviour, as that could break upgrades
      of existing setups. It also conflicts with the strict-mode behaviour of
      giving an error, which some users want. However, we can implement a
      --gtid-ignore-duplicates option that enables this behaviour.

      So instead of the GTID strict mode behaviour, which fails if we see D-S1-M
      after D-S2-N (M <= N), we will handle it as follows (GTIDs are written as
      domain-server-sequence number):

      If we receive T1=D-S1-M, and current position in D is D-S2-N with N >= M, we
      drop T1.

      If M > N, then the event needs to be applied. However, we need to protect
      against two different master connections trying to apply the same GTID at the
      same time. So the first connection to see the new GTID with M > N is set as
      the current owner of the domain, and starts applying the transaction. When it
      is done and has committed, the domain is released and the owner is
      cleared. Any second connection that receives a GTID with M > N while the
      domain is already reserved by another owner must wait until the current
      owner is done, and then decide whether to discard the event or apply it
      (then becoming the new owner).
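
      A minimal sketch of this drop / apply / wait decision, written in Python
      with a condition variable. The class and method names (DomainState,
      acquire_or_drop, release) are hypothetical and only illustrate the
      protocol; this is not how the server implements it. It assumes each
      connection applies one transaction at a time.

```python
import threading


class DomainState:
    """Tracks the applied position and current owner of one GTID domain."""

    def __init__(self):
        self.cond = threading.Condition()
        self.last_seq_no = 0   # N: highest sequence number committed so far
        self.owner = None      # connection currently applying in this domain

    def acquire_or_drop(self, connection, seq_no):
        """Return True if the caller must apply the event, False to drop it."""
        with self.cond:
            while True:
                if seq_no <= self.last_seq_no:
                    return False            # already applied: drop
                if self.owner is None:
                    self.owner = connection
                    return True             # caller becomes the owner
                self.cond.wait()            # another owner is busy; re-check later

    def release(self, connection, committed_seq_no):
        """Called by the owner after commit; wakes any waiting connections."""
        with self.cond:
            if self.owner != connection:
                raise RuntimeError("release by a connection that is not the owner")
            self.last_seq_no = max(self.last_seq_no, committed_seq_no)
            self.owner = None
            self.cond.notify_all()
```

      A waiter re-evaluates the sequence number after being woken, so it drops
      the event if the previous owner has just committed the same GTID.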

      A good way to implement this seems to be to have a current owner of each
      domain, in the form of the Relay_log_info. In parallel replication, we also
      keep a reference count of worker threads active in that domain. When a worker
      thread gets the lock on the domain, it sets the owner (if unset) and
      increases the reference count. When it is done, it decreases the count, and
      if the count reaches zero it removes the owner and signals any waiters that
      the domain is now free to be grabbed by someone else.
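
      The reference-counted ownership could look roughly like the Python sketch
      below. DomainOwner and its methods are hypothetical names, and the rli
      argument merely stands in for a Relay_log_info; the discard-or-apply check
      on the sequence number is left out to keep the counting logic visible.

```python
import threading


class DomainOwner:
    """Owner plus worker reference count for one GTID domain."""

    def __init__(self):
        self.cond = threading.Condition()
        self.owner = None     # stand-in for the owning Relay_log_info
        self.ref_count = 0    # worker threads of the owner active in the domain

    def acquire(self, rli):
        """Block until the domain is free or already owned by rli."""
        with self.cond:
            while self.owner is not None and self.owner is not rli:
                self.cond.wait()
            self.owner = rli          # sets the owner if it was unset
            self.ref_count += 1

    def release(self, rli):
        """Drop one reference; the last one clears the owner and signals waiters."""
        with self.cond:
            if self.owner is not rli or self.ref_count == 0:
                raise RuntimeError("release by a non-owner")
            self.ref_count -= 1
            if self.ref_count == 0:
                self.owner = None
                self.cond.notify_all()
```

      Several workers of the same connection can hold the domain at once; a
      different connection only gets in after the last of them has released it.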

      Normally, if a slave asks to connect at GTID position D-S2-N, but the master is
      only at D-S1-M, M < N, then the slave will get an error that it is ahead of
      the master. However, if --gtid-ignore-duplicates is set, this is a normal
      situation (e.g. the GTID came directly from A->C, it has not yet come A->B,
      and now C wants to connect as a slave to B). So in this case the slave needs
      to tell the master not to give an error; instead the master must simply
      wait for the GTID D-S2-N to turn up and then start sending events to the
      slave.
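
      The master-side decision at connect time can be summarised in a small
      Python sketch. ConnectAction and on_slave_connect are hypothetical names
      used only for illustration, and the flag stands for whatever the slave
      sends to tell the master that it runs with --gtid-ignore-duplicates.

```python
from enum import Enum


class ConnectAction(Enum):
    SEND = "send"     # start sending events after the slave's position
    WAIT = "wait"     # wait for the slave's GTID to turn up, then send
    ERROR = "error"   # slave is ahead of the master


def on_slave_connect(slave_seq_no: int, master_seq_no: int,
                     ignore_duplicates: bool) -> ConnectAction:
    """Decide how the master reacts to a slave connecting at slave_seq_no (N)
    within a domain where the master itself is at master_seq_no (M)."""
    if slave_seq_no <= master_seq_no:
        return ConnectAction.SEND
    if ignore_duplicates:
        return ConnectAction.WAIT
    return ConnectAction.ERROR
```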

            People

            Assignee:
            knielsen Kristian Nielsen
            Reporter:
            knielsen Kristian Nielsen
            Votes:
            0
            Watchers:
            6
