  MariaDB MaxScale / MXS-4542

Duplicate GTID on replicas


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce

    Description

      This is a continuation of the story from https://jira.mariadb.org/browse/MXS-4421: replicas running under GTID strict mode stop with a duplicate transaction ID error.
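
      For reference, this is how the stopped state can be checked on a replica (a rough sketch; connection details are omitted, and the field names are the standard SHOW SLAVE STATUS ones):

        # Hedged sketch: how the failure surfaces on a stopped replica.
        mariadb -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_SQL_Running|Last_Error|Gtid_IO_Pos'
        mariadb -e "SELECT @@gtid_strict_mode, @@gtid_slave_pos, @@gtid_binlog_pos\G"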

      This time the error happened during nighttime operations, with no human action involved, so at least we can rule out loading a dump or using a particular SQL client as the cause (the clients were regular PHP web sites).

      Just like before, this killed multiple replicas at the same time with the same transaction number error. The odd thing this time is the error message on the replica:

      Last_Error: An attempt was made to binlog GTID 0-11-3379168833 which would create an out-of-order sequence number with existing GTID 0-11-3379168833, and gtid strict mode is enabled

      Note that both the "existing" and the "new" transaction have the SAME origin server (ID=11). At first, we thought this was a bug of some kind in the logging. Then we inspected the binlog on the master, the relay log on the replica (as expected, found to be identical to the binlog of the master) and the binlog on the replica. To our amazement, we found GTID 0-11-3379168833 to have completely different content in the relay log and in the binlog of the replica!
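
      The comparison can be reproduced roughly like this (the file names are placeholders; the start positions are the byte offsets shown in the dumps below):

        # Hedged sketch: dump the same GTID from the relay log and from the replica's own binlog.
        # File names are placeholders; start positions are taken from the dumps below.
        mysqlbinlog --start-position=7190887 /var/lib/mysql/relay-bin.000123 | head -n 40
        mysqlbinlog --start-position=14151595 /var/lib/mysql/mariadb-bin.000456 | head -n 40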

      The relay log entry is a long transaction, starting like this:

        # at 7190887
        #230305  4:48:42 server id 11  end_log_pos 7190929 CRC32 0xc59e2961  GTID 0-11-3379168833 trans
        /*!100001 SET @@session.gtid_seq_no=3379168833*//*!*/;
        START TRANSACTION
        /*!*/;
        # at 7190929
        #230305  4:48:42 server id 11  end_log_pos 7196331 CRC32 0x41db85a7  Query  thread_id=5012  exec_time=0  error_code=0  xid=0
        use `db_XXX`/*!*/;
        SET TIMESTAMP=1677988122/*!*/;
        SET @@session.sql_mode=0/*!*/;
        /*!\C utf8mb4 *//*!*/;
        SET @@session.character_set_client=45,@@session.collation_connection=45,@@session.collation_server=45/*!*/;
        ...

      and the local binlog has this:

        # at 14151595
        #230305  4:48:42 server id 11  end_log_pos 14151637 CRC32 0x7de723d3  GTID 0-11-3379168833
        /*!100001 SET @@session.gtid_seq_no=3379168833*//*!*/;
        START TRANSACTION
        /*!*/;
        # at 14151637
        #230305  4:48:42 server id 11  end_log_pos 14151754 CRC32 0x525a85fc  Query  thread_id=5015  exec_time=0  error_code=0  xid=0
        use `mysql`/*!*/;
        SET TIMESTAMP=1677988122/*!*/;
        DELETE FROM `db_YYY`.`wp_wfls_role_counts`
        /*!*/;

      These two seem to originate from different SQL clients, as they refer to different databases (there is a strict policy of one SQL account per DB and there are no cross-schema queries or clients that have access to both db_XXX and db_YYY).

      It is completely unclear how the entry in the replica's binlog came to be, because it carries the server ID of the master (11), while the replicas' own server IDs are different (12 and 13 on the affected replicas).
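
      As a side note, the replica-side settings involved here (server_id, log_slave_updates, gtid_domain_id) can be checked quickly; a minimal sketch, using standard MariaDB variables:

        # Sketch: replica identity and binlog-related settings.
        # (log_slave_updates controls whether replicated events are written to the replica's own binlog.)
        mariadb -e "SELECT @@server_id, @@log_slave_updates, @@gtid_domain_id\G"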

      The next puzzling thing is that the replicas continued to receive DML for a short time even after replication had stopped, and logged two more statements, this time under their own server ID:

        # at 14151828
        #230305  4:53:57 server id 13  end_log_pos 14151870 CRC32 0x0072b77c  GTID 0-13-3379168834
        /*!100001 SET @@session.server_id=13*//*!*/;
        /*!100001 SET @@session.gtid_seq_no=3379168834*//*!*/;
        # at 14151870
        #230305  4:53:57 server id 13  end_log_pos 14151989 CRC32 0x6937c9a8  Query  thread_id=6056  exec_time=0  error_code=0  xid=0
        SET TIMESTAMP=1677988437/*!*/;
        DELETE FROM `db_AAA`.`wp_wfls_role_counts`
        /*!*/;
        # at 14151989
        #230305  5:00:29 server id 13  end_log_pos 14152031 CRC32 0xcf4643bd  GTID 0-13-3379168835
        /*!100001 SET @@session.gtid_seq_no=3379168835*//*!*/;
        # at 14152031
        #230305  5:00:29 server id 13  end_log_pos 14152149 CRC32 0x4e82ff39  Query  thread_id=6146  exec_time=0  error_code=0  xid=0
        SET TIMESTAMP=1677988829/*!*/;
        DELETE FROM `db_BBB`.`wp_wfls_role_counts`
        /*!*/;
        DELIMITER ;
        # End of log file

      Again, we have two other databases involved, meaning two more SQL clients: db_AAA and db_BBB.

      This is the end of the log, since with replication stopped, MaxScale removed the replicas from routing.
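
      For completeness, the state MaxScale assigns to such servers can be checked with maxctrl (the server name below is a placeholder):

        # Sketch: which servers MaxScale still considers usable for routing.
        maxctrl list servers
        maxctrl show server replica1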

      We don't have many more clues to offer, but at least we know that this mysterious issue is still around.

          People

            Assignee: Johan Wikman
            Reporter: Assen Totin