Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9670

server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break

    Details

    • Sprint:
      10.2.7-1

      Description

      A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms, and in order to fix the problems, the server process to be killed.

      The user has the following replication topology with 8 servers total in 2 data centers (4 nodes each DC). Node1 in each data center are masters that replicate from each other using circular replication. Nodes 2,3 and 7 in each data center are slaves with read-only enabled.

      The gtid_domain_id and server_ids values are the following:

      DC1:
      Gtid domain: 15
      Node1 server_id: 7715
      Node2 server_id: 7716
      Node3 server_id: 7717
      Node7 server_id: 7718

      DC2:
      Gtid domain: 17
      Node1 server_id: 7725
      Node2 server_id: 7726
      Node3 server_id: 7727
      Node7 server_id: 7728

      Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:

      # at 73459651
      #160301 14:56:32 server id 0  end_log_pos 73459689      GTID 15-0-6377149
      /*!100001 SET @@session.gtid_domain_id=15*//*!*/;
      /*!100001 SET @@session.server_id=0*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
      BEGIN
      /*!*/;
      # at 73459689
      # at 73459783
      # at 73460738
      # at 73461693
      # at 73462648
      # at 73463603
      # at 73464558
      # at 73465513
      # at 73466468
      # at 73467423
      # at 73468378
      # at 73469333
      # at 73470288
      # at 73471243
      # at 73472198
      # at 73473153
      # at 73474108
      

      This caused replication to break:

      MariaDB [(none)]> show all slaves status \G;
      *************************** 1. row ***************************
                    Connection_name: dc2master
                    Slave_SQL_State:
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 10.55.20.41
                        Master_User: replicator
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.007285
                Read_Master_Log_Pos: 74024310
                     Relay_Log_File: relay-bin-dc2master.000026
                      Relay_Log_Pos: 113584
              Relay_Master_Log_File: mariadb-bin.007285
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB: db1,db2
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1950
                         Last_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 73459651
                    Relay_Log_Space: 354309
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1950
                     Last_SQL_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 7725
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377150,17-7725-2432887
               Retried_transactions: 0
                 Max_relay_log_size: 104857600
               Executed_log_entries: 14671605
          Slave_received_heartbeats: 22207
             Slave_heartbeat_period: 15.000
                     Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377677,11-7511-6574751,17-7725-2432353
      1 row in set (0.00 sec)
      

      Some things to note here:

      • The server_id is somehow set to 0, but the application is not setting the variable.
      • The GTID for this transaction has a gtid_domain_id of 15, which means that it should have originated on DC1-Node1. In that case, it should have server_id set to 7715.
      • DC1-Node1 has gtid_ignore_duplicates enabled
      • DC1-Node1 shouldn't try to re-run its own transaction when gtid_ignore_duplicates is enabled. It only seems to try to do so because the transaction's server_id is incorrect.

      The other events within the transaction do have the proper server_id set:

      #160301 14:56:32 server id 7715  end_log_pos 73459783   Table_map: `db3`.`tab1` mapped to number 108
      #160301 14:56:32 server id 7715  end_log_pos 73460738   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73461693   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73462648   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73463603   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73464558   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73465513   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73466468   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73467423   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73468378   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73469333   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73470288   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73471243   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73472198   Write_rows: table id 108
      

      When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:

      # at 98082295
      #160301 14:56:33 server id 7715  end_log_pos 98082333   GTID 15-7715-6377149
      /*!100001 SET @@session.gtid_domain_id=15*//*!*/;
      /*!100001 SET @@session.server_id=7715*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
      BEGIN
      /*!*/;
      # at 98082333
      # at 98082427
      # at 98083382
      # at 98084337
      # at 98085292
      # at 98086247
      # at 98087202
      # at 98088157
      # at 98089112
      # at 98090067
      # at 98091022
      # at 98091977
      # at 98092932
      # at 98093887
      # at 98094842
      # at 98095797
      # at 98096752
      # at 98097707
      # at 98098662
      # at 98099617
      # at 98100572
      # at 98101527
      # at 98102482
      # at 98103437
      # at 98104392
      # at 98105347
      

      The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave.

      Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node2 ran into an out-of-order sequence GTID error for a different transaction:

      MariaDB [(none)]> show all slaves status \G;
      *************************** 1. row ***************************
                    Connection_name: dc1master
                    Slave_SQL_State:
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 10.35.20.41
                        Master_User: replicator
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.006328
                Read_Master_Log_Pos: 44256878
                     Relay_Log_File: relay-bin-dc1master.000026
                      Relay_Log_Pos: 98265194
              Relay_Master_Log_File: mariadb-bin.006326
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB: db1,db2
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1950
                         Last_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 98406481
                    Relay_Log_Space: 254089255
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1950
                     Last_SQL_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 7715
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377636,17-7725-2432353
               Retried_transactions: 0
                 Max_relay_log_size: 104857600
               Executed_log_entries: 1371851
          Slave_received_heartbeats: 1276
             Slave_heartbeat_period: 15.000
                     Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432837
      1 row in set (0.00 sec)
      

      This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:

      # at 98406481
      #160301 14:56:33 server id 7715  end_log_pos 98406519   GTID 15-7715-6377150
      /*!100001 SET @@session.gtid_seq_no=6377150*//*!*/;
      BEGIN
      /*!*/;
      # at 98406519
      # at 98406613
      # at 98407568
      # at 98408523
      # at 98409478
      # at 98410433
      # at 98411388
      # at 98412343
      # at 98413298
      # at 98414253
      # at 98415208
      # at 98416163
      # at 98417118
      # at 98418073
      # at 98419028
      # at 98419983
      # at 98420938
      # at 98421893
      # at 98422848
      # at 98423803
      # at 98424758
      

      Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:

      160301 19:56:33 [ERROR] Master 'dc1master': Slave SQL: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Gtid 15-7715-6377150, Internal MariaDB error code: 1950
      160301 19:56:33 [Warning] Master 'dc1master': Slave: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Error_code: 1950
      160301 19:56:33 [ERROR] Master 'dc1master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.006326' position 98406481; GTID position '21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432353'
      160301 19:56:33 [ERROR] Master 'dc1master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 7.
      TRANSACTION 1046771904, not started
      mysql tables in use 1, locked 7
      MySQL thread id 4, OS thread handle 0x7fee9ea4f700, query id 20972814 Waiting for slave mutex on exit
       len 1280; hex XXXXXXXXXXXXXXX; asc  \n      !                      ('                      + ^                                                                                                              [                        Z                                                                                      '                                       xd>       V     xd>                            MySQLXid-         @                                                                                                             J % /                           b e                                                     h i     h                                                a                                                               !                        A     p$              7 ^                                                                                     H                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;
      

      After this, we noticed the following issues:

      • The broken slave thread held onto a slave mutex, which meant the slave couldn't be stopped or killed.
      • The broken slave thread had locks on some tables, which kept other transactions from using those tables.
      • The server wouldn't even cleanly shut down, so the server process was killed.

      The user is not using parallel replication.

      The user is using multi-source replication.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                monty Michael Widenius
                Reporter:
                GeoffMontee Geoff Montee
              • Votes:
                2 Vote for this issue
                Watchers:
                14 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: