Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9670

server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break

Details

    • 10.2.7-1

    Description

      A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms, and in order to fix the problems, the server process to be killed.

      The user has the following replication topology with 8 servers total in 2 data centers (4 nodes each DC). Node1 in each data center are masters that replicate from each other using circular replication. Nodes 2,3 and 7 in each data center are slaves with read-only enabled.

      The gtid_domain_id and server_ids values are the following:

      DC1:
      Gtid domain: 15
      Node1 server_id: 7715
      Node2 server_id: 7716
      Node3 server_id: 7717
      Node7 server_id: 7718

      DC2:
      Gtid domain: 17
      Node1 server_id: 7725
      Node2 server_id: 7726
      Node3 server_id: 7727
      Node7 server_id: 7728

      Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:

      # at 73459651
      #160301 14:56:32 server id 0  end_log_pos 73459689      GTID 15-0-6377149
      /*!100001 SET @@session.gtid_domain_id=15*//*!*/;
      /*!100001 SET @@session.server_id=0*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
      BEGIN
      /*!*/;
      # at 73459689
      # at 73459783
      # at 73460738
      # at 73461693
      # at 73462648
      # at 73463603
      # at 73464558
      # at 73465513
      # at 73466468
      # at 73467423
      # at 73468378
      # at 73469333
      # at 73470288
      # at 73471243
      # at 73472198
      # at 73473153
      # at 73474108
      

      This caused replication to break:

      MariaDB [(none)]> show all slaves status \G;
      *************************** 1. row ***************************
                    Connection_name: dc2master
                    Slave_SQL_State:
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 10.55.20.41
                        Master_User: replicator
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.007285
                Read_Master_Log_Pos: 74024310
                     Relay_Log_File: relay-bin-dc2master.000026
                      Relay_Log_Pos: 113584
              Relay_Master_Log_File: mariadb-bin.007285
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB: db1,db2
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1950
                         Last_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 73459651
                    Relay_Log_Space: 354309
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1950
                     Last_SQL_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 7725
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377150,17-7725-2432887
               Retried_transactions: 0
                 Max_relay_log_size: 104857600
               Executed_log_entries: 14671605
          Slave_received_heartbeats: 22207
             Slave_heartbeat_period: 15.000
                     Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377677,11-7511-6574751,17-7725-2432353
      1 row in set (0.00 sec)
      

      Some things to note here:

      • The server_id is somehow set to 0, but the application is not setting the variable.
      • The GTID for this transaction has a gtid_domain_id of 15, which means that it should have originated on DC1-Node1. In that case, it should have server_id set to 7715.
      • DC1-Node1 has gtid_ignore_duplicates enabled
      • DC1-Node1 shouldn't try to re-run its own transaction when gtid_ignore_duplicates is enabled. It only seems to try to do so because the transaction's server_id is incorrect.

      The other events within the transaction do have the proper server_id set:

      #160301 14:56:32 server id 7715  end_log_pos 73459783   Table_map: `db3`.`tab1` mapped to number 108
      #160301 14:56:32 server id 7715  end_log_pos 73460738   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73461693   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73462648   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73463603   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73464558   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73465513   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73466468   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73467423   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73468378   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73469333   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73470288   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73471243   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73472198   Write_rows: table id 108
      

      When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:

      # at 98082295
      #160301 14:56:33 server id 7715  end_log_pos 98082333   GTID 15-7715-6377149
      /*!100001 SET @@session.gtid_domain_id=15*//*!*/;
      /*!100001 SET @@session.server_id=7715*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
      BEGIN
      /*!*/;
      # at 98082333
      # at 98082427
      # at 98083382
      # at 98084337
      # at 98085292
      # at 98086247
      # at 98087202
      # at 98088157
      # at 98089112
      # at 98090067
      # at 98091022
      # at 98091977
      # at 98092932
      # at 98093887
      # at 98094842
      # at 98095797
      # at 98096752
      # at 98097707
      # at 98098662
      # at 98099617
      # at 98100572
      # at 98101527
      # at 98102482
      # at 98103437
      # at 98104392
      # at 98105347
      

      The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave.

      Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node2 ran into an out-of-order sequence GTID error for a different transaction:

      MariaDB [(none)]> show all slaves status \G;
      *************************** 1. row ***************************
                    Connection_name: dc1master
                    Slave_SQL_State:
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 10.35.20.41
                        Master_User: replicator
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.006328
                Read_Master_Log_Pos: 44256878
                     Relay_Log_File: relay-bin-dc1master.000026
                      Relay_Log_Pos: 98265194
              Relay_Master_Log_File: mariadb-bin.006326
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB: db1,db2
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1950
                         Last_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 98406481
                    Relay_Log_Space: 254089255
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1950
                     Last_SQL_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 7715
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377636,17-7725-2432353
               Retried_transactions: 0
                 Max_relay_log_size: 104857600
               Executed_log_entries: 1371851
          Slave_received_heartbeats: 1276
             Slave_heartbeat_period: 15.000
                     Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432837
      1 row in set (0.00 sec)
      

      This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:

      # at 98406481
      #160301 14:56:33 server id 7715  end_log_pos 98406519   GTID 15-7715-6377150
      /*!100001 SET @@session.gtid_seq_no=6377150*//*!*/;
      BEGIN
      /*!*/;
      # at 98406519
      # at 98406613
      # at 98407568
      # at 98408523
      # at 98409478
      # at 98410433
      # at 98411388
      # at 98412343
      # at 98413298
      # at 98414253
      # at 98415208
      # at 98416163
      # at 98417118
      # at 98418073
      # at 98419028
      # at 98419983
      # at 98420938
      # at 98421893
      # at 98422848
      # at 98423803
      # at 98424758
      

      Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:

      160301 19:56:33 [ERROR] Master 'dc1master': Slave SQL: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Gtid 15-7715-6377150, Internal MariaDB error code: 1950
      160301 19:56:33 [Warning] Master 'dc1master': Slave: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Error_code: 1950
      160301 19:56:33 [ERROR] Master 'dc1master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.006326' position 98406481; GTID position '21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432353'
      160301 19:56:33 [ERROR] Master 'dc1master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 7.
      TRANSACTION 1046771904, not started
      mysql tables in use 1, locked 7
      MySQL thread id 4, OS thread handle 0x7fee9ea4f700, query id 20972814 Waiting for slave mutex on exit
       len 1280; hex XXXXXXXXXXXXXXX; asc  \n      !                      ('                      + ^                                                                                                              [                        Z                                                                                      '                                       xd>       V     xd>                            MySQLXid-         @                                                                                                             J % /                           b e                                                     h i     h                                                a                                                               !                        A     p$              7 ^                                                                                     H                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;
      

      After this, we noticed the following issues:

      • The broken slave thread held onto a slave mutex, which meant the slave couldn't be stopped or killed.
      • The broken slave thread had locks on some tables, which kept other transactions from using those tables.
      • The server wouldn't even cleanly shut down, so the server process was killed.

      The user is not using parallel replication.

      The user is using multi-source replication.

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin added a comment -

            There are some promising news. I've discovered one way to split a group into two a way we needed.
            I had to control the SQL thread manually with gdb though to simulate a situation
            as if the SQL thread reaches EOF having a not-NULL value of rli->ign_master_log_pos_end[0].

            I can't clarify it yet whether this is actually possible in any real execution. But the slave binlog shows
            the parts just perfectly.

            The Input:

            MariaDB [test]> show binlog events in 'master-bin.000001';
            ...
            | master-bin.000001 |  327 | Gtid              |         1 |         365 | BEGIN GTID 0-1-2                           |
            | master-bin.000001 |  365 | Table_map         |         1 |         408 | table_id: 70 (test.t2)                     |
            | master-bin.000001 |  408 | Write_rows_v1     |         1 |        1468 | table_id: 70                               |
            | master-bin.000001 | 1468 | Write_rows_v1     |         1 |        2528 | table_id: 70                               |
            | master-bin.000001 | 2528 | Write_rows_v1     |         1 |        3588 | table_id: 70 flags: STMT_END_F             |
            | master-bin.000001 | 3588 | Xid               |         1 |        3615 | COMMIT /* xid=17 */        
            

            and the output in the slave binlog:

            | mysqld-bin.000001 |  471 | Gtid              |         0 |         509 | BEGIN GTID 0-0-2                                                      |
            | mysqld-bin.000001 |  509 | Table_map         |         1 |         552 | table_id: 70 (test.t2)                                                |
            | mysqld-bin.000001 |  552 | Write_rows_v1     |         1 |        1612 | table_id: 70 flags: STMT_END_F                                        |
            | mysqld-bin.000001 | 1612 | Query             |         0 |        1672 | COMMIT                                                                |
            | mysqld-bin.000001 | 1672 | Gtid              |         1 |        1710 | BEGIN GTID 0-1-3                                                      |
            | mysqld-bin.000001 | 1710 | Table_map         |         1 |        1753 | table_id: 70 (test.t2)                                                |
            | mysqld-bin.000001 | 1753 | Write_rows_v1     |         1 |        2813 | table_id: 70                                                          |
            | mysqld-bin.000001 | 2813 | Write_rows_v1     |         1 |        3873 | table_id: 70 flags: STMT_END_F                                        |
            | mysqld-bin.000001 | 3873 | Xid               |         1 |        3900 | COMMIT /* xid=14 */         
            

            Now I am looking into how to create a real execution.

            Elkin Andrei Elkin added a comment - There are some promising news. I've discovered one way to split a group into two a way we needed. I had to control the SQL thread manually with gdb though to simulate a situation as if the SQL thread reaches EOF having a not-NULL value of rli->ign_master_log_pos_end [0] . I can't clarify it yet whether this is actually possible in any real execution. But the slave binlog shows the parts just perfectly. The Input: MariaDB [test]> show binlog events in 'master-bin.000001'; ... | master-bin.000001 | 327 | Gtid | 1 | 365 | BEGIN GTID 0-1-2 | | master-bin.000001 | 365 | Table_map | 1 | 408 | table_id: 70 (test.t2) | | master-bin.000001 | 408 | Write_rows_v1 | 1 | 1468 | table_id: 70 | | master-bin.000001 | 1468 | Write_rows_v1 | 1 | 2528 | table_id: 70 | | master-bin.000001 | 2528 | Write_rows_v1 | 1 | 3588 | table_id: 70 flags: STMT_END_F | | master-bin.000001 | 3588 | Xid | 1 | 3615 | COMMIT /* xid=17 */ and the output in the slave binlog: | mysqld-bin.000001 | 471 | Gtid | 0 | 509 | BEGIN GTID 0-0-2 | | mysqld-bin.000001 | 509 | Table_map | 1 | 552 | table_id: 70 (test.t2) | | mysqld-bin.000001 | 552 | Write_rows_v1 | 1 | 1612 | table_id: 70 flags: STMT_END_F | | mysqld-bin.000001 | 1612 | Query | 0 | 1672 | COMMIT | | mysqld-bin.000001 | 1672 | Gtid | 1 | 1710 | BEGIN GTID 0-1-3 | | mysqld-bin.000001 | 1710 | Table_map | 1 | 1753 | table_id: 70 (test.t2) | | mysqld-bin.000001 | 1753 | Write_rows_v1 | 1 | 2813 | table_id: 70 | | mysqld-bin.000001 | 2813 | Write_rows_v1 | 1 | 3873 | table_id: 70 flags: STMT_END_F | | mysqld-bin.000001 | 3873 | Xid | 1 | 3900 | COMMIT /* xid=14 */ Now I am looking into how to create a real execution.

            Problem was that in a circular replication setup the master remembers
            position to events it has generated itself when reading from a slave.
            If there are no new events in the queue from the slave, a
            Gtid_list_log_event is generated to remember the last skipped event.
            The problem happens if there is a network delay and we generate a
            Gtid_list_log_event in the middle of the transaction, in which case there
            will be an implicit comment and a new transaction with serverid=0 will be
            logged.

            The fix was to not generate any Gtid_list_log_events in the middle of a
            transaction.

            monty Michael Widenius added a comment - Problem was that in a circular replication setup the master remembers position to events it has generated itself when reading from a slave. If there are no new events in the queue from the slave, a Gtid_list_log_event is generated to remember the last skipped event. The problem happens if there is a network delay and we generate a Gtid_list_log_event in the middle of the transaction, in which case there will be an implicit comment and a new transaction with serverid=0 will be logged. The fix was to not generate any Gtid_list_log_events in the middle of a transaction.

            Fix pushed.
            Code from Andrei Elkin.

            monty Michael Widenius added a comment - Fix pushed. Code from Andrei Elkin.

            Test case fails if used with repeat , It fails with this error

            worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019                                                                                                                        
            rpl.circular_serverid0 'innodb,row'      [ pass ]   3301                                                                                                                                      
            rpl.circular_serverid0 'innodb,row'      [ 2 fail ]                                                                                                                                           
                    Test ended at 2018-07-17 15:53:36                                                                                                                                                     
                                                                                                                                                                                                          
            CURRENT_TEST: rpl.circular_serverid0                                                                                                                                                          
            --- /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.result    2018-07-17 14:26:54.744560346 +0530                                                                   
            +++ /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.reject    2018-07-17 15:53:36.162384904 +0530                                                                   
            @@ -14,11 +14,19 @@                                                                                                                                                                           
             INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024));                                                                                                                
             connection M2;                                                                                                                                                                               
             SET debug_sync='now WAIT_FOR start_sql_thread';                                                                                                                                              
            +Warnings:                                                                                                                                                                                    
            +Warning        1639    debug sync point wait timed out                                                                                                                                       
             START SLAVE SQL_THREAD;                                                                                                                                                                      
             include/show_binlog_events.inc                                                                                                                                                               
             Log_name       Pos     Event_type      Server_id       End_log_pos     Info                                                                                                                  
             slave-bin.000001       #       Gtid    #       #       GTID #-#-#                                                                                                                            
             slave-bin.000001       #       Query   #       #       use `test`; CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(30000)) ENGINE=innodb                                        
            +slave-bin.000001       #       Gtid    #       #       BEGIN GTID #-#-#                                                                                                                      
            +slave-bin.000001       #       Annotate_rows   #       #       INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024))                                                  
            +slave-bin.000001       #       Table_map       #       #       table_id: # (test.t1)                                                                                                         
            +slave-bin.000001       #       Write_rows_v1   #       #       table_id: #                                                                                                                   
            +slave-bin.000001       #       Write_rows_v1   #       #       table_id: # flags: STMT_END_F                                                                                                 
            +slave-bin.000001       #       Xid     #       #       COMMIT /* XID */                                                                                                                      
             SET debug_sync='now SIGNAL go_on_relay_logging';    
            

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - Test case fails if used with repeat , It fails with this error worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 rpl.circular_serverid0 'innodb,row' [ pass ] 3301 rpl.circular_serverid0 'innodb,row' [ 2 fail ] Test ended at 2018-07-17 15:53:36 CURRENT_TEST: rpl.circular_serverid0 --- /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.result 2018-07-17 14:26:54.744560346 +0530 +++ /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.reject 2018-07-17 15:53:36.162384904 +0530 @@ -14,11 +14,19 @@ INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024)); connection M2; SET debug_sync='now WAIT_FOR start_sql_thread'; +Warnings: +Warning 1639 debug sync point wait timed out START SLAVE SQL_THREAD; include/show_binlog_events.inc Log_name Pos Event_type Server_id End_log_pos Info slave-bin.000001 # Gtid # # GTID #-#-# slave-bin.000001 # Query # # use `test`; CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(30000)) ENGINE=innodb +slave-bin.000001 # Gtid # # BEGIN GTID #-#-# +slave-bin.000001 # Annotate_rows # # INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024)) +slave-bin.000001 # Table_map # # table_id: # (test.t1) +slave-bin.000001 # Write_rows_v1 # # table_id: # +slave-bin.000001 # Write_rows_v1 # # table_id: # flags: STMT_END_F +slave-bin.000001 # Xid # # COMMIT /* XID */ SET debug_sync='now SIGNAL go_on_relay_logging';

            --debug-sync-timeout=5 can be used in mtr argument to generate the failure early

            ./mtr circular_serverid0  --debug-sync-timeout=5 --repeat 2
            

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - --debug-sync-timeout=5 can be used in mtr argument to generate the failure early ./mtr circular_serverid0 --debug-sync-timeout=5 --repeat 2

            People

              monty Michael Widenius
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.