MariaDB Server
MDEV-9670

server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break

Details

    • 10.2.7-1

    Description

      A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms; in order to fix the problems, the server process had to be killed.

      The user has the following replication topology with 8 servers total in 2 data centers (4 nodes per DC). Node1 in each data center is a master, and the two masters replicate from each other using circular replication. Nodes 2, 3 and 7 in each data center are slaves with read-only enabled.

      The gtid_domain_id and server_id values are the following (a configuration sketch follows the two lists below):

      DC1:
      Gtid domain: 15
      Node1 server_id: 7715
      Node2 server_id: 7716
      Node3 server_id: 7717
      Node7 server_id: 7718

      DC2:
      Gtid domain: 17
      Node1 server_id: 7725
      Node2 server_id: 7726
      Node3 server_id: 7727
      Node7 server_id: 7728

      Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:

      # at 73459651
      #160301 14:56:32 server id 0  end_log_pos 73459689      GTID 15-0-6377149
      /*!100001 SET @@session.gtid_domain_id=15*//*!*/;
      /*!100001 SET @@session.server_id=0*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
      BEGIN
      /*!*/;
      # at 73459689
      # at 73459783
      # at 73460738
      # at 73461693
      # at 73462648
      # at 73463603
      # at 73464558
      # at 73465513
      # at 73466468
      # at 73467423
      # at 73468378
      # at 73469333
      # at 73470288
      # at 73471243
      # at 73472198
      # at 73473153
      # at 73474108
      

      This caused replication to break:

      MariaDB [(none)]> show all slaves status \G;
      *************************** 1. row ***************************
                    Connection_name: dc2master
                    Slave_SQL_State:
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 10.55.20.41
                        Master_User: replicator
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.007285
                Read_Master_Log_Pos: 74024310
                     Relay_Log_File: relay-bin-dc2master.000026
                      Relay_Log_Pos: 113584
              Relay_Master_Log_File: mariadb-bin.007285
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB: db1,db2
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1950
                         Last_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-order sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 73459651
                    Relay_Log_Space: 354309
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1950
                     Last_SQL_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-order sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 7725
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377150,17-7725-2432887
               Retried_transactions: 0
                 Max_relay_log_size: 104857600
               Executed_log_entries: 14671605
          Slave_received_heartbeats: 22207
             Slave_heartbeat_period: 15.000
                     Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377677,11-7511-6574751,17-7725-2432353
      1 row in set (0.00 sec)
      

      Some things to note here:

      • The server_id is somehow set to 0, but the application is not setting the variable.
      • The GTID for this transaction has a gtid_domain_id of 15, which means that it should have originated on DC1-Node1. In that case, it should have server_id set to 7715.
      • DC1-Node1 has gtid_ignore_duplicates enabled (a quick way to verify the relevant settings is sketched right after this list).
      • DC1-Node1 shouldn't try to re-run its own transaction when gtid_ignore_duplicates is enabled. It only seems to try to do so because the transaction's server_id is incorrect.
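
      As a side note, the settings relevant to these observations can be checked directly on DC1-Node1. This is only a sketch; the expected values (7715, 15, ON, ON) are inferred from the report and the error messages above:

      -- run on DC1-Node1
      SELECT @@global.server_id,
             @@global.gtid_domain_id,
             @@global.gtid_ignore_duplicates,
             @@global.gtid_strict_mode;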

      The other events within the transaction do have the proper server_id set:

      #160301 14:56:32 server id 7715  end_log_pos 73459783   Table_map: `db3`.`tab1` mapped to number 108
      #160301 14:56:32 server id 7715  end_log_pos 73460738   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73461693   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73462648   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73463603   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73464558   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73465513   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73466468   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73467423   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73468378   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73469333   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73470288   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73471243   Write_rows: table id 108
      #160301 14:56:32 server id 7715  end_log_pos 73472198   Write_rows: table id 108
      

      When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:

      # at 98082295
      #160301 14:56:33 server id 7715  end_log_pos 98082333   GTID 15-7715-6377149
      /*!100001 SET @@session.gtid_domain_id=15*//*!*/;
      /*!100001 SET @@session.server_id=7715*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
      BEGIN
      /*!*/;
      # at 98082333
      # at 98082427
      # at 98083382
      # at 98084337
      # at 98085292
      # at 98086247
      # at 98087202
      # at 98088157
      # at 98089112
      # at 98090067
      # at 98091022
      # at 98091977
      # at 98092932
      # at 98093887
      # at 98094842
      # at 98095797
      # at 98096752
      # at 98097707
      # at 98098662
      # at 98099617
      # at 98100572
      # at 98101527
      # at 98102482
      # at 98103437
      # at 98104392
      # at 98105347
      

      The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave.
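
      For reference, a minimal sketch of the commands that workaround implies, assuming the dc2master connection name from the status output above (with multi-source replication, sql_slave_skip_counter applies to the connection selected via default_master_connection):

      STOP SLAVE 'dc2master';
      SET @@default_master_connection = 'dc2master';
      SET GLOBAL sql_slave_skip_counter = 1;
      START SLAVE 'dc2master';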

      Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node1 ran into an out-of-order sequence GTID error for a different transaction:

      MariaDB [(none)]> show all slaves status \G;
      *************************** 1. row ***************************
                    Connection_name: dc1master
                    Slave_SQL_State:
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 10.35.20.41
                        Master_User: replicator
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.006328
                Read_Master_Log_Pos: 44256878
                     Relay_Log_File: relay-bin-dc1master.000026
                      Relay_Log_Pos: 98265194
              Relay_Master_Log_File: mariadb-bin.006326
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB: db1,db2
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1950
                         Last_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 98406481
                    Relay_Log_Space: 254089255
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1950
                     Last_SQL_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 7715
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377636,17-7725-2432353
               Retried_transactions: 0
                 Max_relay_log_size: 104857600
               Executed_log_entries: 1371851
          Slave_received_heartbeats: 1276
             Slave_heartbeat_period: 15.000
                     Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432837
      1 row in set (0.00 sec)
      

      This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:

      # at 98406481
      #160301 14:56:33 server id 7715  end_log_pos 98406519   GTID 15-7715-6377150
      /*!100001 SET @@session.gtid_seq_no=6377150*//*!*/;
      BEGIN
      /*!*/;
      # at 98406519
      # at 98406613
      # at 98407568
      # at 98408523
      # at 98409478
      # at 98410433
      # at 98411388
      # at 98412343
      # at 98413298
      # at 98414253
      # at 98415208
      # at 98416163
      # at 98417118
      # at 98418073
      # at 98419028
      # at 98419983
      # at 98420938
      # at 98421893
      # at 98422848
      # at 98423803
      # at 98424758
      

      Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:

      160301 19:56:33 [ERROR] Master 'dc1master': Slave SQL: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Gtid 15-7715-6377150, Internal MariaDB error code: 1950
      160301 19:56:33 [Warning] Master 'dc1master': Slave: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Error_code: 1950
      160301 19:56:33 [ERROR] Master 'dc1master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.006326' position 98406481; GTID position '21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432353'
      160301 19:56:33 [ERROR] Master 'dc1master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 7.
      TRANSACTION 1046771904, not started
      mysql tables in use 1, locked 7
      MySQL thread id 4, OS thread handle 0x7fee9ea4f700, query id 20972814 Waiting for slave mutex on exit
       len 1280; hex XXXXXXXXXXXXXXX; asc ... (remainder of the thd hex/ASCII dump is unreadable in this paste and omitted)
      

      After this, we noticed the following issues:

      • The broken slave thread held onto a slave mutex, which meant the slave couldn't be stopped or killed.
      • The broken slave thread had locks on some tables, which kept other transactions from using those tables.
      • The server wouldn't even cleanly shut down, so the server process was killed.

      The user is not using parallel replication.

      The user is using multi-source replication.

Activity

            elenst Elena Stepanova added a comment - - edited

            support issue


            knielsen Kristian Nielsen added a comment -

            Server id 0 is never a valid value. But without some indication of what
            causes such an id 0 to appear, it is hard to say more: is it binlog or network
            corruption, or do some specific steps or queries lead to it?

            It is mentioned that gtid_ignore_duplicates and multi-source replication are
            used. However, the described topology is not multi-source, and should not
            need (or use) gtid_ignore_duplicates. This suggests either an incomplete
            description, or a misconfiguration?

            GeoffMontee Geoff Montee (Inactive) added a comment - - edited

            Hi knielsen,

            Server id 0 is never a valid value. But without some indication of what
            causes such id 0 to appear, it is hard to say more - is it binlog or network
            corruption, or some specific steps or queries lead to it?

            The binary logs are available for you to look at, if that would help. Information on how to access the logs was in the email that I sent to the engineering mailing list yesterday.

            If any other information would help, please let me know.

            It is mentioned that gtid_ignore_duplicates and multi-source replication are
            used. However, the described topology is not multi-source, and should not
            need (or use) gtid_ignore_duplicates. This suggests either an incomplete
            description, or a misconfiguration?

            The user is not using multi-source replication in the sense that they have multiple replication sources configured. The Node1 servers in the two data centers only replicate from each other, so each server only has one master.

            When I said that they are using multi-source replication, I only meant that their replication was configured using the new named connection syntax added for multi-source replication in MariaDB 10.0. For example, you can see that the connections are named dc1master and dc2master in the SHOW SLAVE STATUS output above.

            I do not know why gtid_ignore_duplicates is enabled. It may have been enabled in hopes that it would somehow fix this strange problem, since the user is running into the problem fairly often.
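
            For readers unfamiliar with that syntax, a minimal sketch of the named-connection commands follows. The connection names are the ones visible in the status output above; the statements themselves are illustrative, not taken from the user's configuration:

            SHOW SLAVE 'dc2master' STATUS;   -- status of one named connection
            STOP SLAVE 'dc2master';
            START SLAVE 'dc2master';
            SHOW ALL SLAVES STATUS;          -- all named connections at once
            START ALL SLAVES;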

            mneth1 Matt Neth added a comment -

            I uploaded some error logs/core dump/binlogs to ftp.askmonty.org/private/ to try to help with this bug:

            5-24-totalcrash.tar.gz - The MySQL process totally crashed, and when it restarted, all of the slave connections were stopped with the out-of-order error from MDEV-9952, but the transaction it stopped on had the 0 server ID bug from this issue. There's a README in the tarball that explains things further. I unfortunately did not get a core dump for this one. The transaction in the server's relay log has the right GTID with the correct server ID, but in the binlog the server ID became 0.

            5-24-totalcrash-M3.tar.gz - Same issue as above, but this happened on a different server that is not using multisource replication. It just has one slave connection. There's a README in the tarball with more info as well. This one also has a core dump and binary in it.

            Elkin Andrei Elkin added a comment - - edited

            I've looked through the latest logs. Alas, only 5-24-totalcrash.tar.gz contains some interesting data.
            Namely, M4 was processing a single 2-2-seqno group and binlogged part of it as 2-0-seqno. There are no group-split traces in the binlog this time. Without the error log it's impossible to judge whether the missing part is still due to a split (and was sent to "/dev/null" for some reason), or whether the remaining Update events were simply ineffective.

            mneth1 Matt Neth added a comment -

            I uploaded MDEV9670-m4-errorlog-large to the FTP, which is the entire error output from the day that the 5-24-totalcrash.tar.gz crash happened; let me know if that helps.

            mneth1 Matt Neth added a comment -

            I also uploaded MDEV9952-6-8-outoforder-M4.tar.gz and MDEV9952-6-8-outoforder-M2.tar.gz a few days ago for Monty. The out-of-order error happened on both M2 and M4 at the same time, and the M4 error log output showed the server ID 0 bug as well. Both of those have binlogs/relay logs/error logs, so hopefully they can provide some good info.


            knielsen Kristian Nielsen added a comment -

            If it is about partial logging of a group (the last part?) and server_id=0, I would suggest looking at MDEV-10863.

            That bug can cause replication to start in the middle of an event group. Not sure if this could cause server_id=0, though, hard to say...

            Elkin Andrei Elkin added a comment - - edited

            Kristian, hello!

            Thanks for the pointer, but if the referenced ticket is entirely about parallel replication:

            MDEV-10863: parallel replication tries to continue from wrong position

            then it can't be related.

            To share more with you: the server_id 0 group is completed with a COMMIT Query_log_event, not an Xid_log_event. And in one of the two such cases the zero-server_id group was followed by a correct-server_id one, so the total number of output events (as seen in the binlog) equals the number of events in the single input group (as seen in the relay log).

            Andrei


            knielsen Kristian Nielsen added a comment -

            Hm, I see.

            So it sounds like something in the code is somehow flushing the binlog cache to the binlog in the middle of the event group, in a completely incorrect way that does not create the proper GTID event, nor the correct commit event. Can't think of what it could be, unfortunately...

            Elkin Andrei Elkin added a comment - - edited

            Yes, this premature-flushing cause has been my main suspect.

            The sources are quite explicit about where server_id is set to zero, and in our case I would point to just two places in next_event() which generate fake Rotate or Gtid_list events (crucially, this explains the absence of any server_id=0 events in the relay log). However, I could not yet build a theory involving either one.

            Regarding the dilemma I left in my previous comments above, the recent detailed error log for 5-24-totalcrash hints that the split indeed took place, and that the 2nd piece was the transaction that asserted as

            InnoDB: Assertion failure in thread 139680293160704 in file trx0rec.cc line 1242
            InnoDB: Failing assertion: DICT_TF2_FLAG_IS_SET(index->table, DICT_TF2_TEMPORARY)
             
             
            /usr/sbin/mysqld(_ZN7handler13ha_update_rowEPKhPh+0x412)[0x746ad2]
            /usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0x151)[0x820511]
            /usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x27d)[0x811fed]
            /usr/sbin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14rpl_group_infoP19rpl_parallel_thread+0x16d)[0x55b90d]
            /usr/sbin/mysqld(handle_slave_sql+0x1754)[0x55e234]
            ??
            

            which eventually got rolled back by InnoDB/server recovery:

            InnoDB: Doing recovery: scanned up to log sequence number 462050329339
            InnoDB: Transaction 447118455 was in the XA prepared state.
            InnoDB: Transaction 447118455 was in the XA prepared state.
            InnoDB: 2 transaction(s) which must be rolled back or cleaned up
            InnoDB: in total 1831 row operations to undo
            InnoDB: Trx id counter is 447118848
            170524  8:48:22 [Note] InnoDB: Starting an apply batch of log records to the database...
            InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
            InnoDB: Apply batch completed
            InnoDB: Last MySQL binlog file position 0 115938667, file name ./wopr-binlog.000038
            InnoDB: Starting in background the rollback of uncommitted transactions
            170524  8:48:41 [Note] InnoDB: 128 rollback segment(s) are active.
            2017-05-24 08:48:41 7f81fbffe700  InnoDB: Rolling back trx with id 447118354, 1831 rows to undo
             
            InnoDB: Progress in percents: 1170524  8:48:41 [Note] InnoDB: Waiting for purge to start
             2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100170524  8:48:41 [Note] InnoDB: Rollback of trx with id 447118354 completed
            2017-05-24 08:48:41 7f81fbffe700  InnoDB: Rollback of non-prepared transactions completed
            ...
            170524  8:48:41 [Note] Starting crash recovery...
            2017-05-24 08:48:41 7f9d40b218c0  InnoDB: Starting recovery for XA transactions...
            2017-05-24 08:48:41 7f9d40b218c0  InnoDB: Transaction 447118455 in prepared state after recovery
            2017-05-24 08:48:41 7f9d40b218c0  InnoDB: Transaction contains changes to 3 rows
            2017-05-24 08:48:41 7f9d40b218c0  InnoDB: 1 transactions in prepared state after recovery
            170524  8:48:41 [Note] Found 1 prepared transaction(s) in InnoDB
            170524  8:48:41 [Note] Crash recovery finished.
            
            

            Notice that the 1831 rows to undo is reasonably close to the number of missing row events (needed to reach the same total as the input).

            Elkin Andrei Elkin added a comment - - edited

            Matt, thanks for uploading MDEV9670-m4-errorlog-large, a more informative piece of log. It was already helpful. Yet do you have
            records of what was happening before this assert was reported:

            2017-05-24 08:45:38 7f09da418700  InnoDB: Assertion failure in thread 139680293160704 in file trx0rec.cc line 1242
            

            ?

            The newer file, alas, starts from this line, so we can't see what might have caused an apparent premature commit of the 1st "part" of this split event group.

            Ideally it would be great to see all reported incidents in the error log since the time the slave service was started, and actually for all connections of the multi-sourced slave server, as we can't yet exclude interference between different sources.

            mneth1 Matt Neth added a comment -

            That's all that was in the log from around that time. That was the first entry in the log for that day and for that crash. I went ahead and uploaded the full current errorlog to the FTP as MDEV9670-full-errorlog-M4.

            Elkin Andrei Elkin added a comment -

            There is some promising news: I've discovered one way to split a group into two in the way we needed.
            I had to control the SQL thread manually with gdb, though, to simulate a situation
            where the SQL thread reaches EOF while holding a not-NULL value of rli->ign_master_log_pos_end[0].

            I can't yet confirm whether this is actually possible in any real execution, but the slave binlog shows
            the two parts perfectly.

            The Input:

            MariaDB [test]> show binlog events in 'master-bin.000001';
            ...
            | master-bin.000001 |  327 | Gtid              |         1 |         365 | BEGIN GTID 0-1-2                           |
            | master-bin.000001 |  365 | Table_map         |         1 |         408 | table_id: 70 (test.t2)                     |
            | master-bin.000001 |  408 | Write_rows_v1     |         1 |        1468 | table_id: 70                               |
            | master-bin.000001 | 1468 | Write_rows_v1     |         1 |        2528 | table_id: 70                               |
            | master-bin.000001 | 2528 | Write_rows_v1     |         1 |        3588 | table_id: 70 flags: STMT_END_F             |
            | master-bin.000001 | 3588 | Xid               |         1 |        3615 | COMMIT /* xid=17 */        
            

            and the output in the slave binlog:

            | mysqld-bin.000001 |  471 | Gtid              |         0 |         509 | BEGIN GTID 0-0-2                                                      |
            | mysqld-bin.000001 |  509 | Table_map         |         1 |         552 | table_id: 70 (test.t2)                                                |
            | mysqld-bin.000001 |  552 | Write_rows_v1     |         1 |        1612 | table_id: 70 flags: STMT_END_F                                        |
            | mysqld-bin.000001 | 1612 | Query             |         0 |        1672 | COMMIT                                                                |
            | mysqld-bin.000001 | 1672 | Gtid              |         1 |        1710 | BEGIN GTID 0-1-3                                                      |
            | mysqld-bin.000001 | 1710 | Table_map         |         1 |        1753 | table_id: 70 (test.t2)                                                |
            | mysqld-bin.000001 | 1753 | Write_rows_v1     |         1 |        2813 | table_id: 70                                                          |
            | mysqld-bin.000001 | 2813 | Write_rows_v1     |         1 |        3873 | table_id: 70 flags: STMT_END_F                                        |
            | mysqld-bin.000001 | 3873 | Xid               |         1 |        3900 | COMMIT /* xid=14 */         
            

            Now I am looking into how to create a real execution.


            monty Michael Widenius added a comment -

            The problem was that in a circular replication setup the master remembers the
            position of events it has generated itself when reading from a slave.
            If there are no new events in the queue from the slave, a
            Gtid_list_log_event is generated to remember the last skipped event.
            The problem happens if there is a network delay and we generate a
            Gtid_list_log_event in the middle of a transaction, in which case there
            will be an implicit commit and a new transaction with server_id=0 will be
            logged.

            The fix was to not generate any Gtid_list_log_events in the middle of a
            transaction.


            monty Michael Widenius added a comment -

            Fix pushed.
            Code from Andrei Elkin.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            The test case fails if run with --repeat; it fails with this error:

            worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019                                                                                                                        
            rpl.circular_serverid0 'innodb,row'      [ pass ]   3301                                                                                                                                      
            rpl.circular_serverid0 'innodb,row'      [ 2 fail ]                                                                                                                                           
                    Test ended at 2018-07-17 15:53:36                                                                                                                                                     
                                                                                                                                                                                                          
            CURRENT_TEST: rpl.circular_serverid0                                                                                                                                                          
            --- /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.result    2018-07-17 14:26:54.744560346 +0530                                                                   
            +++ /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.reject    2018-07-17 15:53:36.162384904 +0530                                                                   
            @@ -14,11 +14,19 @@                                                                                                                                                                           
             INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024));                                                                                                                
             connection M2;                                                                                                                                                                               
             SET debug_sync='now WAIT_FOR start_sql_thread';                                                                                                                                              
            +Warnings:                                                                                                                                                                                    
            +Warning        1639    debug sync point wait timed out                                                                                                                                       
             START SLAVE SQL_THREAD;                                                                                                                                                                      
             include/show_binlog_events.inc                                                                                                                                                               
             Log_name       Pos     Event_type      Server_id       End_log_pos     Info                                                                                                                  
             slave-bin.000001       #       Gtid    #       #       GTID #-#-#                                                                                                                            
             slave-bin.000001       #       Query   #       #       use `test`; CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(30000)) ENGINE=innodb                                        
            +slave-bin.000001       #       Gtid    #       #       BEGIN GTID #-#-#                                                                                                                      
            +slave-bin.000001       #       Annotate_rows   #       #       INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024))                                                  
            +slave-bin.000001       #       Table_map       #       #       table_id: # (test.t1)                                                                                                         
            +slave-bin.000001       #       Write_rows_v1   #       #       table_id: #                                                                                                                   
            +slave-bin.000001       #       Write_rows_v1   #       #       table_id: # flags: STMT_END_F                                                                                                 
            +slave-bin.000001       #       Xid     #       #       COMMIT /* XID */                                                                                                                      
             SET debug_sync='now SIGNAL go_on_relay_logging';    
            


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            --debug-sync-timeout=5 can be used as an mtr argument to reproduce the failure quickly:

            ./mtr circular_serverid0  --debug-sync-timeout=5 --repeat 2
            


            People

              monty Michael Widenius
              GeoffMontee Geoff Montee (Inactive)
              Votes: 2
              Watchers: 14
