[MDEV-9670] server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break Created: 2016-03-02  Updated: 2024-01-17  Resolved: 2017-07-02

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.22
Fix Version/s: 10.0.32

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Michael Widenius
Resolution: Fixed Votes: 2
Labels: gtid, multisource, replication

Issue Links:
Duplicate
is duplicated by MDEV-9952 Strange out-of-order sequence errors ... Closed
Relates
relates to MDEV-9952 Strange out-of-order sequence errors ... Closed
relates to MDEV-10067 Replication thread sporadically hangi... Open
relates to MDEV-33268 IO Thread Can Write Gtid_list_log_eve... Open
Sprint: 10.2.7-1

 Description   

A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms; in order to recover from the problem, the server process had to be killed.

The user has the following replication topology: 8 servers total in 2 data centers (4 nodes per DC). Node1 in each data center is a master, and the two masters replicate from each other using circular replication. Nodes 2, 3, and 7 in each data center are slaves with read-only enabled.

The gtid_domain_id and server_id values are the following:

DC1:
Gtid domain: 15
Node1 server_id: 7715
Node2 server_id: 7716
Node3 server_id: 7717
Node7 server_id: 7718

DC2:
Gtid domain: 17
Node1 server_id: 7725
Node2 server_id: 7726
Node3 server_id: 7727
Node7 server_id: 7728

Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:

# at 73459651
#160301 14:56:32 server id 0  end_log_pos 73459689      GTID 15-0-6377149
/*!100001 SET @@session.gtid_domain_id=15*//*!*/;
/*!100001 SET @@session.server_id=0*//*!*/;
/*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
BEGIN
/*!*/;
# at 73459689
# at 73459783
# at 73460738
# at 73461693
# at 73462648
# at 73463603
# at 73464558
# at 73465513
# at 73466468
# at 73467423
# at 73468378
# at 73469333
# at 73470288
# at 73471243
# at 73472198
# at 73473153
# at 73474108

This caused replication to break:

MariaDB [(none)]> show all slaves status \G;
*************************** 1. row ***************************
              Connection_name: dc2master
              Slave_SQL_State:
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.55.20.41
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.007285
          Read_Master_Log_Pos: 74024310
               Relay_Log_File: relay-bin-dc2master.000026
                Relay_Log_Pos: 113584
        Relay_Master_Log_File: mariadb-bin.007285
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB: db1,db2
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1950
                   Last_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-order sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 73459651
              Relay_Log_Space: 354309
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1950
               Last_SQL_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-order sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 7725
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377150,17-7725-2432887
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 14671605
    Slave_received_heartbeats: 22207
       Slave_heartbeat_period: 15.000
               Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377677,11-7511-6574751,17-7725-2432353
1 row in set (0.00 sec)

Some things to note here:

  • The server_id is somehow set to 0, even though the application is not setting the variable.
  • The GTID for this transaction has a gtid_domain_id of 15, which means that it should have originated on DC1-Node1. In that case, it should have server_id set to 7715.
  • DC1-Node1 has gtid_ignore_duplicates enabled.
  • DC1-Node1 shouldn't try to re-run its own transaction when gtid_ignore_duplicates is enabled. It only seems to do so because the transaction's server_id is incorrect.
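
As background on the error above: a MariaDB GTID is the triple domain-server_id-seq_no, and with gtid_strict_mode the sequence numbers logged within one domain must be strictly increasing. The following is a minimal sketch (not the MariaDB source) of why 15-0-6377149 collides with the already-logged 15-7715-6377149:

```python
# Sketch of MariaDB's GTID triple and the gtid_strict_mode ordering check.
# This is an illustrative model, not the server's actual implementation.

def parse_gtid(gtid):
    """Split 'domain-server_id-seq_no' into an integer triple."""
    domain, server_id, seq_no = (int(part) for part in gtid.split("-"))
    return domain, server_id, seq_no

def strict_mode_check(last_seq_no_per_domain, gtid):
    """Return None if the GTID may be binlogged, else an error string."""
    domain, server_id, seq_no = parse_gtid(gtid)
    last = last_seq_no_per_domain.get(domain)
    if last is not None and seq_no <= last:
        return ("An attempt was made to binlog GTID %s which would create "
                "an out-of-order sequence number" % gtid)
    last_seq_no_per_domain[domain] = seq_no
    return None

state = {}
print(strict_mode_check(state, "15-7715-6377149"))  # → None (accepted)
print(strict_mode_check(state, "15-0-6377149"))     # → error string (rejected)
```

Note that server_id identifies the originating server but does not factor into the ordering check, which is why the bogus server_id of 0 still collides on the repeated seq_no 6377149.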

The other events within the transaction do have the proper server_id set:

#160301 14:56:32 server id 7715  end_log_pos 73459783   Table_map: `db3`.`tab1` mapped to number 108
#160301 14:56:32 server id 7715  end_log_pos 73460738   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73461693   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73462648   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73463603   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73464558   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73465513   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73466468   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73467423   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73468378   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73469333   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73470288   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73471243   Write_rows: table id 108
#160301 14:56:32 server id 7715  end_log_pos 73472198   Write_rows: table id 108

When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:

# at 98082295
#160301 14:56:33 server id 7715  end_log_pos 98082333   GTID 15-7715-6377149
/*!100001 SET @@session.gtid_domain_id=15*//*!*/;
/*!100001 SET @@session.server_id=7715*//*!*/;
/*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
BEGIN
/*!*/;
# at 98082333
# at 98082427
# at 98083382
# at 98084337
# at 98085292
# at 98086247
# at 98087202
# at 98088157
# at 98089112
# at 98090067
# at 98091022
# at 98091977
# at 98092932
# at 98093887
# at 98094842
# at 98095797
# at 98096752
# at 98097707
# at 98098662
# at 98099617
# at 98100572
# at 98101527
# at 98102482
# at 98103437
# at 98104392
# at 98105347

The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave.
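
The skip procedure described above corresponds to roughly the following statements (a sketch; note that with named connections, sql_slave_skip_counter applies to the connection selected by default_master_connection, and the connection name dc2master is taken from the status output above):

```sql
SET @@default_master_connection = 'dc2master';
STOP SLAVE 'dc2master';
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE 'dc2master';
```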

Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node1 ran into an out-of-order sequence GTID error for a different transaction:

MariaDB [(none)]> show all slaves status \G;
*************************** 1. row ***************************
              Connection_name: dc1master
              Slave_SQL_State:
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.35.20.41
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.006328
          Read_Master_Log_Pos: 44256878
               Relay_Log_File: relay-bin-dc1master.000026
                Relay_Log_Pos: 98265194
        Relay_Master_Log_File: mariadb-bin.006326
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB: db1,db2
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1950
                   Last_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 98406481
              Relay_Log_Space: 254089255
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1950
               Last_SQL_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 7715
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377636,17-7725-2432353
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 1371851
    Slave_received_heartbeats: 1276
       Slave_heartbeat_period: 15.000
               Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432837
1 row in set (0.00 sec)

This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:

# at 98406481
#160301 14:56:33 server id 7715  end_log_pos 98406519   GTID 15-7715-6377150
/*!100001 SET @@session.gtid_seq_no=6377150*//*!*/;
BEGIN
/*!*/;
# at 98406519
# at 98406613
# at 98407568
# at 98408523
# at 98409478
# at 98410433
# at 98411388
# at 98412343
# at 98413298
# at 98414253
# at 98415208
# at 98416163
# at 98417118
# at 98418073
# at 98419028
# at 98419983
# at 98420938
# at 98421893
# at 98422848
# at 98423803
# at 98424758

Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:

160301 19:56:33 [ERROR] Master 'dc1master': Slave SQL: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Gtid 15-7715-6377150, Internal MariaDB error code: 1950
160301 19:56:33 [Warning] Master 'dc1master': Slave: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Error_code: 1950
160301 19:56:33 [ERROR] Master 'dc1master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.006326' position 98406481; GTID position '21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432353'
160301 19:56:33 [ERROR] Master 'dc1master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 7.
TRANSACTION 1046771904, not started
mysql tables in use 1, locked 7
MySQL thread id 4, OS thread handle 0x7fee9ea4f700, query id 20972814 Waiting for slave mutex on exit
 len 1280; hex XXXXXXXXXXXXXXX; asc ... (unreadable record dump omitted)

After this, we noticed the following issues:

  • The broken slave thread held onto a slave mutex, which meant the slave couldn't be stopped or killed.
  • The broken slave thread had locks on some tables, which kept other transactions from using those tables.
  • The server wouldn't even cleanly shut down, so the server process was killed.

The user is not using parallel replication.

The user is using multi-source replication.



 Comments   
Comment by Elena Stepanova [ 2016-03-02 ]

support issue

Comment by Kristian Nielsen [ 2016-03-02 ]

Server id 0 is never a valid value. But without some indication of what
causes such id 0 to appear, it is hard to say more - is it binlog or network
corruption, or some specific steps or queries lead to it?

It is mentioned that gtid_ignore_duplicates and multi-source replication are
used. However, the described topology is not multi-source, and should not
need (or use) gtid_ignore_duplicates. This suggests either an incomplete
description, or a misconfiguration?

Comment by Geoff Montee (Inactive) [ 2016-03-02 ]

Hi knielsen,

Server id 0 is never a valid value. But without some indication of what
causes such id 0 to appear, it is hard to say more - is it binlog or network
corruption, or some specific steps or queries lead to it?

The binary logs are available for you to look at, if that would help. Information on how to access the logs was in the email that I sent to the engineering mailing list yesterday.

If any other information would help, please let me know.

It is mentioned that gtid_ignore_duplicates and multi-source replication are
used. However, the described topology is not multi-source, and should not
need (or use) gtid_ignore_duplicates. This suggests either an incomplete
description, or a misconfiguration?

The user is not using multi-source replication in the sense of having multiple replication sources configured on one server. The Node1 servers in each data center only replicate from each other, so each server has only one master.

When I said that they are using multi-source replication, I only meant that their replication was configured using the new named connection syntax added for multi-source replication in MariaDB 10.0. For example, you can see that the connections are named dc1master and dc2master in the SHOW SLAVE STATUS output above.

I do not know why gtid_ignore_duplicates is enabled. It may have been enabled in hopes that it would somehow fix this strange problem, since the user is running into the problem fairly often.
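
For reference, the named-connection syntax mentioned above looks roughly like the following (the credentials are placeholders taken from the status output; the actual options used by this user are not known):

```sql
CHANGE MASTER 'dc2master' TO
  MASTER_HOST = '10.55.20.41',
  MASTER_USER = 'replicator',
  MASTER_PORT = 3306,
  MASTER_USE_GTID = slave_pos;
START SLAVE 'dc2master';
SHOW ALL SLAVES STATUS \G
```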

Comment by Matt Neth [ 2017-05-25 ]

I uploaded some error logs/core dump/binlogs to ftp.askmonty.org/private/ to try to help with this bug:

5-24-totalcrash.tar.gz - The MySQL process totally crashed, and when it restarted, all of the slave connections were stopped with the out-of-order error from MDEV-9952, but the transaction it stopped on had the 0 server ID bug from this one. There's a README in the tarball that explains things further. I unfortunately did not get a core dump for this one. The transaction in the server's relay log has the right GTID with the correct server ID, but in the binlog the server ID became 0.

5-24-totalcrash-M3.tar.gz - Same issue as above, but this happened on a different server that is not using multisource replication. It just has one slave connection. There's a README in the tarball with more info as well. This one also has a core dump and binary in it.

Comment by Andrei Elkin [ 2017-06-21 ]

I've looked through the latest logs. Alas, only 5-24-totalcrash.tar.gz contains some interesting data.
Namely, M4 was processing a single 2-2-seqno group and binlogged part of it as 2-0-seqno. This time there are no group-split traces in the binlog. Without the error log it's impossible to judge whether the missing part is still due to a split (and was sent to "/dev/null" for some reason), or whether the remaining Update events were simply ineffective.

Comment by Matt Neth [ 2017-06-21 ]

I uploaded MDEV9670-m4-errorlog-large to the FTP which is the entire error output from the day that 5-24-totalcrash.tar.gz happened, let me know if that helps.

Comment by Matt Neth [ 2017-06-21 ]

I also uploaded MDEV9952-6-8-outoforder-M4.tar.gz and MDEV9952-6-8-outoforder-M2.tar.gz a few days ago for Monty. The out of order happened on both M2 and M4 at the same time, and the M4 errorlog output showed the server ID 0 bug as well. Both of those have binlogs/relay logs/error logs so hopefully they can provide some good info.

Comment by Kristian Nielsen [ 2017-06-21 ]

If it is about partial logging of a group (the last part?) and server_id=0, I would suggest looking at MDEV-10863.

That bug can cause replication to start in the middle of an event group. Not sure if this could cause server_id=0, though, hard to say...

Comment by Andrei Elkin [ 2017-06-21 ]

Kristian, hello !

Thanks for the pointer, but the referenced ticket is entirely about parallel replication:

_ MariaDB Server MDEV-10863 parallel replication tries to continue from wrong position _

so it can't be related.

To share more with you: the server_id 0 group is completed with a COMMIT Query_log_event, not an Xid_log_event. And in one of the two such cases, the zero-server_id group was followed by a correct-server_id one, so the total number of output events (as seen in the binlog) equals the number of events in the single input group (as seen in the relay log).

Andrei

Comment by Kristian Nielsen [ 2017-06-21 ]

Hm, I see.

So it sounds like something in the code is somehow flushing the binlog cache to the binlog in the middle of the event group, in a completely incorrect way that does not create the proper GTID event, nor the correct commit event. Can't think of what it could be, unfortunately...

Comment by Andrei Elkin [ 2017-06-22 ]

Yes, this premature flushing has been my main suspect.

The sources are quite explicit about where server_id is set to zero, and in our case I would point to just two places in next_event() which generate fake Rotate or Gtid_list events (this is crucial, as it explains the absence of any server_id=0 events in the relay log).
However, I have not yet been able to build a theory involving either one.

To resolve the dilemma I left in my previous comments above,
a recent detailed error log for 5-24-totalcrash hints that the split indeed took place,
and the 2nd piece was the transaction that asserted as

InnoDB: Assertion failure in thread 139680293160704 in file trx0rec.cc line 1242
InnoDB: Failing assertion: DICT_TF2_FLAG_IS_SET(index->table, DICT_TF2_TEMPORARY)
 
 
/usr/sbin/mysqld(_ZN7handler13ha_update_rowEPKhPh+0x412)[0x746ad2]
/usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0x151)[0x820511]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x27d)[0x811fed]
/usr/sbin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14rpl_group_infoP19rpl_parallel_thread+0x16d)[0x55b90d]
/usr/sbin/mysqld(handle_slave_sql+0x1754)[0x55e234]
??

which eventually got rolled back by Innodb/server recovery:

InnoDB: Doing recovery: scanned up to log sequence number 462050329339
InnoDB: Transaction 447118455 was in the XA prepared state.
InnoDB: Transaction 447118455 was in the XA prepared state.
InnoDB: 2 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 1831 row operations to undo
InnoDB: Trx id counter is 447118848
170524  8:48:22 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Last MySQL binlog file position 0 115938667, file name ./wopr-binlog.000038
InnoDB: Starting in background the rollback of uncommitted transactions
170524  8:48:41 [Note] InnoDB: 128 rollback segment(s) are active.
2017-05-24 08:48:41 7f81fbffe700  InnoDB: Rolling back trx with id 447118354, 1831 rows to undo
 
InnoDB: Progress in percents: 1170524  8:48:41 [Note] InnoDB: Waiting for purge to start
 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100170524  8:48:41 [Note] InnoDB: Rollback of trx with id 447118354 completed
2017-05-24 08:48:41 7f81fbffe700  InnoDB: Rollback of non-prepared transactions completed
...
170524  8:48:41 [Note] Starting crash recovery...
2017-05-24 08:48:41 7f9d40b218c0  InnoDB: Starting recovery for XA transactions...
2017-05-24 08:48:41 7f9d40b218c0  InnoDB: Transaction 447118455 in prepared state after recovery
2017-05-24 08:48:41 7f9d40b218c0  InnoDB: Transaction contains changes to 3 rows
2017-05-24 08:48:41 7f9d40b218c0  InnoDB: 1 transactions in prepared state after recovery
170524  8:48:41 [Note] Found 1 prepared transaction(s) in InnoDB
170524  8:48:41 [Note] Crash recovery finished.

Notice that the 1831 rows to undo is quite close to the number of missing row events (which would bring the output total back in line with the input).

Comment by Andrei Elkin [ 2017-06-22 ]

Matt, thanks for uploading MDEV9670-m4-errorlog-large, a more informative piece of the log. It was already helpful. Yet do you have
records of what was happening before this assert is reported:

2017-05-24 08:45:38 7f09da418700  InnoDB: Assertion failure in thread 139680293160704 in file trx0rec.cc line 1242

?

The newer file, alas, starts from this line, so we can't see what might have caused the apparent premature commit of the 1st "part" of this split event group.

Ideally it would be great to see all reported incidents in the error log since the time the slave service was started, and in fact for all connections of the multi-source slave server, as we can't yet exclude interference between the different sources.

Comment by Matt Neth [ 2017-06-22 ]

That's all that was in the log from around that time. That was the first entry in the log for that day and for that crash. I went ahead and uploaded the full current errorlog to the FTP as MDEV9670-full-errorlog-M4.

Comment by Andrei Elkin [ 2017-06-23 ]

There is some promising news. I've discovered one way to split a group into two in the way we needed.
I had to control the SQL thread manually with gdb, though, to simulate a situation
in which the SQL thread reaches EOF while having a non-NULL value of rli->ign_master_log_pos_end[0].

I can't yet say whether this is actually possible in any real execution. But the slave binlog shows
the parts just perfectly.

The Input:

MariaDB [test]> show binlog events in 'master-bin.000001';
...
| master-bin.000001 |  327 | Gtid              |         1 |         365 | BEGIN GTID 0-1-2                           |
| master-bin.000001 |  365 | Table_map         |         1 |         408 | table_id: 70 (test.t2)                     |
| master-bin.000001 |  408 | Write_rows_v1     |         1 |        1468 | table_id: 70                               |
| master-bin.000001 | 1468 | Write_rows_v1     |         1 |        2528 | table_id: 70                               |
| master-bin.000001 | 2528 | Write_rows_v1     |         1 |        3588 | table_id: 70 flags: STMT_END_F             |
| master-bin.000001 | 3588 | Xid               |         1 |        3615 | COMMIT /* xid=17 */        

and the output in the slave binlog:

| mysqld-bin.000001 |  471 | Gtid              |         0 |         509 | BEGIN GTID 0-0-2                                                      |
| mysqld-bin.000001 |  509 | Table_map         |         1 |         552 | table_id: 70 (test.t2)                                                |
| mysqld-bin.000001 |  552 | Write_rows_v1     |         1 |        1612 | table_id: 70 flags: STMT_END_F                                        |
| mysqld-bin.000001 | 1612 | Query             |         0 |        1672 | COMMIT                                                                |
| mysqld-bin.000001 | 1672 | Gtid              |         1 |        1710 | BEGIN GTID 0-1-3                                                      |
| mysqld-bin.000001 | 1710 | Table_map         |         1 |        1753 | table_id: 70 (test.t2)                                                |
| mysqld-bin.000001 | 1753 | Write_rows_v1     |         1 |        2813 | table_id: 70                                                          |
| mysqld-bin.000001 | 2813 | Write_rows_v1     |         1 |        3873 | table_id: 70 flags: STMT_END_F                                        |
| mysqld-bin.000001 | 3873 | Xid               |         1 |        3900 | COMMIT /* xid=14 */         

Now I am looking into how to create a real execution.

Comment by Michael Widenius [ 2017-07-02 ]

The problem was that in a circular replication setup the master remembers
the position of events it has generated itself when reading from a slave.
If there are no new events in the queue from the slave, a
Gtid_list_log_event is generated to remember the last skipped event.
The problem happens if there is a network delay and we generate a
Gtid_list_log_event in the middle of a transaction, in which case there
will be an implicit commit, and a new transaction with server_id=0 will be
logged.

The fix was to not generate any Gtid_list_log_events in the middle of a
transaction.
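
The mechanism described above can be sketched as a guard in the IO thread's event-queuing loop: only record skipped own-events with a Gtid_list_log_event when no transaction is in progress. This is a simplified illustrative model, not the actual server code:

```python
# Simplified model of the bug and the fix. A circular-replication master
# skips events it originally generated itself; when the stream from the
# slave goes idle it records the skip with a Gtid_list marker. Before the
# fix, that marker could land in the middle of a transaction, implicitly
# committing the first half; the remainder was then logged as a separate
# transaction with server_id=0.

def queue_events(events, fixed=True):
    """events: list of 'BEGIN' | 'ROW' | 'COMMIT' | 'IDLE' markers.
    Returns the markers actually written out."""
    out = []
    in_transaction = False
    for ev in events:
        if ev == "BEGIN":
            in_transaction = True
            out.append(ev)
        elif ev == "COMMIT":
            in_transaction = False
            out.append(ev)
        elif ev == "IDLE":
            # Network delay: nothing queued from the slave right now.
            if fixed and in_transaction:
                continue  # the fix: never emit Gtid_list mid-transaction
            out.append("GTID_LIST")
        else:
            out.append(ev)
    return out

txn = ["BEGIN", "ROW", "IDLE", "ROW", "COMMIT"]
print(queue_events(txn, fixed=False))  # → ['BEGIN', 'ROW', 'GTID_LIST', 'ROW', 'COMMIT']
print(queue_events(txn, fixed=True))   # → ['BEGIN', 'ROW', 'ROW', 'COMMIT']
```

In the buggy run the marker splits the transaction in two, matching the split groups Andrei observed in the binlog; with the guard, the group stays intact.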

Comment by Michael Widenius [ 2017-07-02 ]

Fix pushed.
Code from Andrei Elkin.

Comment by Sachin Setiya (Inactive) [ 2018-07-17 ]

The test case fails if run with --repeat; it fails with this error:

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019                                                                                                                        
rpl.circular_serverid0 'innodb,row'      [ pass ]   3301                                                                                                                                      
rpl.circular_serverid0 'innodb,row'      [ 2 fail ]                                                                                                                                           
        Test ended at 2018-07-17 15:53:36                                                                                                                                                     
                                                                                                                                                                                              
CURRENT_TEST: rpl.circular_serverid0                                                                                                                                                          
--- /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.result    2018-07-17 14:26:54.744560346 +0530                                                                   
+++ /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.reject    2018-07-17 15:53:36.162384904 +0530                                                                   
@@ -14,11 +14,19 @@                                                                                                                                                                           
 INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024));                                                                                                                
 connection M2;                                                                                                                                                                               
 SET debug_sync='now WAIT_FOR start_sql_thread';                                                                                                                                              
+Warnings:                                                                                                                                                                                    
+Warning        1639    debug sync point wait timed out                                                                                                                                       
 START SLAVE SQL_THREAD;                                                                                                                                                                      
 include/show_binlog_events.inc                                                                                                                                                               
 Log_name       Pos     Event_type      Server_id       End_log_pos     Info                                                                                                                  
 slave-bin.000001       #       Gtid    #       #       GTID #-#-#                                                                                                                            
 slave-bin.000001       #       Query   #       #       use `test`; CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(30000)) ENGINE=innodb                                        
+slave-bin.000001       #       Gtid    #       #       BEGIN GTID #-#-#                                                                                                                      
+slave-bin.000001       #       Annotate_rows   #       #       INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024))                                                  
+slave-bin.000001       #       Table_map       #       #       table_id: # (test.t1)                                                                                                         
+slave-bin.000001       #       Write_rows_v1   #       #       table_id: #                                                                                                                   
+slave-bin.000001       #       Write_rows_v1   #       #       table_id: # flags: STMT_END_F                                                                                                 
+slave-bin.000001       #       Xid     #       #       COMMIT /* XID */                                                                                                                      
 SET debug_sync='now SIGNAL go_on_relay_logging';    

Comment by Sachin Setiya (Inactive) [ 2018-07-17 ]

--debug-sync-timeout=5 can be passed as an mtr argument to reproduce the failure quickly:

./mtr circular_serverid0  --debug-sync-timeout=5 --repeat 2

Generated at Thu Feb 08 07:36:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.