[MDEV-9670] server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.22
Fix Version/s: 10.0.32
Component/s: Replication
Labels:

Sprint:
10.2.7-1

Description

A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms, and in order to fix the problems, the server process to be killed.

The user has the following replication topology with 8 servers total in 2 data centers (4 nodes each DC). Node1 in each data center are masters that replicate from each other using circular replication. Nodes 2,3 and 7 in each data center are slaves with read-only enabled.

The gtid_domain_id and server_ids values are the following:

DC1:
Gtid domain: 15
Node1 server_id: 7715
Node2 server_id: 7716
Node3 server_id: 7717
Node7 server_id: 7718

DC2:
Gtid domain: 17
Node1 server_id: 7725
Node2 server_id: 7726
Node3 server_id: 7727
Node7 server_id: 7728

Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:

# at 73459651

#160301 14:56:32 server id 0  end_log_pos 73459689      GTID 15-0-6377149

/*!100001 SET @@session.gtid_domain_id=15*//*!*/;

/*!100001 SET @@session.server_id=0*//*!*/;

/*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;

BEGIN

/*!*/;

# at 73459689

# at 73459783

# at 73460738

# at 73461693

# at 73462648

# at 73463603

# at 73464558

# at 73465513

# at 73466468

# at 73467423

# at 73468378

# at 73469333

# at 73470288

# at 73471243

# at 73472198

# at 73473153

# at 73474108

This caused replication to break:

MariaDB [(none)]> show all slaves status \G;

*************************** 1. row ***************************

              Connection_name: dc2master

              Slave_SQL_State:

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.55.20.41

                  Master_User: replicator

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.007285

          Read_Master_Log_Pos: 74024310

               Relay_Log_File: relay-bin-dc2master.000026

                Relay_Log_Pos: 113584

        Relay_Master_Log_File: mariadb-bin.007285

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB: db1,db2

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1950

                   Last_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 73459651

              Relay_Log_Space: 354309

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1950

               Last_SQL_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 7725

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377150,17-7725-2432887

         Retried_transactions: 0

           Max_relay_log_size: 104857600

         Executed_log_entries: 14671605

    Slave_received_heartbeats: 22207

       Slave_heartbeat_period: 15.000

               Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377677,11-7511-6574751,17-7725-2432353

1 row in set (0.00 sec)

Some things to note here:

The server_id is somehow set to 0, but the application is not setting the variable.
The GTID for this transaction has a gtid_domain_id of 15, which means that it should have originated on DC1-Node1. In that case, it should have server_id set to 7715.
DC1-Node1 has gtid_ignore_duplicates enabled
DC1-Node1 shouldn't try to re-run its own transaction when gtid_ignore_duplicates is enabled. It only seems to try to do so because the transaction's server_id is incorrect.

The other events within the transaction do have the proper server_id set:

#160301 14:56:32 server id 7715  end_log_pos 73459783   Table_map: `db3`.`tab1` mapped to number 108

#160301 14:56:32 server id 7715  end_log_pos 73460738   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73461693   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73462648   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73463603   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73464558   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73465513   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73466468   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73467423   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73468378   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73469333   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73470288   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73471243   Write_rows: table id 108

#160301 14:56:32 server id 7715  end_log_pos 73472198   Write_rows: table id 108

When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:

# at 98082295

#160301 14:56:33 server id 7715  end_log_pos 98082333   GTID 15-7715-6377149

/*!100001 SET @@session.gtid_domain_id=15*//*!*/;

/*!100001 SET @@session.server_id=7715*//*!*/;

/*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;

BEGIN

/*!*/;

# at 98082333

# at 98082427

# at 98083382

# at 98084337

# at 98085292

# at 98086247

# at 98087202

# at 98088157

# at 98089112

# at 98090067

# at 98091022

# at 98091977

# at 98092932

# at 98093887

# at 98094842

# at 98095797

# at 98096752

# at 98097707

# at 98098662

# at 98099617

# at 98100572

# at 98101527

# at 98102482

# at 98103437

# at 98104392

# at 98105347

The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave.

Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node2 ran into an out-of-order sequence GTID error for a different transaction:

MariaDB [(none)]> show all slaves status \G;

*************************** 1. row ***************************

              Connection_name: dc1master

              Slave_SQL_State:

               Slave_IO_State: Waiting for master to send event

                  Master_Host: 10.35.20.41

                  Master_User: replicator

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.006328

          Read_Master_Log_Pos: 44256878

               Relay_Log_File: relay-bin-dc1master.000026

                Relay_Log_Pos: 98265194

        Relay_Master_Log_File: mariadb-bin.006326

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

              Replicate_Do_DB:

          Replicate_Ignore_DB: db1,db2

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 1950

                   Last_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 98406481

              Relay_Log_Space: 254089255

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 1950

               Last_SQL_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 7715

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: Slave_Pos

                  Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377636,17-7725-2432353

         Retried_transactions: 0

           Max_relay_log_size: 104857600

         Executed_log_entries: 1371851

    Slave_received_heartbeats: 1276

       Slave_heartbeat_period: 15.000

               Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432837

1 row in set (0.00 sec)

This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:

# at 98406481

#160301 14:56:33 server id 7715  end_log_pos 98406519   GTID 15-7715-6377150

/*!100001 SET @@session.gtid_seq_no=6377150*//*!*/;

BEGIN

/*!*/;

# at 98406519

# at 98406613

# at 98407568

# at 98408523

# at 98409478

# at 98410433

# at 98411388

# at 98412343

# at 98413298

# at 98414253

# at 98415208

# at 98416163

# at 98417118

# at 98418073

# at 98419028

# at 98419983

# at 98420938

# at 98421893

# at 98422848

# at 98423803

# at 98424758

Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:

160301 19:56:33 [ERROR] Master 'dc1master': Slave SQL: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Gtid 15-7715-6377150, Internal MariaDB error code: 1950

160301 19:56:33 [Warning] Master 'dc1master': Slave: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Error_code: 1950

160301 19:56:33 [ERROR] Master 'dc1master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.006326' position 98406481; GTID position '21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432353'

160301 19:56:33 [ERROR] Master 'dc1master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 7.

TRANSACTION 1046771904, not started

mysql tables in use 1, locked 7

MySQL thread id 4, OS thread handle 0x7fee9ea4f700, query id 20972814 Waiting for slave mutex on exit

 len 1280; hex XXXXXXXXXXXXXXX; asc  \n      !                      ('                      + ^                                                                                                              [                        Z                                                                                      '                                       xd>       V     xd>                            MySQLXid-         @                                                                                                             J % /                           b e                                                     h i     h                                                a                                                               !                        A     p$              7 ^                                                                                     H                                                                                                                                                                                                                                                                                                                                                                                                                                                       ;

After this, we noticed the following issues:

The broken slave thread held onto a slave mutex, which meant the slave couldn't be stopped or killed.
The broken slave thread had locks on some tables, which kept other transactions from using those tables.
The server wouldn't even cleanly shut down, so the server process was killed.

The user is not using parallel replication.

The user is using multi-source replication.

Attachments

Issue Links

is duplicated by

MDEV-9952 Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption

Closed

relates to

MDEV-9952 Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption

Closed

MDEV-10067 Replication thread sporadically hanging on 'Table lock' state

Open

MDEV-33268 IO Thread Can Write Gtid_list_log_event Mid-transaction into Relay Log

Open

Activity

Ascending order - Click to sort in descending order

View 13 older comments

Andrei Elkin added a comment - 2017-06-23 19:11

There are some promising news. I've discovered one way to split a group into two a way we needed.
I had to control the SQL thread manually with gdb though to simulate a situation
as if the SQL thread reaches EOF having a not-NULL value of rli->ign_master_log_pos_end[0].

I can't clarify it yet whether this is actually possible in any real execution. But the slave binlog shows
the parts just perfectly.

The Input:

MariaDB [test]> show binlog events in 'master-bin.000001';

...

| master-bin.000001 |  327 | Gtid              |         1 |         365 | BEGIN GTID 0-1-2                           |

| master-bin.000001 |  365 | Table_map         |         1 |         408 | table_id: 70 (test.t2)                     |

| master-bin.000001 |  408 | Write_rows_v1     |         1 |        1468 | table_id: 70                               |

| master-bin.000001 | 1468 | Write_rows_v1     |         1 |        2528 | table_id: 70                               |

| master-bin.000001 | 2528 | Write_rows_v1     |         1 |        3588 | table_id: 70 flags: STMT_END_F             |

| master-bin.000001 | 3588 | Xid               |         1 |        3615 | COMMIT /* xid=17 */

and the output in the slave binlog:

| mysqld-bin.000001 |  471 | Gtid              |         0 |         509 | BEGIN GTID 0-0-2                                                      |

| mysqld-bin.000001 |  509 | Table_map         |         1 |         552 | table_id: 70 (test.t2)                                                |

| mysqld-bin.000001 |  552 | Write_rows_v1     |         1 |        1612 | table_id: 70 flags: STMT_END_F                                        |

| mysqld-bin.000001 | 1612 | Query             |         0 |        1672 | COMMIT                                                                |

| mysqld-bin.000001 | 1672 | Gtid              |         1 |        1710 | BEGIN GTID 0-1-3                                                      |

| mysqld-bin.000001 | 1710 | Table_map         |         1 |        1753 | table_id: 70 (test.t2)                                                |

| mysqld-bin.000001 | 1753 | Write_rows_v1     |         1 |        2813 | table_id: 70                                                          |

| mysqld-bin.000001 | 2813 | Write_rows_v1     |         1 |        3873 | table_id: 70 flags: STMT_END_F                                        |

| mysqld-bin.000001 | 3873 | Xid               |         1 |        3900 | COMMIT /* xid=14 */

Now I am looking into how to create a real execution.

Andrei Elkin added a comment - 2017-06-23 19:11 There are some promising news. I've discovered one way to split a group into two a way we needed. I had to control the SQL thread manually with gdb though to simulate a situation as if the SQL thread reaches EOF having a not-NULL value of rli->ign_master_log_pos_end [0] . I can't clarify it yet whether this is actually possible in any real execution. But the slave binlog shows the parts just perfectly. The Input: MariaDB [test]> show binlog events in 'master-bin.000001'; ... | master-bin.000001 | 327 | Gtid | 1 | 365 | BEGIN GTID 0-1-2 | | master-bin.000001 | 365 | Table_map | 1 | 408 | table_id: 70 (test.t2) | | master-bin.000001 | 408 | Write_rows_v1 | 1 | 1468 | table_id: 70 | | master-bin.000001 | 1468 | Write_rows_v1 | 1 | 2528 | table_id: 70 | | master-bin.000001 | 2528 | Write_rows_v1 | 1 | 3588 | table_id: 70 flags: STMT_END_F | | master-bin.000001 | 3588 | Xid | 1 | 3615 | COMMIT /* xid=17 */ and the output in the slave binlog: | mysqld-bin.000001 | 471 | Gtid | 0 | 509 | BEGIN GTID 0-0-2 | | mysqld-bin.000001 | 509 | Table_map | 1 | 552 | table_id: 70 (test.t2) | | mysqld-bin.000001 | 552 | Write_rows_v1 | 1 | 1612 | table_id: 70 flags: STMT_END_F | | mysqld-bin.000001 | 1612 | Query | 0 | 1672 | COMMIT | | mysqld-bin.000001 | 1672 | Gtid | 1 | 1710 | BEGIN GTID 0-1-3 | | mysqld-bin.000001 | 1710 | Table_map | 1 | 1753 | table_id: 70 (test.t2) | | mysqld-bin.000001 | 1753 | Write_rows_v1 | 1 | 2813 | table_id: 70 | | mysqld-bin.000001 | 2813 | Write_rows_v1 | 1 | 3873 | table_id: 70 flags: STMT_END_F | | mysqld-bin.000001 | 3873 | Xid | 1 | 3900 | COMMIT /* xid=14 */ Now I am looking into how to create a real execution.

Michael Widenius added a comment - 2017-07-02 15:56

Problem was that in a circular replication setup the master remembers
position to events it has generated itself when reading from a slave.
If there are no new events in the queue from the slave, a
Gtid_list_log_event is generated to remember the last skipped event.
The problem happens if there is a network delay and we generate a
Gtid_list_log_event in the middle of the transaction, in which case there
will be an implicit comment and a new transaction with serverid=0 will be
logged.

The fix was to not generate any Gtid_list_log_events in the middle of a
transaction.

Michael Widenius added a comment - 2017-07-02 15:56 Problem was that in a circular replication setup the master remembers position to events it has generated itself when reading from a slave. If there are no new events in the queue from the slave, a Gtid_list_log_event is generated to remember the last skipped event. The problem happens if there is a network delay and we generate a Gtid_list_log_event in the middle of the transaction, in which case there will be an implicit comment and a new transaction with serverid=0 will be logged. The fix was to not generate any Gtid_list_log_events in the middle of a transaction.

Michael Widenius added a comment - 2017-07-02 16:48

Fix pushed.
Code from Andrei Elkin.

Michael Widenius added a comment - 2017-07-02 16:48 Fix pushed. Code from Andrei Elkin.

Sachin Setiya (Inactive) added a comment - 2018-07-17 10:40

Test case fails if used with repeat , It fails with this error

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019

rpl.circular_serverid0 'innodb,row'      [ pass ]   3301

rpl.circular_serverid0 'innodb,row'      [ 2 fail ]

        Test ended at 2018-07-17 15:53:36

CURRENT_TEST: rpl.circular_serverid0

--- /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.result    2018-07-17 14:26:54.744560346 +0530

+++ /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.reject    2018-07-17 15:53:36.162384904 +0530

@@ -14,11 +14,19 @@

 INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024));

 connection M2;

 SET debug_sync='now WAIT_FOR start_sql_thread';

+Warnings:

+Warning        1639    debug sync point wait timed out

 START SLAVE SQL_THREAD;

 include/show_binlog_events.inc

 Log_name       Pos     Event_type      Server_id       End_log_pos     Info

 slave-bin.000001       #       Gtid    #       #       GTID #-#-#

 slave-bin.000001       #       Query   #       #       use `test`; CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(30000)) ENGINE=innodb

+slave-bin.000001       #       Gtid    #       #       BEGIN GTID #-#-#

+slave-bin.000001       #       Annotate_rows   #       #       INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024))

+slave-bin.000001       #       Table_map       #       #       table_id: # (test.t1)

+slave-bin.000001       #       Write_rows_v1   #       #       table_id: #

+slave-bin.000001       #       Write_rows_v1   #       #       table_id: # flags: STMT_END_F

+slave-bin.000001       #       Xid     #       #       COMMIT /* XID */

 SET debug_sync='now SIGNAL go_on_relay_logging';

Sachin Setiya (Inactive) added a comment - 2018-07-17 10:40 Test case fails if used with repeat , It fails with this error worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 rpl.circular_serverid0 'innodb,row' [ pass ] 3301 rpl.circular_serverid0 'innodb,row' [ 2 fail ] Test ended at 2018-07-17 15:53:36 CURRENT_TEST: rpl.circular_serverid0 --- /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.result 2018-07-17 14:26:54.744560346 +0530 +++ /home/sachin/replication/server/mysql-test/suite/rpl/r/circular_serverid0.reject 2018-07-17 15:53:36.162384904 +0530 @@ -14,11 +14,19 @@ INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024)); connection M2; SET debug_sync='now WAIT_FOR start_sql_thread'; +Warnings: +Warning 1639 debug sync point wait timed out START SLAVE SQL_THREAD; include/show_binlog_events.inc Log_name Pos Event_type Server_id End_log_pos Info slave-bin.000001 # Gtid # # GTID #-#-# slave-bin.000001 # Query # # use `test`; CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b VARCHAR(30000)) ENGINE=innodb +slave-bin.000001 # Gtid # # BEGIN GTID #-#-# +slave-bin.000001 # Annotate_rows # # INSERT INTO `t1` VALUES (null, repeat('a', 1024)), (null, repeat('b', 1024)) +slave-bin.000001 # Table_map # # table_id: # (test.t1) +slave-bin.000001 # Write_rows_v1 # # table_id: # +slave-bin.000001 # Write_rows_v1 # # table_id: # flags: STMT_END_F +slave-bin.000001 # Xid # # COMMIT /* XID */ SET debug_sync='now SIGNAL go_on_relay_logging';

Sachin Setiya (Inactive) added a comment - 2018-07-17 10:41

--debug-sync-timeout=5 can be used in mtr argument to generate the failure early

./mtr circular_serverid0  --debug-sync-timeout=5 --repeat 2

Sachin Setiya (Inactive) added a comment - 2018-07-17 10:41 --debug-sync-timeout=5 can be used in mtr argument to generate the failure early ./mtr circular_serverid0 --debug-sync-timeout=5 --repeat 2

People

Assignee:: Michael Widenius

Reporter:: Geoff Montee (Inactive)

Votes:: 2 Vote for this issue

Watchers:: 14 Start watching this issue

Dates

Created:: 2016-03-02 00:30

Updated:: 2024-07-08 01:11

Resolved:: 2017-07-02 16:48

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server