Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.0.22
-
10.2.7-1
Description
A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms, and in order to fix the problems, the server process to be killed.
The user has the following replication topology with 8 servers total in 2 data centers (4 nodes each DC). Node1 in each data center are masters that replicate from each other using circular replication. Nodes 2,3 and 7 in each data center are slaves with read-only enabled.
The gtid_domain_id and server_ids values are the following:
DC1:
Gtid domain: 15
Node1 server_id: 7715
Node2 server_id: 7716
Node3 server_id: 7717
Node7 server_id: 7718
DC2:
Gtid domain: 17
Node1 server_id: 7725
Node2 server_id: 7726
Node3 server_id: 7727
Node7 server_id: 7728
Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:
# at 73459651
|
#160301 14:56:32 server id 0 end_log_pos 73459689 GTID 15-0-6377149
|
/*!100001 SET @@session.gtid_domain_id=15*//*!*/;
|
/*!100001 SET @@session.server_id=0*//*!*/;
|
/*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 73459689
|
# at 73459783
|
# at 73460738
|
# at 73461693
|
# at 73462648
|
# at 73463603
|
# at 73464558
|
# at 73465513
|
# at 73466468
|
# at 73467423
|
# at 73468378
|
# at 73469333
|
# at 73470288
|
# at 73471243
|
# at 73472198
|
# at 73473153
|
# at 73474108
|
This caused replication to break:
MariaDB [(none)]> show all slaves status \G;
|
*************************** 1. row ***************************
|
Connection_name: dc2master
|
Slave_SQL_State:
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 10.55.20.41
|
Master_User: replicator
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mariadb-bin.007285
|
Read_Master_Log_Pos: 74024310
|
Relay_Log_File: relay-bin-dc2master.000026
|
Relay_Log_Pos: 113584
|
Relay_Master_Log_File: mariadb-bin.007285
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: No
|
Replicate_Do_DB:
|
Replicate_Ignore_DB: db1,db2
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 1950
|
Last_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 73459651
|
Relay_Log_Space: 354309
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: NULL
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 1950
|
Last_SQL_Error: An attempt was made to binlog GTID 15-0-6377149 which would create an out-of-ord sequence number with existing GTID 15-7715-6377149, and gtid strict mode is enabled.
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 7725
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Slave_Pos
|
Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377150,17-7725-2432887
|
Retried_transactions: 0
|
Max_relay_log_size: 104857600
|
Executed_log_entries: 14671605
|
Slave_received_heartbeats: 22207
|
Slave_heartbeat_period: 15.000
|
Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377677,11-7511-6574751,17-7725-2432353
|
1 row in set (0.00 sec)
|
Some things to note here:
- The server_id is somehow set to 0, but the application is not setting the variable.
- The GTID for this transaction has a gtid_domain_id of 15, which means that it should have originated on DC1-Node1. In that case, it should have server_id set to 7715.
- DC1-Node1 has gtid_ignore_duplicates enabled
- DC1-Node1 shouldn't try to re-run its own transaction when gtid_ignore_duplicates is enabled. It only seems to try to do so because the transaction's server_id is incorrect.
The other events within the transaction do have the proper server_id set:
#160301 14:56:32 server id 7715 end_log_pos 73459783 Table_map: `db3`.`tab1` mapped to number 108
|
#160301 14:56:32 server id 7715 end_log_pos 73460738 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73461693 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73462648 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73463603 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73464558 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73465513 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73466468 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73467423 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73468378 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73469333 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73470288 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73471243 Write_rows: table id 108
|
#160301 14:56:32 server id 7715 end_log_pos 73472198 Write_rows: table id 108
|
When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:
# at 98082295
|
#160301 14:56:33 server id 7715 end_log_pos 98082333 GTID 15-7715-6377149
|
/*!100001 SET @@session.gtid_domain_id=15*//*!*/;
|
/*!100001 SET @@session.server_id=7715*//*!*/;
|
/*!100001 SET @@session.gtid_seq_no=6377149*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 98082333
|
# at 98082427
|
# at 98083382
|
# at 98084337
|
# at 98085292
|
# at 98086247
|
# at 98087202
|
# at 98088157
|
# at 98089112
|
# at 98090067
|
# at 98091022
|
# at 98091977
|
# at 98092932
|
# at 98093887
|
# at 98094842
|
# at 98095797
|
# at 98096752
|
# at 98097707
|
# at 98098662
|
# at 98099617
|
# at 98100572
|
# at 98101527
|
# at 98102482
|
# at 98103437
|
# at 98104392
|
# at 98105347
|
The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave.
Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node2 ran into an out-of-order sequence GTID error for a different transaction:
MariaDB [(none)]> show all slaves status \G;
|
*************************** 1. row ***************************
|
Connection_name: dc1master
|
Slave_SQL_State:
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 10.35.20.41
|
Master_User: replicator
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mariadb-bin.006328
|
Read_Master_Log_Pos: 44256878
|
Relay_Log_File: relay-bin-dc1master.000026
|
Relay_Log_Pos: 98265194
|
Relay_Master_Log_File: mariadb-bin.006326
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: No
|
Replicate_Do_DB:
|
Replicate_Ignore_DB: db1,db2
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 1950
|
Last_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 98406481
|
Relay_Log_Space: 254089255
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: NULL
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 1950
|
Last_SQL_Error: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled.
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 7715
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Slave_Pos
|
Gtid_IO_Pos: 21-7521-7596336,0-7725-126,11-7511-6574751,15-7715-6377636,17-7725-2432353
|
Retried_transactions: 0
|
Max_relay_log_size: 104857600
|
Executed_log_entries: 1371851
|
Slave_received_heartbeats: 1276
|
Slave_heartbeat_period: 15.000
|
Gtid_Slave_Pos: 21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432837
|
1 row in set (0.00 sec)
|
This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:
# at 98406481
|
#160301 14:56:33 server id 7715 end_log_pos 98406519 GTID 15-7715-6377150
|
/*!100001 SET @@session.gtid_seq_no=6377150*//*!*/;
|
BEGIN
|
/*!*/;
|
# at 98406519
|
# at 98406613
|
# at 98407568
|
# at 98408523
|
# at 98409478
|
# at 98410433
|
# at 98411388
|
# at 98412343
|
# at 98413298
|
# at 98414253
|
# at 98415208
|
# at 98416163
|
# at 98417118
|
# at 98418073
|
# at 98419028
|
# at 98419983
|
# at 98420938
|
# at 98421893
|
# at 98422848
|
# at 98423803
|
# at 98424758
|
Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:
160301 19:56:33 [ERROR] Master 'dc1master': Slave SQL: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Gtid 15-7715-6377150, Internal MariaDB error code: 1950
|
160301 19:56:33 [Warning] Master 'dc1master': Slave: An attempt was made to binlog GTID 15-7715-6377150 which would create an out-of-order sequence number with existing GTID 15-7715-6377150, and gtid strict mode is enabled. Error_code: 1950
|
160301 19:56:33 [ERROR] Master 'dc1master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.006326' position 98406481; GTID position '21-7521-7596336,0-7725-126,15-7715-6377149,11-7511-6574751,17-7725-2432353'
|
160301 19:56:33 [ERROR] Master 'dc1master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 7.
|
TRANSACTION 1046771904, not started
|
mysql tables in use 1, locked 7
|
MySQL thread id 4, OS thread handle 0x7fee9ea4f700, query id 20972814 Waiting for slave mutex on exit
|
len 1280; hex XXXXXXXXXXXXXXX; asc \n ! (' + ^ [ Z ' xd> V xd> MySQLXid- @ J % / b e h i h a ! A p$ 7 ^ H ;
|
After this, we noticed the following issues:
- The broken slave thread held onto a slave mutex, which meant the slave couldn't be stopped or killed.
- The broken slave thread had locks on some tables, which kept other transactions from using those tables.
- The server wouldn't even cleanly shut down, so the server process was killed.
The user is not using parallel replication.
The user is using multi-source replication.
Attachments
Issue Links
- is duplicated by
-
MDEV-9952 Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption
- Closed
- relates to
-
MDEV-9952 Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption
- Closed
-
MDEV-10067 Replication thread sporadically hanging on 'Table lock' state
- Open
-
MDEV-33268 IO Thread Can Write Gtid_list_log_event Mid-transaction into Relay Log
- Open