Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.0.22
Description
The same user who experienced MDEV-9670 has started experiencing similar issues, but in these new cases, server_id is not mysteriously set to 0 in the binlog event.
The latest error showed a bunch of out-of-order sequence number errors in a short period of time. Each time, the user attempted to skip the problem. After each of those, another out-of-order sequence number error occurred shortly after.
For example:
160417 10:33:07 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Gtid 35-7735-937125, Internal MariaDB error code: 1950
|
160417 10:33:07 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Error_code: 1950
|
160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
|
160417 10:33:07 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
|
TRANSACTION 1521353596, not started
|
mysql tables in use 1, locked 4
|
MySQL thread id 329862, OS thread handle 0x7efc5a4da700, query id 137799390 Waiting for slave mutex on exit
|
...
|
160417 10:44:43 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95365805', GTID '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'; and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='48656', master_log_name='mariadb-bin.007022', master_log_pos='95413434', GTID '21-7521-7596336,0-7725-126,35-7735-937125,11-7511-6574751,17-7725-102939928,15-7715-29236017';
|
160417 10:44:43 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Gtid 35-7735-937137, Internal MariaDB error code: 1950
|
160417 10:44:43 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Error_code: 1950
|
160417 10:44:43 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95860806; GTID position '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019'
|
160417 10:44:43 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
|
TRANSACTION 1521379111, not started
|
mysql tables in use 1, locked 4
|
MySQL thread id 348591, OS thread handle 0x7efc5a4da700, query id 138021377 Waiting for slave mutex on exit
|
...
|
160417 10:44:57 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95860806', GTID '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019'; and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='23166', master_log_name='mariadb-bin.007022', master_log_pos='95882945', GTID '21-7521-7596336,0-7725-126,35-7735-937137,11-7511-6574751,17-7725-102939930,15-7715-29236019';
|
160417 10:44:57 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Gtid 35-7735-937151, Internal MariaDB error code: 1950
|
160417 10:44:57 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Error_code: 1950
|
160417 10:44:57 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 96808225; GTID position '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023'
|
160417 10:44:57 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
|
TRANSACTION 1521379598, not started
|
mysql tables in use 1, locked 4
|
MySQL thread id 348606, OS thread handle 0x7efc5a4da700, query id 138026480 Waiting for slave mutex on exit
|
...
|
160417 10:45:21 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='96808225', GTID '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023'; and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='277601', master_log_name='mariadb-bin.007022', master_log_pos='97085707', GTID '21-7521-7596336,0-7725-126,35-7735-937151,11-7511-6574751,17-7725-102939940,15-7715-29236023';
|
160417 10:45:21 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Gtid 35-7735-937161, Internal MariaDB error code: 1950
|
160417 10:45:21 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Error_code: 1950
|
160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
|
160417 10:45:21 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 6.
|
TRANSACTION 1521380481, not started
|
mysql tables in use 1, locked 6
|
MySQL thread id 348640, OS thread handle 0x7efc5a4da700, query id 138034243 Waiting for slave mutex on exit
|
I've noticed some strange things about the latest failure at 10:45:21:
1.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the slave is stopped at master_log_name=''mariadb-bin.007022' master_log_pos='97244987', GTID '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
|
The strange thing is that GTID '35-7735-937161' is not present in the binary log until much later: master_log_name='mariadb-bin.007022' master_log_pos='97246276'
# at 97246276
|
#160417 10:33:08 server id 7735 end_log_pos 97246318 CRC32 0x0e07bf7f GTID 35-7735-937161
|
/*!100001 SET @@session.gtid_domain_id=35*//*!*/;
|
/*!100001 SET @@session.server_id=7735*//*!*/;
|
/*!100001 SET @@session.gtid_seq_no=937161*//*!*/;
|
If the slave supposedly failed executing an event at master_log_name='mariadb-bin.007022' master_log_pos='97246276', why would it stop all the way back at master_log_name='mariadb-bin.007022' master_log_pos='97244987'?
2.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'. Notice that the sequence number for gtid_domain_id 35 is 937160, which is less than the out-of-order sequence number 937161.
If the highest sequence number for the domain was only 937160 at the time, why would a sequence number of 937161 cause an out-of-order sequence number?
These items also make me wonder if these strange observations also hold for the first failure that happened at 10:33, so I backtracked in the binlogs to look into it.
1.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', it stops at position master_log_name='mariadb-bin.007022', master_log_pos='95365805'
160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
|
This position does actually seem to be correct, since the GTID '35-7735-937125' is at that position:
# at 95365805
|
#160417 10:33:07 server id 7735 end_log_pos 95365847 CRC32 0x021ffb43 GTID 35-7735-937125
|
/*!100001 SET @@session.gtid_seq_no=937125*//*!*/;
|
BEGIN
|
/*!*/;
|
2.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'. Notice here that the sequence number for gtid_domain_id 35 is only 937124, so this out-of-order sequence error also seems to be unusual. Additionally, I can see that the transaction with the sequence number 937124 is directly preceding the failing one in the binlog:
# at 95327974
|
#160417 10:33:07 server id 7735 end_log_pos 95328016 CRC32 0x60f374c6 GTID 35-7735-937124
|
/*!100001 SET @@session.gtid_seq_no=937124*//*!*/;
|
BEGIN
|
/*!*/;
|
Let's see if observation #1 above holds for the other two failures at 10:44:43 and 10:44:57.
The failure at 10:44:43 involved GTID '35-7735-937137', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='95860806'.
This position also seems to be correct, since the GTID '35-7735-937137' is at the position:
# at 95860806
|
#160417 10:33:07 server id 7735 end_log_pos 95860848 CRC32 0x5a30278a GTID 35-7735-937137
|
/*!100001 SET @@session.gtid_seq_no=937137*//*!*/;
|
BEGIN
|
/*!*/;
|
The failure at 10:44:57 involved GTID '35-7735-937151', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='96808225'.
This GTID doesn't appear until position 96809133, a bit later than position 96808225.
# at 96809133
|
#160417 10:33:08 server id 7735 end_log_pos 96809175 CRC32 0xe44d2cfa GTID 35-7735-937151
|
/*!100001 SET @@session.gtid_domain_id=35*//*!*/;
|
/*!100001 SET @@session.server_id=7735*//*!*/;
|
/*!100001 SET @@session.gtid_seq_no=937151*//*!*/;
|
BEGIN
|
/*!*/;
|
These observations make me wonder if the SQL thread's position was corrupted after the 10:44:57 failure. I know of at least one open bug (MDEV-9138) that could cause problems with the SQL thread position, but it does not seem similar to this problem.
The user does not have parallel replication enabled.
Some other observations:
- Sometimes when this problem occurs, the mysqld process has to be killed because the slave thread, which is killed when the out-of-order sequence error occurs and no longer exists, still seems to have table locks and/or a slave mutex.
- Other times when this problem occurs, the event can simply be skipped and the slave thread will happily restart.
Attachments
Issue Links
- duplicates
-
MDEV-9670 server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break
- Closed
- relates to
-
MDEV-10067 Replication thread sporadically hanging on 'Table lock' state
- Open
-
MDEV-9670 server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break
- Closed