[MDEV-9952] Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption Created: 2016-04-19  Updated: 2018-07-14  Resolved: 2018-07-14

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.22
Fix Version/s: 10.0.32

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Andrei Elkin
Resolution: Fixed Votes: 4
Labels: gtid, replication

Attachments: HTML File 5-17-crash-errorlog     HTML File 5-17-crash-gdb-bt     Text File gdb-bt-during-slave-status-lock.txt    
Issue Links:
Duplicate
duplicates MDEV-9670 server_id mysteriously set to 0 in bi... Closed
Relates
relates to MDEV-10067 Replication thread sporadically hangi... Open
relates to MDEV-9670 server_id mysteriously set to 0 in bi... Closed

 Description   

The same user who experienced MDEV-9670 has started experiencing similar issues, but in these new cases, server_id is not mysteriously set to 0 in the binlog event.

The latest incident produced a bunch of out-of-order sequence number errors in a short period of time. Each time, the user attempted to skip the problematic event, and shortly afterwards another out-of-order sequence number error occurred.

For example:

160417 10:33:07 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Gtid 35-7735-937125, Internal MariaDB error code: 1950
160417 10:33:07 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Error_code: 1950
160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
160417 10:33:07 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
TRANSACTION 1521353596, not started
mysql tables in use 1, locked 4
MySQL thread id 329862, OS thread handle 0x7efc5a4da700, query id 137799390 Waiting for slave mutex on exit
...
160417 10:44:43 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95365805', GTID '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='48656', master_log_name='mariadb-bin.007022', master_log_pos='95413434', GTID '21-7521-7596336,0-7725-126,35-7735-937125,11-7511-6574751,17-7725-102939928,15-7715-29236017';
160417 10:44:43 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Gtid 35-7735-937137, Internal MariaDB error code: 1950
160417 10:44:43 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Error_code: 1950
160417 10:44:43 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95860806; GTID position '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019'
160417 10:44:43 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
TRANSACTION 1521379111, not started
mysql tables in use 1, locked 4
MySQL thread id 348591, OS thread handle 0x7efc5a4da700, query id 138021377 Waiting for slave mutex on exit
...
160417 10:44:57 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95860806', GTID '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='23166', master_log_name='mariadb-bin.007022', master_log_pos='95882945', GTID '21-7521-7596336,0-7725-126,35-7735-937137,11-7511-6574751,17-7725-102939930,15-7715-29236019';
160417 10:44:57 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Gtid 35-7735-937151, Internal MariaDB error code: 1950
160417 10:44:57 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Error_code: 1950
160417 10:44:57 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 96808225; GTID position '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023'
160417 10:44:57 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
TRANSACTION 1521379598, not started
mysql tables in use 1, locked 4
MySQL thread id 348606, OS thread handle 0x7efc5a4da700, query id 138026480 Waiting for slave mutex on exit
...
160417 10:45:21 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='96808225', GTID '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='277601', master_log_name='mariadb-bin.007022', master_log_pos='97085707', GTID '21-7521-7596336,0-7725-126,35-7735-937151,11-7511-6574751,17-7725-102939940,15-7715-29236023';
160417 10:45:21 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Gtid 35-7735-937161, Internal MariaDB error code: 1950
160417 10:45:21 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Error_code: 1950
160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
160417 10:45:21 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 6.
TRANSACTION 1521380481, not started
mysql tables in use 1, locked 6
MySQL thread id 348640, OS thread handle 0x7efc5a4da700, query id 138034243 Waiting for slave mutex on exit

I've noticed some strange things about the latest failure at 10:45:21:

1.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the slave is stopped at master_log_name='mariadb-bin.007022', master_log_pos='97244987', GTID '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'

160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'

The strange thing is that GTID '35-7735-937161' is not present in the binary log until much later: master_log_name='mariadb-bin.007022' master_log_pos='97246276'

# at 97246276
#160417 10:33:08 server id 7735  end_log_pos 97246318 CRC32 0x0e07bf7f  GTID 35-7735-937161
/*!100001 SET @@session.gtid_domain_id=35*//*!*/;
/*!100001 SET @@session.server_id=7735*//*!*/;
/*!100001 SET @@session.gtid_seq_no=937161*//*!*/;

If the slave supposedly failed executing an event at master_log_name='mariadb-bin.007022' master_log_pos='97246276', why would it stop all the way back at master_log_name='mariadb-bin.007022' master_log_pos='97244987'?

2.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'. Notice that the sequence number for gtid_domain_id 35 is 937160, which is less than the out-of-order sequence number 937161.

If the highest sequence number for the domain was only 937160 at the time, why would a sequence number of 937161 cause an out-of-order sequence number?
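
For reference, which GTID actually sits at a given binlog position can be double-checked directly on the master that owns mariadb-bin.007022; a sketch using the file name and stop position from the log above (an alternative to the mysqlbinlog dumps shown in this description):

SHOW BINLOG EVENTS IN 'mariadb-bin.007022' FROM 97244987 LIMIT 10;
-- The first Gtid event in the output shows which sequence number the
-- event at (or just after) the reported stop position carries.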

These items made me wonder whether the same strange observations hold for the first failure that happened at 10:33, so I backtracked in the binlogs to look into it.

1.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', it stops at position master_log_name='mariadb-bin.007022', master_log_pos='95365805'

160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'

This position does actually seem to be correct, since the GTID '35-7735-937125' is at that position:

# at 95365805
#160417 10:33:07 server id 7735  end_log_pos 95365847 CRC32 0x021ffb43  GTID 35-7735-937125
/*!100001 SET @@session.gtid_seq_no=937125*//*!*/;
BEGIN
/*!*/;

2.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'. Notice here that the sequence number for gtid_domain_id 35 is only 937124, so this out-of-order sequence error also seems to be unusual. Additionally, I can see that the transaction with the sequence number 937124 is directly preceding the failing one in the binlog:

# at 95327974
#160417 10:33:07 server id 7735  end_log_pos 95328016 CRC32 0x60f374c6  GTID 35-7735-937124
/*!100001 SET @@session.gtid_seq_no=937124*//*!*/;
BEGIN
/*!*/;

Let's see if observation #1 above holds for the other two failures at 10:44:43 and 10:44:57.

The failure at 10:44:43 involved GTID '35-7735-937137', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='95860806'.

This position also seems to be correct, since the GTID '35-7735-937137' is at the position:

# at 95860806
#160417 10:33:07 server id 7735  end_log_pos 95860848 CRC32 0x5a30278a  GTID 35-7735-937137
/*!100001 SET @@session.gtid_seq_no=937137*//*!*/;
BEGIN
/*!*/;

The failure at 10:44:57 involved GTID '35-7735-937151', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='96808225'.

This GTID doesn't appear until position 96809133, a bit later than position 96808225.

# at 96809133
#160417 10:33:08 server id 7735  end_log_pos 96809175 CRC32 0xe44d2cfa  GTID 35-7735-937151
/*!100001 SET @@session.gtid_domain_id=35*//*!*/;
/*!100001 SET @@session.server_id=7735*//*!*/;
/*!100001 SET @@session.gtid_seq_no=937151*//*!*/;
BEGIN
/*!*/;

These observations make me wonder if the SQL thread's position was corrupted after the 10:44:57 failure. I know of at least one open bug (MDEV-9138) that could cause problems with the SQL thread position, but it does not seem similar to this problem.

The user does not have parallel replication enabled.

Some other observations:

  • Sometimes when this problem occurs, the mysqld process has to be killed: the slave thread is killed when the out-of-order sequence error occurs and no longer exists, but it still seems to hold table locks and/or a slave mutex.
  • Other times when this problem occurs, the event can simply be skipped and the slave thread will happily restart.


 Comments   
Comment by Geoff Montee (Inactive) [ 2016-04-19 ]

support issue

Comment by Kristian Nielsen [ 2016-04-19 ]

> If the highest sequence number for the domain was only 937160 at the time,
> why would a sequence number of 937161 cause an out-of-order sequence
> number?

There seems to be some confusion here. The out-of-order error is not about
the slave GTID replication position. It is about the slave binlog.

If gtid_strict_mode is set, a commit on the server is first checked against
the current binlog. If the commit would cause an out-of-order sequence in
the binlog, the commit is rejected with the out-of-order error message.

So the error occurs when trying to replicate a GTID that is already in the
slave binlog, independently of what the slave replication position is.

It is not a bug in itself to get this out-of-order error, it is a feature to
protect against mistakes in the replication setup. You need to check the
slave binlog and find out why the slave is replicating GTIDs that it already
has in its binlog. One way this can happen is if there are multiple ways
that GTIDs arrive (multi-source replication or maybe mysqldump|mysqlbinlog).
Another way is if there are manual transactions done on the slave with same
domain id...
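
A quick way to see what the slave's own binlog already contains, which is what the strict-mode check compares against, is to look at the slave's binlog GTID state; a sketch (run on the slave):

-- Last GTID written to this server's binlog, per replication domain:
SELECT @@GLOBAL.gtid_binlog_pos;
-- Highest sequence number binlogged per (domain_id, server_id) pair;
-- look for the 35-7735-... entry and compare it with the GTID in the error:
SELECT @@GLOBAL.gtid_binlog_state;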

Comment by Geoff Montee (Inactive) [ 2016-04-19 ]

> There seems to be some confusion here. The out-of-order error is not about
> the slave GTID replication position. It is about the slave binlog.

I understand that. However, data from this particular gtid_domain_id (35) is only coming from data replicated from dc2master. The only way that a GTID from this domain should end up in the slave's binlog is if it replicates it from dc2master's binlog.

> It is not a bug in itself to get this out-of-order error, it is a feature to
> protect against mistakes in the replication setup.

Thanks, but I understand what the feature is for. It just does not seem to be functioning as-expected in this case.

> You need to check the slave binlog and find out why the slave is replicating GTIDs that it already has in its binlog.

Well, I'm not entirely sure that the GTIDs are in the slave's binlog, because as I mentioned, they should only be there if they came from dc2master's binlog. However, I will ask for the binlogs from this slave, so I can check.

Any thoughts on why, for the out-of-order sequence errors starting at 10:44:57, the GTIDs aren't even at the binlog position where the SQL thread supposedly encounters the error?

Or why the whole system sometimes hangs when this happens, and the only way to un-hang it is to kill mysqld?

Comment by Kristian Nielsen [ 2016-04-19 ]

> Any thoughts on why the GTIDs in the out-of-order sequence errors starting
> at 10:44:57 aren't even at the position in the binlog that the SQL thread
> supposedly encounters the error at?

It's very hard to say from the information given. Try pasting the relevant
part of the master's binlog around those events (it's ok to replace data
with XXX or something if it is considered not suitable to publish).

> Or why the whole system sometimes hangs when this happens, and the only
> way to un-hang it is to kill mysqld?

Server hang should always be a bug (and there seems to be complaints about
innodb?). But it is very hard to say something sensible about it without a
way to reproduce.

Hm, one thought. SQL_SLAVE_SKIP_COUNTER is used, I wonder if the slave IO
thread was stopped before setting it? It is not allowed to set
SQL_SLAVE_SKIP_COUNTER when any slave thread is running, especially in GTID
mode this can cause a number of problems. There is currently a bug (see
discussion in MDEV-9138) that the server does not enforce this, which is
quite unfortunate.

So this might be related. Try to check if the IO thread was running when
SQL_SLAVE_SKIP_COUNTER was set. And see if stopping the IO thread before
changing SQL_SLAVE_SKIP_COUNTER helps to avoid the problem?
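
A minimal sketch of the check Kristian suggests, using the connection name from the logs above; the counter should only be set once both threads report not running:

SET SESSION default_master_connection = 'dc2master';
SHOW SLAVE 'dc2master' STATUS;   -- Slave_IO_Running and Slave_SQL_Running should both be 'No'
STOP SLAVE;                      -- if they are not, stop both threads first
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;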

Comment by Geoff Montee (Inactive) [ 2016-04-21 ]

There are 3 masters in this topology:

dc2-master2 <===> dc2-master <===> dc1-master

  • dc1-master is the master that always encounters this error.
  • The GTIDs that trigger this problem always originate from dc2-master2. e.g.: dc2-master2 has gtid_domain_id set to 35, and server_id set to 7735.

I've looked through the binlogs from all 3 masters, and I've noticed the following pattern each time that this error has occurred:

1.) When dc1-master encounters the out-of-order sequence number error about a particular GTID, the transaction with that GTID in dc1-master's binlog actually corresponds to a subset of the preceding transaction from dc2-master and dc2-master2.

For example, when dc1-master got this error:

160417 10:33:07 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Gtid 35-7735-937125, Internal MariaDB error code: 1950
160417 10:33:07 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Error_code: 1950
160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
160417 10:33:07 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
TRANSACTION 1521353596, not started
mysql tables in use 1, locked 4
MySQL thread id 329862, OS thread handle 0x7efc5a4da700, query id 137799390 Waiting for slave mutex on exit

The transaction with GTID 35-7735-937125 in dc1-master's binlog actually corresponds to a subset of the transaction with GTID 35-7735-937124 from dc2-master's and dc2-master2's binlogs.

2.) When dc1-master replicates a transaction with the wrong GTID, as explained above, the original GTID of the transaction is not present in the log at all.

For example, when dc1-master mistakenly replicated transaction with GTID 35-7735-937124 as GTID 35-7735-937125, the transaction's original GTID (35-7735-937124) is not present anywhere in dc1-master's log.

Comment by Kristian Nielsen [ 2016-04-21 ]

This sounds a lot like the problem with setting sql_slave_skip_counter
without stopping the IO thread, that I referred to. I think the SQL thread
restarts in the middle of a transaction.

The server really isn't supposed to allow this, it can cause all sorts of problems, possibly including corrupting internal server state. I need to get hold of Monty and find out what the intention was with the patch that introduced this.

As I said, you can try to see if it helps to ensure manually that
sql_slave_skip_counter (or other replication admin commands like CHANGE
MASTER, SET GLOBAL @@gtid_slave_pos, ...) is not done unless both the slave
sql and io threads are stopped.

I wonder if this also causes the strange server_id=0 in another bug report?

Comment by Geoff Montee (Inactive) [ 2016-04-21 ]

> This sounds a lot like the problem with setting sql_slave_skip_counter
> without stopping the IO thread, that I referred to. I think the SQL thread
> restarts in the middle of a transaction.

I do see in the logs that the I/O thread was started about 2 seconds prior to setting sql_slave_skip_counter:

160417 10:45:19 [ERROR] Master 'dc2master': Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
160417 10:45:19 [Note] Master 'dc2master': Slave I/O thread killed while reading event
160417 10:45:19 [Note] Master 'dc2master': Slave I/O thread exiting, read up to log 'mariadb-bin.007023', position 16581082; GTID position 21-7521-7596336,0-7725-126,15-7715-29238650,35-7735-937719,17-7725-102954662,11-7511-6574751
160417 10:45:19 [Note] Master 'dc2master': Slave SQL thread initialized, starting replication in log 'mariadb-bin.007022' at position 96808225, relay log '/data/logs/relay/relay-bin-dc2master.000001' position: 4; GTID position '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023'
160417 10:45:19 [Note] Master 'dc2master': Slave I/O thread: connected to master 'replicator@10.35.20.41:3306',replication starts at GTID position '21-7521-7596336,0-7725-126,15-7715-29236023,35-7735-937150,17-7725-102939940,11-7511-6574751'
160417 10:45:21 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='96808225', GTID '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='277601', master_log_name='mariadb-bin.007022', master_log_pos='97085707', GTID '21-7521-7596336,0-7725-126,35-7735-937151,11-7511-6574751,17-7725-102939940,15-7715-29236023';
160417 10:45:21 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Gtid 35-7735-937161, Internal MariaDB error code: 1950
160417 10:45:21 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Error_code: 1950
160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
160417 10:45:21 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 6.
TRANSACTION 1521380481, not started
mysql tables in use 1, locked 6
MySQL thread id 348640, OS thread handle 0x7efc5a4da700, query id 138034243 Waiting for slave mutex on exit

I will let these users know that they should only set sql_slave_skip_counter when both replication threads are stopped.

> The server really isn't supposed to allow this, it can cause all sorts of problems, possibly including corrupting internal server state. I need to get hold of Monty and find out what the intention was with the patch that introduced this.

If it can cause problems like this, then I agree that it sounds quite dangerous. Sometimes this problem causes this user's system to lock up so badly that they have to kill mysqld.

> I wonder if this also causes the strange server_id=0 in another bug report?

I submitted that bug report on behalf of the same users, so it's quite likely that it has the same root cause as this one.

Thanks for the assistance!

Comment by Geoff Montee (Inactive) [ 2016-04-22 ]

These users say that when they use sql_slave_skip_counter, they always execute it in the following way:

set session default_master_connection='dc2master';
stop slave;
set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
start slave;

So it seems that the I/O thread should have been stopped when they encountered this problem. I believe the 2 second difference in timestamps in the above output might just be due to delays in the threads starting up.

Can you think of anything else that might lead to problems like this?

Comment by Kristian Nielsen [ 2016-04-25 ]

It still seems the most likely explanation that SQL_SLAVE_SKIP_COUNTER was
changed with the IO thread running, and the bug made the server corrupt its
state instead of giving an error. Maybe just once, and then other problems
were caused by this. But without a way to reproduce, only wild guesses are
possible.

Note that the topology given requires special actions on the part of the
user to prevent replication loops. Otherwise a transaction from dc2-master2
can loop forever between dc2-master and dc1-master.

Comment by Kristian Nielsen [ 2016-04-25 ]

Hm, actually SQL_SLAVE_SKIP_COUNTER only requires the SQL thread be stopped. I was thinking of other variables, like gtid_slave_pos, slave_parallel_threads, slave_domain_parallel_threads, and gtid_ignore_duplicates. Those variables have the bug that the server allows changing them with IO thread running, even though this can corrupt the server state in some cases.

Comment by Geoff Montee (Inactive) [ 2016-04-25 ]

> Note that the topology given requires special actions on the part of the
> user to prevent replication loops. Otherwise a transaction from dc2-master2
> can loop forever between dc2-master and dc1-master.

dc2-master2 is actually a relatively new addition to the topology. The topology used to just have two masters that replicated like this:

dc2-master <===> dc1-master

They were still getting similar issues with that topology. For example, MDEV-9670 was reported with that topology. They added dc2-master2 hoping that it would somehow make the problem go away, but it hasn't.

> I was thinking of other variables, like gtid_slave_pos, slave_parallel_threads, slave_domain_parallel_threads, and gtid_ignore_duplicates. Those variables have the bug that the server allows changing them with IO thread running, even though this can corrupt the server state in some cases.

I don't see any mention of this requirement in the documentation for any of these variables, except for slave_parallel_threads. Should a note be added about that until the bug you're referring to has been fixed?

https://mariadb.com/kb/en/mariadb/gtid/#gtid_slave_pos

https://mariadb.com/kb/en/mariadb/gtid/#gtid_ignore_duplicates

https://mariadb.com/kb/en/mariadb/replication-and-binary-log-server-system-variables/#slave_parallel_threads

https://mariadb.com/kb/en/mariadb/replication-and-binary-log-server-system-variables/#slave_domain_parallel_threads

In this particular case, the users said that they haven't dynamically changed any of these variables.

Comment by Michaël de groot [ 2016-05-09 ]

Some more information: the user no longer experiences MDEV-9952 after they made a change in the configuration. It is now:
SFO2 <-> SFO1 <-> DC1, using multi-source replication and requiring gtid-ignore-duplicates.

Both occurrences happened while doing bulk data loading. I'm investigating now to find similarities and will keep you up to date. Perhaps you have some ideas on where to look and what information to gather?

Comment by Kristian Nielsen [ 2016-05-10 ]

Maybe some corner cases in error handling, or some replication filtering?

gtid-ignore-duplicates is a very complicated setup, so a lot of care will be required on the part of the user to make sure everything is right. But stuff like getting server_id=0 or partial transactions really shouldn't happen, that really needs to be fixed...

Comment by Michaël de groot [ 2016-05-10 ]

There are no replication filters turned on.

What do you think about this part in the error log:
160417 10:33:07 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.

Does this mean the slave SQL thread has tables locked when it is stopped due to gtid strict mode? How can that happen? I would assume the strict-mode checks are enforced before it locks any tables.

I think these table locks might be causing another issue, leaving the server unusable until a hard restart.

Comment by Kristian Nielsen [ 2016-05-10 ]

It sounds like something is wrong internally in the server, just as the server_id=0 GTIDs and the partial transactions in the binlog suggest...

Comment by Michaël de groot [ 2016-05-10 ]

I agree. That doesn't bring us closer to fixing the issue though.

The masters are a bit out of sync. I know there was at least one user error; I'm investigating whether this issue causes more inconsistencies. Can this change the situation in any way?

Comment by Michaël de groot [ 2016-05-10 ]

knielsen: I was wrong, there ARE replication filters turned on:

| replicate_ignore_db              | customer_product_prestage,customer_product_archive |

I have the general query log from one of the occurrences of the issue, and around the moment of the issue it contained many instances of:

COMMIT /* implicit, from Xid_log_event */

Do you think the issue might have to do with replication filters? If so, can you elaborate?

Comment by Kristian Nielsen [ 2016-05-11 ]

To repeat myself:

> But without a way to reproduce, only wild guesses are possible.

For this particular wild guess, filtering is one way that complexity and corner-cases can appear, and a bug might lead to incorrect filtering of part of transactions?

Filtering, multi-source, multi-path gtid_ignore_duplicates, sql_slave_skip_counter, ... this is a horribly complex setup, it really requires having a test system where problems can be reproduced and analysed and reduced to a test case...

Comment by Michaël de groot [ 2016-05-11 ]

The sql_slave_skip_counter was only set to work around this bug; it is not part of the normal setup.

We have a test system; I can set it up as an A <> B <> C multi-source replication ring. I'm happy to run any test you like, but I just have no clue what to test. I went through a 350GB general log yesterday hoping to find something crazy (and have a test case), but besides the transactions ignored by replicate-ignore-db I did not find anything.
If we can pinpoint this to replication filters, I am also happy to disable them and see if that solves the issue. I just need something more than the fact that the setup is complex.

Comment by Michaël de groot [ 2016-05-11 ]

I have just found that this issue often results in another issue where some tables get locked indefinitely. The only fix for this is to restart the server. Would a core dump of this be of any help?

Comment by Matt Neth [ 2017-04-27 ]

I'm having this happen as well, and I'm not using replication filters anywhere, so if I can provide any helpful information, let me know. I've also seen the server ID 0 event happen!

Our three DBs are 10.0.30 using multi-source replication where each one replicates to and from the others, and we have gtid_strict_mode=1 and gtid_ignore_duplicates=1. One DB is a physical server here at our office datacenter, one is an Amazon EC2 instance, and one is a Linode.

The issue seems to show up as a slave thread being stuck waiting for a table lock, as well as direct inserts to the master being stuck. Doing a 'show engine innodb status' during this time shows no deadlocks, and there appears to be nothing relevant in the slow query log.

Comment by Matt Neth [ 2017-05-08 ]

This issue seems to manifest in three different ways for us:
1. Random out-of-order issue stopping replication, even though gtid_ignore_duplicates is on
2. One of the slave threads on a master getting indefinitely stuck on 'waiting for table level lock'; 'show all slaves status' commands never complete, and the mysql process needs to be killed with signal 9
3. Direct inserts to a master will get indefinitely stuck at 'waiting for table level lock', and it takes a signal 9 to kill the service as well. It's happened to servers even after we reimported all data from scratch.

I've never seen a DEADLOCK section during any of these errors when I check SHOW ENGINE INNODB STATUS.
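
For what it's worth, 'waiting for table level lock' is a server-level (THR_LOCK) wait rather than an InnoDB row lock, so it would not be expected to show up in SHOW ENGINE INNODB STATUS or innodb_locks. Two server-level checks that may be more telling the next time this happens (a sketch):

SHOW OPEN TABLES WHERE In_use > 0;     -- tables currently held with a table-level lock
SELECT ID, USER, COMMAND, TIME, STATE, INFO
  FROM information_schema.PROCESSLIST
 WHERE STATE LIKE '%lock%';            -- sessions waiting on any kind of lock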

For #1, I'm still waiting to get a core dump from right after the out-of-order error shows up. This tends to show in the error log when the out-of-order error happens (but not every time):
170508 8:00:07 [ERROR] Master 'OTHERMASTER': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
TRANSACTION 357571220, not started
mysql tables in use 1, locked 4
MySQL thread id 4, OS thread handle 0x7f88b3aeb700, query id 2558283 Waiting for slave mutex on exit
len 1472; hex 065c6e05000000000000000000000000000000000000000003000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000088508199887f000088c98e9c887f0000000000000000000093e5db000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000080419999887f0000000000000000000000000000000000000000000000000000887b8399887f000000000000000000000000000000000000be8dd8000000000001000000000000000100
00000000000000000000000000000100000000000000010000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000102700000000000000000000000000000000000000000000000000000000000000000
00000000000ae1a501500000000d76b105900000000941a5015000000000100000000000000180000000000000000000000000000004d7953514c58696403000000000000004b09270000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000b7beb02475000000000000000000000008d08299887f00000000000
0000000007bebc1080000000000000000000000000000000000000000010000000000000004000000000000006806129f887f00000000000000000000688663a0887f000068c68e9c887f000000000000000000000000000000000000000000000a00000000000000
000000000000000000000000287637fa887f00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000300000000000000000000000
000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000708663a0887f0000704d819988
7f000000000000000000009fe5db0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000048038199887
f0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000; asc ... (ASCII rendering of the hex dump omitted)

For #2, I uploaded a core dump to ftp.askmonty.org/private/ called MDEV9952.slave-table-lock-core.tar.gz.

For #3, I'm still waiting to get a core dump of this as well. I did get a stack trace when this happened this morning, and the output for the specific thread ID of the stuck query was this:
Thread 18 (Thread 0x7f887ca8d700 (LWP 13042)):
#0 0x00007f9187bf4a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000bdb447 in wait_for_lock.isra.2 ()
#2 0x0000000000bdbe4b in thr_multi_lock ()
#3 0x000000000080a214 in mysql_lock_tables(THD*, st_mysql_lock*, unsigned int) ()
#4 0x000000000080b156 in mysql_lock_tables(THD*, TABLE**, unsigned int, unsigned int) ()
#5 0x0000000000586bd8 in lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) ()
#6 0x000000000058fc32 in open_and_lock_tables(THD*, TABLE_LIST*, bool, unsigned int, Prelocking_strategy*) ()
#7 0x00000000005b850c in mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool) ()
#8 0x00000000005d546f in mysql_execute_command(THD*) ()
#9 0x00000000005da989 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#10 0x00000000005dc715 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#11 0x00000000006a4d9b in do_handle_one_connection(THD*) ()
#12 0x00000000006a4e59 in handle_one_connection ()
#13 0x00007f9187bf0dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f918646973d in clone () from /lib64/libc.so.6

Our mysql config is as follows, with each server having a different server_id, domain_id, and auto_increment_offset:
transaction-isolation=READ-COMMITTED
innodb-buffer-pool-size=32G
innodb-flush-log-at-trx-commit=1
innodb-file-per-table=1
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb-doublewrite=1
innodb_log_file_size=512M
innodb-log-files-in-group=2
innodb-buffer-pool-instances=8
innodb-thread-concurrency=0
innodb_stats_on_metadata=0
innodb-file-format=barracuda
innodb-strict-mode=1
innodb-flush-method = O_DIRECT
max-connections=1000
log-error=db-errorlog
relay-log=db-relaylog
log-basename=db
slow-query-log=1
slow-query-log-file=/var/log/db-slowquerylog
long-query-time=60
log-slow-slave-statements=1
server-id=3
report-host=HOSTNAME
gtid-domain-id=3
auto_increment_increment=6
auto_increment_offset=3
slave-parallel-threads=0
slave-domain-parallel-threads=0
slave-parallel-max-queued=1M
log-bin=db-binlog
log-bin-index=db-binlog-index
binlog-format=ROW
log-slave-updates=1
binlog-stmt-cache-size=1M
slave-transaction-retries=5
slave-exec-mode=IDEMPOTENT
gtid-ignore-duplicates=1
gtid-strict-mode=1
binlog-checksum=CRC32
master-verify-checksum=1
slave-sql-verify-checksum=1
relay-log-purge=1
sync-binlog=5
sync-relay-log=500
relay-log-recovery=1
expire-logs-days=14
max-allowed-packet=16777216
table-open-cache=600
table-definition-cache=600
skip-name-resolve

Comment by Matt Neth [ 2017-05-16 ]

I uploaded MDEV9952.slave-table-lock-mysqld.tar.gz to ftp.askmonty.org/private/ which is the mysqld binary for the core dump file I uploaded before: MDEV9952.slave-table-lock-core.tar.gz

Comment by Michaël de groot [ 2017-05-16 ]

The customer on whose behalf this issue was added reported that the issue disappeared after they stopped writing to multiple nodes. Maybe that helps you, mneth1, to work around it, and knielsen to reproduce/fix it.

Comment by Kristian Nielsen [ 2017-05-16 ]

This looks like the relevant part of the attached stacktrace:

Thread 17 (Thread 0x7f4738df6700 (LWP 24880)):
#0  0x00007f510cbcfa82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000bdc4a7 in wait_for_lock.isra.2 ()
#2  0x0000000000bdceab in thr_multi_lock ()
#3  0x0000000000809ff4 in mysql_lock_tables(THD*, st_mysql_lock*, unsigned int) ()
#4  0x000000000080af36 in mysql_lock_tables(THD*, TABLE**, unsigned int, unsigned int) ()
#5  0x0000000000586c18 in lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) ()
#6  0x000000000058fca2 in open_and_lock_tables(THD*, TABLE_LIST*, bool, unsigned int, Prelocking_strategy*) ()
#7  0x0000000000812453 in Rows_log_event::do_apply_event(rpl_group_info*) ()
#8  0x000000000055bc0d in apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*, rpl_parallel_thread*) ()
#9  0x000000000055e3e5 in handle_slave_sql ()
#10 0x00007f510cbcbdc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f510b44473d in clone () from /lib64/libc.so.6

So the SQL thread is stuck on a table lock. And in turn STOP SLAVE is
waiting, and some SHOW SLAVE STATUS or similar.

So if there is no genuine table lock held, then I wonder if this is an MDL
locking issue? It might be related to replication, or it might only be
triggered by some specific replication load...

It still doesn't really seem to help to understand what causes the problem, unfortunately.

Comment by Matt Neth [ 2017-05-16 ]

Hm yeah, when these issues occur I've never seen a deadlock listed in SHOW ENGINE INNODB STATUS or in information_schema.innodb_locks/innodb_lock_waits.

Comment by Michaël de groot [ 2017-05-16 ]

mneth1, if you can reproduce this, could you try to enable the metadata lock info plugin and see if anything shows up there? Link: https://mariadb.com/kb/en/mariadb/metadata_lock_info/
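
For reference, enabling and querying the plugin amounts to the following (a sketch; the plugin is bundled with MariaDB):

INSTALL SONAME 'metadata_lock_info';
SELECT * FROM information_schema.METADATA_LOCK_INFO;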

Comment by Matt Neth [ 2017-05-17 ]

The mysql process on one of our masters died by itself today, so I wasn't able to see the contents of metadata_lock_info. However, after mysql restarted following the crash, the slave threads were stopped with the out-of-order error AND the transaction it stopped on had the server ID 0 bug from MDEV-9670, so hopefully this new core dump will help with multiple bugs at once.

The new core was uploaded to the FTP server with the filename: MDEV9952-core-crash.tar.gz

I attached the error log from the time of the crash to this bug as 5-17-crash-errorlog, and the full backtrace as 5-17-crash-gdb-bt

Comment by Matt Neth [ 2017-05-18 ]

@michaeldg Here's the output from the metadata lock table when we had some stuff stuck at waiting for table lock. Normally when this happens it's just the slave thread that is stuck; this time there were direct queries stuck in addition to the slave thread:

THREAD_ID,LOCK_MODE,LOCK_DURATION,LOCK_TYPE,TABLE_SCHEMA,TABLE_NAME
1006956,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1007464,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1007962,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1008252,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1013299,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1013354,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1016872,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1028208,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1034919,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1034954,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1043397,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1043401,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1061963,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1061994,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1061998,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1074327,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1074846,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1074911,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
4,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1086587,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1086639,MDL_INTENTION_EXCLUSIVE,\N,Global read lock,,
1006956,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1007464,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1007962,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1008252,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1013299,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1013354,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1016872,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1028208,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1034919,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1034954,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1043397,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1043401,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1061963,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1061994,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1061998,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1074327,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1074846,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1074911,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1086587,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
1086639,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRC
4,MDL_SHARED_WRITE,\N,Table metadata lock,mark,tableRCFG

Comment by Kristian Nielsen [ 2017-05-18 ]

What does "Global read lock" in there mean?
Is there something holding a FLUSH TABLES WITH READ LOCK?
If so, it is hardly surprising that the slave is blocked...
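
One way to see which sessions are behind those global read lock entries is to join the plugin table shown above with the processlist; a sketch using the column names from that output:

SELECT ml.THREAD_ID, ml.LOCK_MODE, pl.USER, pl.COMMAND, pl.TIME, pl.STATE, pl.INFO
  FROM information_schema.METADATA_LOCK_INFO ml
  LEFT JOIN information_schema.PROCESSLIST pl ON pl.ID = ml.THREAD_ID
 WHERE ml.LOCK_TYPE = 'Global read lock';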

Comment by Matt Neth [ 2017-05-18 ]

I'm not sure how to find what would be holding the FTWRL; none of our stuff is doing that intentionally, as far as I can tell.

Comment by Michael Widenius [ 2017-05-29 ]

I have looked at the latest uploaded logs in detail. Here are the questions I asked on IRC
(I am adding them here so the information is not lost).
It includes a note from Kristian...

Error from updated logs:

170523 8:00:04 [ERROR] Master 'aws-woprm5': Slave SQL: An attempt was made to binlog GTID 2-2-58631956 which would create an out-of-order sequence number with existing GTID 2-2-58631956, and gtid strict mode is enabled. Gtid 2-2-58631956, Internal MariaDB error code: 1950
170523 8:00:04 [ERROR] Master 'aws-woprm5': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'wopr-binlog.000032' position 11396248; GTID position '5-5-19235506,2-2-58631955,3-3-34192152,6-6-258,1-1-1403399669,4-4-199673641'

The strange thing is that the slave is not running strict mode (as
said by user, I am trying to verify that), but even then, the stop
GTID is less than the out-of-order number.

How can the stop position be less ?

This is a ring setup, so I assume the slave can get the GTID twice, but there is no indication of this in the
slave's binary log. (Unfortunately I didn't get the slave's relay log at the time of the incident, which would have
explained more things.)

knielsen> montywi: The slave has position 2-2-...955, it tries to replicate 2-2-...956, but that fails because slave binlog also has 2-2-...956. So replication fails and rolls back, and stop position is one step before, 2-2-...955. Seems ok

How can the slave have ...955 if the binlog has ...956 ?
With replicate, do you mean write-to-binlog or apply ?

With binlog, do you mean that slave has written 956 to binlog? If
that's the case, then slave should be at 956, not 955 ?

I am looking at the master log and this has strangely enough
GTID 2-2-..955 and next is GTID 2-2-..957, no ..956 in the master binlog.
(There are some 4-4-... GTIDs in between, but this shouldn't be an issue I assume)

All in the same log, so there can't be any master crash in between.

I would have assumed that the scenario would be like the following:

  • The slave gets GTID 955 and applies it. Then it gets 956 and applies it. The next time it
    sees 956 it will not apply it, but keeps 956 as it is already committed.

Looking at the logs, I found something very strange:

On the master, you have GTIDs 2-2-58631954, then ..955, then ..957.
On the slave, you have GTIDs 2-2-58631954, then ..956, then ..957.

It looks (hard to verify) like the slave has applied ..955 as ..956, while
the master skipped number ..956.

I tried to check whether master event ..955 is the same as slave event ..956. The
events look similar (a lot of Update_rows events). However, the sizes
of the events are 394211 on the master and 362787 on the slave, so they are
different.

I wonder under which circumstances a GTID would be missing on the master.
I checked whether a rollback could do that, but it didn't.

I assume that on the slave things can be different, as after the crash the
user probably restarted with a skip-event.

However, I can't understand how the master can have one number missing and
how the slave can have ..956, which is not on the master.

I also don't understand how the slave could originally complain about a duplicate
..956, which doesn't exist in the master binlog.

Could some of that be explained by the user having restarted replication
with --skip-slave?
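
For anyone retracing this comparison, the GTID sequence around the stop position can be listed straight from the binlog on each server; a sketch using the file name and position from the error above:

SHOW BINLOG EVENTS IN 'wopr-binlog.000032' FROM 11396248 LIMIT 20;
-- The Gtid events in the output carry the 2-2-... sequence numbers
-- being compared between master and slave above.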

Comment by Matt Neth [ 2017-05-29 ]

A couple bits of info that may help:
I thought the issue was related to multi-source replication or the ring setup, so I made one of my masters have just a single slave connection, and that master still has these issues. M2 (server ID 2) < - > M4 (server ID 4) < - > M3 (server ID 3). The server M4 has one other slave thread connection not shown here (M5), but M3 just has the one slave connection to M4.

gtid_strict_mode is definitely enabled on all of them. The way we fix the out-of-order issue that is stopping replication is by temporarily disabling gtid_strict_mode, NOT by using sql_slave_skip_counter:
set global gtid_strict_mode=0;
start slave 'broken';
set global gtid_strict_mode=1;
This lets the slave thread catch back up. Some of the times we do this, though, the slave thread gets stuck at "waiting for table level lock".
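
Before relaxing strict mode, it may also be worth comparing what the server believes it has replicated against what it has written to its own binlog; a sketch:

SELECT @@GLOBAL.gtid_slave_pos   AS replicated_pos,
       @@GLOBAL.gtid_binlog_pos  AS binlogged_pos,
       @@GLOBAL.gtid_current_pos AS effective_pos;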

Comment by Elena Stepanova [ 2017-06-18 ]

Since all cases of the out-of-order problem seem to involve M2 (server ID 2) in one way or another, it might be important to have the cnf file from it. We've got M3 and M4 configs; is either of them identical to M2, apart from server/domain IDs?

Comment by Matt Neth [ 2017-06-19 ]

M2's is identical to M3's except for server/domain ID and report-host. The out-of-order error has happened on M3 and M4 as well, but the last few times have been on M2.

Comment by Ulrich Moser (Inactive) [ 2017-11-30 ]

Hi,
we have a similar problem with out-of-order GTIDs.
We have server S1 (server_id=10) and server S2 (server_id=11) in a master-master setup, both with gtid_strict_mode = OFF. We then attached a new slave S3 (server_id=60) to S1, with gtid_strict_mode set to ON.
We got the following GTIDs causing the slave to stop with an out-of-sequence error: 0-10-1269188142 followed by 0-11-1269188141. Since the server ids are different, I would not expect an out-of-order error here.

Comment by Kristian Nielsen [ 2017-11-30 ]

> Since server ids are different I would not expect an out-of-order eror here.

No, different server ids does not mean that out-of-order is allowed.

On the contrary, out-of-order sequence numbers between different server ids
(but the same gtid_domain_id) are exactly what gtid_strict_mode is meant to flag
as an error.

To allow out-of-order (as in a master-master ring), you need different
domain id on each master. Out-of-order sequence ids are allowed in
gtid_strict_mode between different domain ids, but not between different
server ids within the same domain.

See the docs for more, eg.
https://mariadb.com/kb/en/library/gtid/#use-with-multi-source-replication-and-other-multi-master-setups
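
A minimal sketch of that, using domain ids chosen here only for illustration (the same settings can also go into my.cnf as gtid-domain-id, as in the config pasted earlier in this ticket):

-- On S1:
SET GLOBAL gtid_domain_id = 10;
-- On S2:
-- SET GLOBAL gtid_domain_id = 11;
SELECT @@GLOBAL.gtid_domain_id, @@GLOBAL.server_id;   -- verify each master differs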

Comment by Ulrich Moser (Inactive) [ 2017-11-30 ]

Thanks Kristian.

Comment by Mannoj Kumar Saravanan [ 2018-04-19 ]

Hi Team,
I'm setting up the model below using multi-source replication.
A == B <--> D == E

A & B are a Galera cluster with domain_id=4100
D & E are a Galera cluster with domain_id=4101

A & B are masters of E, using the multi-source replication model.
D & E are masters of B, using the multi-source replication model.

wsrep-gtid-mode=on
gtid_strict_mode=1
gtid_ignore_duplicates=1

This works with A & B replicating to D & E, but not in the other direction: when I write to D & E, it doesn't replicate to A & B. Checking B's slave status, I get the error below.

Last_SQL_Error: An attempt was made to binlog GTID 0-195981401-8 which would create an out-of-order sequence number with existing GTID 0-195981401-8, and gtid strict mode is enabled.

https://jira.mariadb.org/browse/MDEV-9952 -> I'm using a different domain_id for each cluster, as mentioned by Kristian Nielsen. Please let me know if I'm missing something.

Comment by Andrei Elkin [ 2018-07-11 ]

The now-fixed MDEV-9670 showed various traces, of which the server_id 0 was just one. Lack of proper transaction state cleanup (the zero-server-id transaction never received its Xid log event, nor the other events vital for the cleanup) was another part, and may lead to locking issues as well.

Considering all that, and that this report was made against a version predating the MDEV-9670 fixes, the case should be closed as a duplicate.

Comment by Andrei Elkin [ 2018-07-13 ]

Reverting it back to active to verify a scenario where the IO thread
restarts and resumes retrieval from a possibly outdated gtid_slave_pos.

Comment by Andrei Elkin [ 2018-07-14 ]

Verified the possibility of reconnecting to the master of a circular setup while some of the slave's own GTIDs have
arrived but have not yet been processed by the applier thread. In such a case the IO thread may resume
retrieval from a past own GTID, but this does not create any issue apart from transferring extra events.
The re-delivered own GTIDs are not re-executed (unless requested explicitly).
Importantly, external GTIDs (those generated by a different node) won't be retrieved a second time, and therefore there is no "OOO" (Out-Of-Order) threat.
