[MDEV-11201] gtid_ignore_duplicates incorrectly ignores statements even when GTID replication is not enabled Created: 2016-11-01  Updated: 2017-04-10  Resolved: 2017-04-10

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0, 10.1, 10.2
Fix Version/s: 10.0.31

Type: Bug Priority: Major
Reporter: Michaël de groot Assignee: Lixun Peng
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Duplicate
duplicates MDEV-12287 Enabling gtid_ignore_duplicates witho... Closed

 Description   

Hi!

gtid_ignore_duplicates ignores statements even when GTID replication is not enabled. I can reproduce it and I can give you access to the VM's where I can reproduce it.

It seams that duplicate STATEMENTS always get ignored and unique STATEMENTS only get ignored when the slave gets restarted.

master config:

server_id=1
log-bin
replicate_ignore_db=mysql
binlog_format=STATEMENT

Slave config:

[mysqld]
server_id=2
gtid_ignore_duplicates=1
log-slave-updates
log-bin

Steps to reproduce:
1. Setup replication between two nodes

2. Create 10 test tables and insert data into them:

CREATE TABLE `test1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `b` int(10) unsigned NOT NULL,
  `c` char(50) NOT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3580584 DEFAULT CHARSET=latin1;
CREATE TABLE test2 LIKE test1;
CREATE TABLE test3 LIKE test1;
CREATE TABLE test4 LIKE test1;
CREATE TABLE test5 LIKE test1;
CREATE TABLE test6 LIKE test1;
CREATE TABLE test7 LIKE test1;
CREATE TABLE test8 LIKE test1;
CREATE TABLE test9 LIKE test1;
CREATE TABLE test10 LIKE test1;
insert into test1 (b) values(1);
insert into test2 (b) values(1);
insert into test3 (b) values(1);
insert into test4 (b) values(1);
insert into test5 (b) values(1);
insert into test6 (b) values(1);
insert into test7 (b) values(1);
insert into test8 (b) values(1);
insert into test9 (b) values(1);
insert into test10 (b) values(1);

3. Put script.sh and start.sh

[root@test1 ~]# cat script.sh 
#!/bin/bash
 
DATE=`date`
while true; do mysql mariadb -e "INSERT INTO mariadb.$1 (b, c) select RAND()*1000, \"$RANDOM $RANDOM $RANDOM $RANDOM $DATE $RANDOM\" FROM mariadb.$1 LIMIT 10000"; done

#!/bin/bash
 
./script.sh test1&
./script.sh test1&
./script.sh test1&
./script.sh test1&
./script.sh test1&
./script.sh test2&
./script.sh test2&
./script.sh test2&
./script.sh test2&
./script.sh test2&
./script.sh test2&
./script.sh test2&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test3&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test4&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test5&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test6&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test7&
./script.sh test8&
./script.sh test8&
./script.sh test8&
./script.sh test8&
./script.sh test8&
./script.sh test8&
./script.sh test8&
./script.sh test8&
./script.sh test9&
./script.sh test9&
./script.sh test9&
./script.sh test9&
./script.sh test9&
./script.sh test9&
./script.sh test9&
./script.sh test9&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&
./script.sh test10&

4. Run start.sh

5. While this is running, start this script to restart the slave every minute:

while true; do mysql  -e 'stop slave; set global slave_parallel_threads = 4 - @@slave_parallel_threads; start slave;'; date >> log3.log; for i in {0..10}; do ( echo -n $(date); mysql  -e 'show slave status\G' | grep Seconds ) >> log3.log; sleep 2; done; done&

I am not sure if changing slave_parallel_threads has anything to do with it. I just realize it is still there after creating this report.

6. Observer the replication gets out of sync:

MariaDB [mariadb]> select id FROM test7 order by id desc limit 1;
+----------+
| id       |
+----------+
| 11616384 |
+----------+
1 row in set (0.00 sec)
MariaDB [mariadb]> select count(*) FROM test7 where id <= 11616384 order by id desc limit 1;
+----------+
| count(*) |
+----------+
| 10606384 |
+----------+
1 row in set (7.02 sec)

On master:

MariaDB [mariadb]> select count(*) FROM test7 where id <= 11616384 order by id desc limit 1;
+----------+
| count(*) |
+----------+
| 11616384 |
+----------+
1 row in set (1 min 13.63 sec)



 Comments   
Comment by Elena Stepanova [ 2016-11-01 ]

michaeldg,

In the description, you refer to "duplicate statements" and "unique statements". What do you mean by that?

Since it is in the context of gtid_ignore_duplicates, I would assume that you are talking about duplicate GTIDs. But even if the option is wrongly applied when GTID is not enabled, in the described scenario there should be no GTID duplicates. You have one master and one replication channel, where the duplicates come from?

Please also specify the exact command that you use to "setup replication between two nodes".

Comment by Michaël de groot [ 2016-11-06 ]

elenst, sorry for the late reply here.

With duplicate statements I mean:
UPDATE table SET a=a+1 WHERE id=1
UPDATE table SET a=a+1 WHERE id=1
UPDATE table SET a=a+1 WHERE id=1

With unique statements I mean:
UPDATE table SET a=a+1,b="023033 random from bash 312133" WHERE id=1
UPDATE table SET a=a+1,b="342442423 random from bash 5434453" WHERE id=1
UPDATE table SET a=a+1,b="9767887 random from bash 988778" WHERE id=1

GTID is the funny thing about this. I do not use it and still gtid_ignore_duplicates has effect. So I'm not talking about duplicate ID's

I set up replication by:
RESET MASTER;
GRANT REPLICATION SLAVE ON . to 'repl'@'%' identified by 'repl';

CHANGE MASTER TO MASTER_HOST='machine', MASTER_USER='repl', MASTER_PASSWORD='repl';
START SLAVE;

I verified that master_use_gitd=no; the machines were all completely fresh.

I did not verify if this occurs with GTID SBR as well.

Let me know if you have any more questions!

Comment by Kristian Nielsen [ 2016-11-21 ]

It seems likely that something else is going on here.

As Elena wrote there are no duplicate gtids here. But you are using MyISAM tables in a scary way:

INSERT INTO t_myisam SELECT ... FROM t_myisam LIMIT 10000

I wonder if there is any guarantee that this will work correctly? The problem is if the locking is enough to ensure that the LIMIT will select the same number of rows? Probably not the same rows, though that should in principle not matter. It could also be a limitation/bug with parallel replication and MyISAM.

Does the problem still occur if gtid_ignore_duplicates=off? Does it occur if using InnoDB tables?

Comment by Michaël de groot [ 2016-11-28 ]

knielsen, the problem does not occur if gtid_ignore_duplicates is off.

I did not check if it also happened with InnoDB tables. elenst, could you try to reproduce it? I am off for customers the next 4 weeks.

Comment by Elena Stepanova [ 2016-12-01 ]

michaeldg, nice catch, it is very much reproducible.

Even though, indeed, there are no duplicates of any kind, replication seems to misbehave badly when gtid_ignore_duplicates=1 and replication does not actually use GTID.

That is:

  • gtid_ignore_duplicates=0, MASTER_USE_GTID=no - OK
  • gtid_ignore_duplicates=0, MASTER_USE_GTID=current_pos (or slave_pos) - OK
  • gtid_ignore_duplicates=1, MASTER_USE_GTID=current_pos (or slave_pos) - OK
  • gtid_ignore_duplicates=1, MASTER_USE_GTID=no - PROBLEM

Parallel replication is irrelevant.
Identical statements are irrelevant.
Concurrency on master is irrelevant.
replicate_ignore_db is irrelevant.

The basic scenario is:

  • master produces a number of binlog events;
  • slave (configured with gtid_ignore_duplicates) starts replicating, everything goes well;
  • STOP SLAVE / START SLAVE is executed on the slave – it skips events after that.

Test case

--source include/master-slave.inc
--source include/have_binlog_format_statement.inc
 
--let $count= 2000
 
 
--connection slave
--source include/stop_slave.inc
SET GLOBAL gtid_ignore_duplicates = 1;
CHANGE MASTER TO master_use_gtid=no;
 
--connection master
CREATE TABLE t1 (i INT);
--let $run= $count
--echo # Run $run INSERT statements
--disable_query_log
while ($run)
{
    --eval INSERT INTO t1 VALUES ($run)
    --dec $run
}
--enable_query_log
 
SELECT COUNT(*) FROM t1;
--save_master_pos
 
--connection slave
--source include/start_slave.inc
--source include/stop_slave.inc
--source include/start_slave.inc
--sync_with_master
 
--echo #
--echo # Here we synchronized with master and should have $count records:
--echo #
SELECT COUNT(*) FROM t1;
 
--echo #
--echo # Checking the status
--echo #
--let $status_items= Read_Master_Log_Pos,Exec_Master_Log_Pos,Using_Gtid,Slave_IO_Running,Slave_SQL_Running
--source include/show_slave_status.inc
 
# Cleanup
--connection master
DROP TABLE t1;
--sync_slave_with_master
 
--source include/rpl_end.inc
 
--connection slave
SET GLOBAL gtid_ignore_duplicates = DEFAULT;

knielsen, do you want to take it?

Comment by Kristian Nielsen [ 2016-12-01 ]

Ok, now I see it. So this is a race between the IO thread connect and the SQL
thread when not using GTID.

In this case, the IO thread overwrites the slave gtid position with the gtid
position of the IO thread. This causes skip of any transactions fetched
earlier but not yet applied.

I wonder if the solution is to just disable --gtid-ignore-duplicates in
non-gtid mode?

http://lists.askmonty.org/pipermail/commits/2016-December/010145.html

That patch fixes the problem from the test case.

It seems kind of reasonable that --gtid-ignore-duplicates should only do
something when master_use_gtid!=no, but I still think such a change in 10.0
requires a review. Elena, you being still on the inside, maybe you can find
a reviewer...

  • Kristian.
Comment by Elena Stepanova [ 2016-12-01 ]

knielsen, do I understand correctly that with this patch the value of the variable will remain intact, it just will be ignored if GTIDs are not used for replication?

Just clarifying because when I read "disable --gtid-ignore-duplicates in non-gtid mode" first, it sounded to me like the variable would be forcefully unset, which would be a wrong thing to do; but I don't see anything like that in the patch.

Comment by Kristian Nielsen [ 2016-12-01 ]

> do I understand correctly that with this patch the value of the variable
> will remain intact, it just will be ignored if GTIDs are not used for
> replication?

Correct.

Specifically, with the patch, if a slave-master connection has
MASTER_USE_GTID=no, then that connection will ignore any
--gtid-ignore-duplicates=1 setting. While another slave-master connection
with MASTER_USE_GTID=slave_pos|current_pos will still honor
--gtid-ignore-duplicates.

Comment by Elena Stepanova [ 2016-12-01 ]

knielsen, thanks.

plinux, could you please review the patch?

Comment by Michaël de groot [ 2016-12-02 ]

Hi everybody!

Sorry to bother you but I think a little more investigation may be required. Does this occur due to GTID not being enabled? Can we guarantee that this does not occur if GTID is enabled?

Thanks,
Mcihaël

Comment by Michaël de groot [ 2016-12-02 ]

And thank you, elenst, for getting to the bottom of this, filtering out the complexity of what I submitted.

Comment by Kristian Nielsen [ 2016-12-02 ]

This particular bug will not occur when GTID is enabled, because the
offending code in the IO thread that overwrites the slave position is only
run in non-GTID case:

  if (mi->using_gtid != Master_info::USE_GTID_NO)
    ...
  else
  {
    /*
      If we are not using GTID to connect this time, then instead request
      the corresponding GTID position from the master, so that the user
      can reconnect the next time using MASTER_GTID_POS=AUTO.
 
    query.append("SELECT binlog_gtid_pos('");
 
    if (!mysql_real_query(mysql, query.c_ptr_safe(), query.length()) &&
        (master_res= mysql_store_result(mysql)) &&
        (master_row= mysql_fetch_row(master_res)) &&
        (master_row[0] != NULL))
    {
      rpl_global_gtid_slave_state->load(mi->io_thd, master_row[0],
                                        strlen(master_row[0]), false, false);
    }

Comment by Michaël de groot [ 2016-12-02 ]

Ah ok, excellent. Thank you knielsen!

Comment by Lixun Peng [ 2017-04-09 ]

knielsen I reviewed the patch here: http://lists.askmonty.org/pipermail/commits/2016-December/010145.html
You can push it.

Comment by Kristian Nielsen [ 2017-04-10 ]

Pushed to 10.0

Generated at Thu Feb 08 07:48:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.