Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version: 10.0.26
- Environment: linux
Description
When doing STOP SLAVE SQL_THREAD and START SLAVE SQL_THREAD repeatedly, parallel replication sometimes fails on START with a duplicate key error. SHOW SLAVE STATUS then always shows a relay log position at the beginning of a log file, e.g. "Relay_Log_Pos: 4". This only seems to happen if a large transaction is split across more than one relay log file.
How to reproduce:
Master config:
[mysqld]
server-id=1
log-bin=master-bin
binlog-format=statement
Slave config:
[mysqld]
server-id=2
slave_parallel_threads=20
max_relay_log_size=98304
Create replication user on the master:
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON *.* TO 'repl'@'%';
RESET MASTER;
Start slave:
SET GLOBAL gtid_slave_pos='0-1-2';
CHANGE MASTER TO Master_host='mymasterip', Master_user='repl', Master_password='password', master_use_gtid=slave_pos;
START SLAVE;
On the master: create a simple PHP script "test.php" that populates a table, using a given domain ID (modulo 3) and table name, in transactions of 1000 rows each. Each transaction takes about 40K of binlog space, i.e. about half of max_relay_log_size, which makes hitting a transaction split very likely:
<?php
$domain_id = $argv[1]%3 + 1;
$table = $argv[2];

echo "SET gtid_domain_id=$domain_id;";

echo "DROP TABLE IF EXISTS test.$table;";
echo "CREATE TABLE test.$table(id int primary key, msg varchar(100));";

$n=1;

while (true) {
  echo "BEGIN;\n";
  for ($i = 1; $i < 1000; $i++) {
    echo "INSERT INTO test.$table VALUES($n, MD5(RAND()));\n";
    $n++;
  }
  echo "COMMIT;\n";
}
Then run several instances of it in parallel:
for a in $(seq 20); do (php test.php $a t$a | mysql &) ; done
Now on the slave run this script to repeatedly stop and restart the SQL thread:
( sudo mysql <<< "select version(); start slave sql_thread"; while date; do
    sudo mysql <<< 'stop slave sql_thread; show slave status\G'
    gsp_stopped="$(sudo mysql <<< 'SELECT @@GLOBAL.gtid_slave_pos;')"; echo "$gsp_stopped";
    sudo mysql <<< 'start slave sql_thread;'; sleep 5;
    sss="$(sudo mysql <<< 'show slave status\G')"
    gsp="$(sudo mysql <<< 'SELECT @@GLOBAL.gtid_slave_pos')"; echo "$sss"; echo "$gsp";
    if test "$(awk '$1 == "Slave_SQL_Running:"{print $2}' <<< "$sss")" == "Yes"; then echo; echo "sleeping..."; sleep 5; else echo "things are broken..."; break; fi; done; ) | tee test.log
This usually fails with a duplicate key error after only a small number of iterations.
Attachments
- binlogs.tar.gz (599 kB)
- test.log (14 kB)
Issue Links
- relates to MDEV-9138 Relay log position corrupted with parallel replication after interrupting LOAD DATA on master (Closed)
Activity
Full gtid_% variables on slave after running into duplicate key error with attached logs:
MariaDB [test]> show variables like "gt%";
+------------------------+-------------------------------+
| Variable_name          | Value                         |
+------------------------+-------------------------------+
| gtid_binlog_pos        | 0-2-2116                      |
| gtid_binlog_state      | 0-2-2116                      |
| gtid_current_pos       | 1-1-10,0-2-2116,3-1-10,2-1-22 |
| gtid_domain_id         | 0                             |
| gtid_ignore_duplicates | OFF                           |
| gtid_seq_no            | 0                             |
| gtid_slave_pos         | 1-1-10,0-1-2,3-1-10,2-1-22    |
| gtid_strict_mode       | OFF                           |
+------------------------+-------------------------------+
8 rows in set (0.00 sec)
Already replicated tables and their sizes:
MariaDB [test]> show tables;
+----------------+
| Tables_in_test |
+----------------+
| t1             |
| t13            |
| t16            |
| t2             |
| t3             |
| t4             |
| t7             |
| t8             |
+----------------+
8 rows in set (0.00 sec)
table t1
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|     1998 |    1998 |
+----------+---------+

table t2
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|      999 |     999 |
+----------+---------+

table t3
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|     2997 |    2997 |
+----------+---------+

table t4
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|     4995 |    4995 |
+----------+---------+

table t7
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|     2997 |    2997 |
+----------+---------+

table t8
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|        0 |    NULL |
+----------+---------+

table t13
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|        0 |    NULL |
+----------+---------+

table t16
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|        0 |    NULL |
+----------+---------+
The correct position to continue from in hartmut-3510-relay-bin.000038 would have been 81695:
# at 81695
#160922  0:19:20 server id 1  end_log_pos 277188  GTID 2-1-23 cid=15432534
/*!100001 SET @@session.gtid_domain_id=2*//*!*/;
/*!100001 SET @@session.server_id=1*//*!*/;
/*!100001 SET @@session.gtid_seq_no=23*//*!*/;
BEGIN
This matches the information for domain 2 in gtid_current_pos / gtid_slave_pos, which shows GTID 2-1-22 as the last applied GTID for this domain.
I don't fully understand the code behind this yet, but my educated guess is that the last applied transaction GTID 2-1-22 is searched for first, and only then 2-1-23. As the beginning event of transaction 2-1-22 was in the previous, already purged, relay log hartmut-3510-relay-bin.000037, it can't be found in the current relay log, so the position is never set correctly but simply remains at the beginning of .000038, and replication wrongfully continues from there. (Or at least mysqlbinlog doesn't show any trace of GTID 2-1-22; I'm not sure whether its output is complete in that respect, though.)
This is a summary I wrote elsewhere last week. Unfortunately recent tests have shown that the proposed cause in the first paragraph can't be the complete story. More about this in the next comment ...
— Original Text —
When a transaction is spread across a relay log boundary,
the parallel replication relay log repositioning logic may end
up positioning into the middle of the transaction, and the GTID
"we've already processed this transaction" logic doesn't kick
in, as there is no GTID information at the beginning of the
relay log file, only at the beginning of the next transaction.
Now, as far as I can tell, we have three ways to solve this:
1) Change the relay log copying logic so that transactions will
never be split up, similar to how this is handled
on the master side with binlogs. With this, the max relay log
size will no longer be a strict limit, and relay logs may
become as big as needed for large transactions. This doesn't
cause any new file system space issues though, as right now
all relay log parts that cover a large transaction can't be
purged until transaction commit anyway.
It will require either a) some extra logic in the IO thread to be
transaction aware, especially to deal with FLUSH
RELAY LOGS, or b) copying binlogs to relay logs
on a 1:1 file-by-file basis, totally ignoring FLUSH requests.
2) Change the repositioning logic. Right now it simply repositions
to the beginning of the first not completely processed relay
log, assuming that already processed changes will be filtered
out and not re-run due to the GTID "already seen" logic. This fails
for events that are not given a GTID (because it was already
given in the previous relay log file).
So: extend the repositioning logic to not simply rewind to
the beginning of the relay log, but to then roll forward over
events until seeing a new transaction start with an explicit GTID,
and take the position of this event as the new start position.
3) Change the GTID filtering logic to completely ignore binlog/relay log
events for which the GTID is not known, either only in parallel
mode, or maybe even in all modes.
Further testing has shown that the "how to reproduce" instructions from above only lead to duplicate key errors on sufficiently capable hardware. E.g. I can reproduce this very quickly when replicating from my desktop (6 cores) to my laptop (4 cores), or when replicating between my two public web servers (8 cores each, or actually 4 real ones plus hyper-threading).
When trying the same steps, but using two Virtualbox VMs as master and slave on my 6 core machine, giving each VM 3 virtual cores, the test can run for hours without failing at all.
So this doesn't only seem to be about relay log splitting at the "wrong" place; there also seems to be some race condition that needs to be hit ...
Ouch, as Hartmut says this is the following code in handle_slave_sql():
  /*
    Load the starting GTID position, so that we can skip already applied
    GTIDs when we restart the SQL thread. And set the start position in
    the relay log back to a known safe place to start (prior to any not
    yet applied transaction in any domain).
  */
  rli->restart_gtid_pos.load(rpl_global_gtid_slave_state, NULL, 0);
  if ((ir= rli->inuse_relaylog_list))
  {
    rpl_gtid *gtid= ir->relay_log_state;
    uint32 count= ir->relay_log_state_count;
    while (count > 0)
    {
      process_gtid_for_restart_pos(rli, gtid);
      ++gtid;
      --count;
    }
    strmake_buf(rli->group_relay_log_name, ir->name);
    rli->group_relay_log_pos= BIN_LOG_HEADER_SIZE;
    rli->relay_log_state.load(ir->relay_log_state, ir->relay_log_state_count);
  }
So it always restarts at the beginning of a relay log, which can easily be
in the middle of an event group. That must be just plainly wrong.
Probably proposed solution (3) should be the right fix. It would be nice to
set a better restart position that does not point inside an event group, but
I am not sure we have that information available. There might not even be
any GTID event at all in a given relay log file.
Kristian: do you have an idea though why this only seems to be reproducible on sufficiently powerful hardware (e.g. only when replicating between real machines, not between Virtualbox VMs)?
If it were only about the relay log being split in the middle of an event group, I'd expect it to happen regardless of "machine power", and to only be related to event group sizes vs. max relay log size?
It's hard to say why you could not reproduce; your test setup reproduced
easily for me just on my laptop.
Here is a mysql-test-run testcase that seems to reproduce it reliably.
- Kristian.
--source include/have_innodb.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc

# Test various aspects of parallel replication.

--connection server_2
SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=10;
SET @old_max_relay= @@GLOBAL.max_relay_log_size;
SET GLOBAL max_relay_log_size = 4096;
CHANGE MASTER TO master_use_gtid=slave_pos;
--source include/start_slave.inc

--connection server_1
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a int PRIMARY KEY, b VARCHAR(100)) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1, "a");
--save_master_pos

--connection server_2
--sync_with_master

--echo *** Create a long transaction that will span a relay log file. ***
--connection server_1

# Add some transactions in separate domains, that will cause the need to
# have a multi-valued restart position in the relay log for the SQL thread.
SET @old_domain= @@gtid_domain_id;
SET gtid_domain_id=10;
INSERT INTO t1 VALUES (10000, "domain 10");
SET gtid_domain_id=20;
INSERT INTO t1 VALUES (20000, "domain 20");
SET gtid_domain_id=@old_domain;

BEGIN;
--echo [lots of inserts omitted]
--disable_query_log
--let $count = 500
while ($count) {
  eval INSERT INTO t1 VALUES (1000+$count, REPEAT("hulubulu??!?", 8));
  dec $count;
}
--enable_query_log
COMMIT;

--save_master_pos

--connection server_2
--sync_with_master

--connection server_1
# Now do another one, to make the inuse_relaylog proceed to somewhere inside
# the first large transaction.

BEGIN;
--echo [lots of inserts omitted]
--disable_query_log
--let $count = 500
while ($count) {
  eval INSERT INTO t1 VALUES (2000+$count, REPEAT("hulubulu??!?", 8));
  dec $count;
}
--enable_query_log
COMMIT;

--save_master_pos

--connection server_2
--sync_with_master

# Stop and restart the SQL thread only.
# The bug was that the SQL thread would restart at the start
# of a relay log file, which could be in the middle of an event group.
# This way, part of that event group could be wrongly re-applied.

--source include/stop_slave_sql.inc
START SLAVE SQL_THREAD;
--source include/wait_for_slave_to_start.inc

--connection server_1
INSERT INTO t1 VALUES (100001, "More stuffs.");
INSERT INTO t1 VALUES (100002, "And even more");
--save_master_pos

--connection server_2
--sync_with_master
SELECT * FROM t1 WHERE a >= 100000 ORDER BY a;

# Clean up.
--connection server_2
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=@old_parallel_threads;
SET GLOBAL max_relay_log_size= @old_max_relay;
--source include/start_slave.inc

--connection server_1
DROP TABLE t1;

--source include/rpl_end.inc
Fix:
http://lists.askmonty.org/pipermail/commits/2016-November/010061.html
Lixun, do you want to review the patch before I push it into 10.0?
- Kristian.
Looks good. I ran my test setup for almost two days without any replication breaks.
knielsen
I checked your patch; the main idea is to ignore the first part without a beginning GTID event in the first relay log, right?
I'm not sure if this logic will always be right under parallel replication.
Do you think keeping the previous relay log file is a better way?
I mean, if Event[0] is split into relay log File[0] and File[1], and the GTID moved to Event[1], should we keep File[0] until Event[1] has been applied? Then parallel replication can always get the start position in an existing relay log file.
> I checked your patch; the main idea is to ignore the first part without a
> beginning GTID event in the first relay log, right?
The way the code works is, it scans over the earlier GTIDs in the relay log
file. If the GTID was already applied, it sets the GTID_SKIP_TRANSACTION (or
GTID_SKIP_STANDALONE) flag, so that the associated events are skipped.
But the code starts scanning at the beginning of a relay log, which can be
in the middle of an event group that was already applied. So it does not see
the GTID for the first event group, thus the gtid_skip_flag needs to be
explicitly initialised to GTID_SKIP_TRANSACTION.
We can be sure that any GTID from an earlier relay log was already applied,
so that skipping it is correct. This is ensured by the inuse_relaylog
mechanism. See struct inuse_relaylog in sql/rpl_rli.h.
An alternative might be to keep track of the position of the first GTID (or
other event outside of an event group), and then start scanning from that
point. But then we would also need to handle the case where a relay log file
contains no GTID at all, because both the start and the end of a big
transaction are outside that file. I am not sure that is any simpler or better.
> I'm not sure if this logic will always be right under parallel replication.
Yes. This code which handles restarting the SQL thread is really
complicated, and I was always very unhappy about it. There have been a
number of bugs related to it. I would certainly welcome any suggestions for
a simpler solution.
I think the better overall approach would be for the SQL thread to be able to
correctly start in the relay log file from a GTID position, using a
mechanism similar to how this works when mysql_binlog_send() runs on the
master. Then, if relay log recovery were implemented on the slave similar to
binlog recovery on the master, we would also be able to re-use the relay log
after a slave server restart, without having to re-fetch it from the
master. But that is a much bigger change, probably too much for 10.0.
(And it would still be necessary to handle partial event groups at the start
of the relay log.)
> Do you think keeping the previous relay log file is a better way? I mean,
> if Event[0] is split into relay log File[0] and File[1], and the GTID moved
> to Event[1], should we keep File[0] until Event[1] has been applied? Then
> parallel replication can always get the start position in an existing relay
> log file.
I don't see how this will help. There might still be another earlier
Event[-1] in File[0] that was split and does not have its beginning GTID in
File[0]. No matter what, I think we need to be able to handle the
possibility that there is a partial event group at the start of a relay log
file without the starting GTID.
Thanks,
- Kristian.
Hi Kristian,
I think you can merge your code into the master branch.
Thanks,
Lixun
> I think you can merge your code into the master branch.
Ok, will do.
- Kristian.
Pushed to 10.0.
Thanks to Hartmut for an excellent bug report. Detailed analysis, and a
stand-alone testcase to reproduce a complex failure scenario.
Actually most of it was originally done by Jean-François; the only part really done by me was the PHP generator script that made hitting the situation described by J-F much more likely ...
We assumed that this might already have been fixed with
http://lists.askmonty.org/pipermail/commits/2016-January/008828.html
but it turns out that it isn't ...