When doing STOP SLAVE SQL_THREAD and START SLAVE SQL_THREAD repeatedly parallel replication sometimes fails on START with a duplicate key error. SHOW SLAVE STATUS then always shows a relay log position at the beginning of a log file, e.g. "Relay_Log_Pos: 4". This only seems to happen if a large transaction is split across more than one relay log file.
How to reproduce:
Master config:
[mysqld]
|
server-id=1
|
log-bin=master-bin
|
binlog-format=statement
|
Slave config:
[mysqld]
|
server-id=2
|
slave_parallel_threads=20
|
max_relay_log_size=98304
|
Create replication user on the master:
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
|
GRANT ALL PRIVILEGES ON *.* TO 'repl'@'%';
|
RESET MASTER;
|
Start slave:
SET GLOBAL gtid_slave_pos='0-1-2';
|
CHANGE MASTER TO Master_host='mymasterip', Master_user='repl', Master_password='password', master_use_gtid=slave_pos;
|
START SLAVE;
|
On the master: create a simple PHP script "test.php" to populate a table using given domain ID (modulo 3) and table name in transactions of 1000 rows each (transactions will take about 40K binlog space each, so talking about half of the max_relaylog_size which makes hitting a transaction split very likely)
<?php
|
|
$domain_id = $argv[1]%3 + 1;
|
$table = $argv[2];
|
|
echo "SET gtid_domain_id=$domain_id;";
|
|
echo "DROP TABLE IF EXISTS test.$table;";
|
echo "CREATE TABLE test.$table(id int primary key, msg varchar(100));";
|
|
$n=1;
|
|
while (true) {
|
echo "BEGIN;\n";
|
for ($i = 1; $i < 1000; $i++) {
|
echo "INSERT INTO test.$table VALUES($n, MD5(RAND()));\n";
|
$n++;
|
}
|
echo "COMMIT;\n";
|
}
|
Then run several instances of it in parallel:
for a in $(seq 20); do (php test.php $a t$a | mysql &) ; done
|
Now on the slave run this script to repeatedly stop and restart the SQL thread:
( sudo mysql <<< "select version(); start slave sql_thread"; while date; do
|
sudo mysql <<< 'stop slave sql_thread; show slave status\G'
|
gsp_stopped="$(sudo mysql <<< 'SELECT @@GLOBAL.gtid_slave_pos;')"; echo "$gsp_stopped";
|
sudo mysql <<< 'start slave sql_thread;'; sleep 5;
|
sss="$(sudo mysql <<< 'show slave status\G')"
|
gsp="$(sudo mysql <<< 'SELECT @@GLOBAL.gtid_slave_pos')"; echo "$sss"; echo "$gsp";
|
if test "$(awk '$1 == "Slave_SQL_Running:"{print $2}' <<< "$sss")" == "Yes"; then echo; echo "sleeping..."; sleep 5; else echo "things are broken..."; break; fi; done; ) | tee test.log
|
This usually fails with a duplicate key error after a small number of iterations already
- relates to
-
MDEV-9138
Relay log position corrupted with parallel replication after interrupting LOAD DATA on master
-
-
Closed
{"report":{"fcp":1511.2999999523163,"ttfb":349.2999999523163,"pageVisibility":"visible","entityId":58172,"key":"jira.project.issue.view-issue","isInitial":true,"threshold":1000,"elementTimings":{},"userDeviceMemory":8,"userDeviceProcessors":64,"apdex":0.5,"journeyId":"d82618ca-cc93-4a3b-86e6-98d0b328fff4","navigationType":0,"readyForUser":1639.1999999284744,"redirectCount":0,"resourceLoadedEnd":2317.0999999046326,"resourceLoadedStart":371.59999990463257,"resourceTiming":[{"duration":429.60000002384186,"initiatorType":"link","name":"https://jira.mariadb.org/s/2c21342762a6a02add1c328bed317ffd-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/css/_super/batch.css","startTime":371.59999990463257,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":371.59999990463257,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":801.1999999284744,"responseStart":0,"secureConnectionStart":0},{"duration":526.6999999284744,"initiatorType":"link","name":"https://jira.mariadb.org/s/7ebd35e77e471bc30ff0eba799ebc151-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/css/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.css?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&slack-enabled=true&whisper-enabled=true","startTime":371.89999997615814,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":371.89999997615814,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":898.5999999046326,"responseStart":0,"secureConnectionStart":0},{"duration":598.5999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/s/0917945aaa57108d00c5076fea35e069-CDN/lu2cib/820016/12ta74/0a8bac35585be7fc6c9cc5a0464cd4cf/_/download/contextbatch/js/_super/batch.js?locale=en","startTime":372,"connectEnd":372,"connectStart":372,"domainLookupEnd":372,"domainLookupStart":372,"fetchStart":372,"redirectEnd":0,"redirectStart":0,"requestStart":372,"responseEnd":970.5999999046326,"responseStart":970.5999999046326,"secureConnectionStart":372},{"duration":721.3999999761581,"initiatorType":"script","name":"https://jira.mariadb.org/s/2d8175ec2fa4c816e8023260bd8c1786-CDN/lu2cib/820016/12ta74/494e4c556ecbb29f90a3d3b4f09cb99c/_/download/contextbatch/js/jira.browse.project,project.issue.navigator,jira.view.issue,jira.general,jira.global,atl.general,-_super/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&locale=en&slack-enabled=true&whisper-enabled=true","startTime":372.1999999284744,"connectEnd":372.1999999284744,"connectStart":372.1999999284744,"domainLookupEnd":372.1999999284744,"domainLookupStart":372.1999999284744,"fetchStart":372.1999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":372.1999999284744,"responseEnd":1093.5999999046326,"responseStart":1093.5999999046326,"secureConnectionStart":372.1999999284744},{"duration":736.2999999523163,"initiatorType":"script","name":"https://jira.mariadb.org/s/a9324d6758d385eb45c462685ad88f1d-CDN/lu2cib/820016/12ta74/c92c0caa9a024ae85b0ebdbed7fb4bd7/_/download/contextbatch/js/atl.global,-_super/batch.js?locale=en","startTime":372.39999997615814,"connectEnd":372.39999997615814,"connectStart":372.39999997615814,"domainLookupEnd":372.39999997615814,"domainLookupStart":372.39999997615814,"fetchStart":372.39999997615814,"redirectEnd":0,"redirectStart":0,"requestStart":372.39999997615814,"responseEnd":1108.6999999284744,"responseStart":1108.6999999284744,"secureConnectionStart":372.39999997615814},{"duration":746.1999999284744,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-en/jira.webresources:calendar-en.js","startTime":372.5,"connectEnd":372.5,"connectStart":372.5,"domainLookupEnd":372.5,"domainLookupStart":372.5,"fetchStart":372.5,"redirectEnd":0,"redirectStart":0,"requestStart":372.5,"responseEnd":1118.6999999284744,"responseStart":1118.6999999284744,"secureConnectionStart":372.5},{"duration":756.5,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:calendar-localisation-moment/jira.webresources:calendar-localisation-moment.js","startTime":372.6999999284744,"connectEnd":372.6999999284744,"connectStart":372.6999999284744,"domainLookupEnd":372.6999999284744,"domainLookupStart":372.6999999284744,"fetchStart":372.6999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":372.6999999284744,"responseEnd":1129.1999999284744,"responseStart":1129.1999999284744,"secureConnectionStart":372.6999999284744},{"duration":761.7000000476837,"initiatorType":"link","name":"https://jira.mariadb.org/s/b04b06a02d1959df322d9cded3aeecc1-CDN/lu2cib/820016/12ta74/a2ff6aa845ffc9a1d22fe23d9ee791fc/_/download/contextbatch/css/jira.global.look-and-feel,-_super/batch.css","startTime":372.7999999523163,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":372.7999999523163,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1134.5,"responseStart":0,"secureConnectionStart":0},{"duration":756.6000000238419,"initiatorType":"script","name":"https://jira.mariadb.org/rest/api/1.0/shortcuts/820016/47140b6e0a9bc2e4913da06536125810/shortcuts.js?context=issuenavigation&context=issueaction","startTime":373.09999990463257,"connectEnd":373.09999990463257,"connectStart":373.09999990463257,"domainLookupEnd":373.09999990463257,"domainLookupStart":373.09999990463257,"fetchStart":373.09999990463257,"redirectEnd":0,"redirectStart":0,"requestStart":373.09999990463257,"responseEnd":1129.6999999284744,"responseStart":1129.6999999284744,"secureConnectionStart":373.09999990463257},{"duration":761.3999999761581,"initiatorType":"link","name":"https://jira.mariadb.org/s/3ac36323ba5e4eb0af2aa7ac7211b4bb-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.css?jira.create.linked.issue=true","startTime":373.1999999284744,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":373.1999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1134.5999999046326,"responseStart":0,"secureConnectionStart":0},{"duration":757.1999999284744,"initiatorType":"script","name":"https://jira.mariadb.org/s/5d5e8fe91fbc506585e83ea3b62ccc4b-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.js?jira.create.linked.issue=true&locale=en","startTime":373.39999997615814,"connectEnd":373.39999997615814,"connectStart":373.39999997615814,"domainLookupEnd":373.39999997615814,"domainLookupStart":373.39999997615814,"fetchStart":373.39999997615814,"redirectEnd":0,"redirectStart":0,"requestStart":373.39999997615814,"responseEnd":1130.5999999046326,"responseStart":1130.5999999046326,"secureConnectionStart":373.39999997615814},{"duration":1573.1999999284744,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":381.39999997615814,"connectEnd":381.39999997615814,"connectStart":381.39999997615814,"domainLookupEnd":381.39999997615814,"domainLookupStart":381.39999997615814,"fetchStart":381.39999997615814,"redirectEnd":0,"redirectStart":0,"requestStart":381.39999997615814,"responseEnd":1954.5999999046326,"responseStart":1954.5999999046326,"secureConnectionStart":381.39999997615814},{"duration":1934.6999999284744,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":382.39999997615814,"connectEnd":382.39999997615814,"connectStart":382.39999997615814,"domainLookupEnd":382.39999997615814,"domainLookupStart":382.39999997615814,"fetchStart":382.39999997615814,"redirectEnd":0,"redirectStart":0,"requestStart":382.39999997615814,"responseEnd":2317.0999999046326,"responseStart":2317,"secureConnectionStart":382.39999997615814},{"duration":806.5,"initiatorType":"xmlhttprequest","name":"https://jira.mariadb.org/rest/webResources/1.0/resources","startTime":1156.7999999523163,"connectEnd":1156.7999999523163,"connectStart":1156.7999999523163,"domainLookupEnd":1156.7999999523163,"domainLookupStart":1156.7999999523163,"fetchStart":1156.7999999523163,"redirectEnd":0,"redirectStart":0,"requestStart":1156.7999999523163,"responseEnd":1963.2999999523163,"responseStart":1963.2999999523163,"secureConnectionStart":1156.7999999523163},{"duration":817.6999999284744,"initiatorType":"script","name":"https://www.google-analytics.com/analytics.js","startTime":1502.8999999761581,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":1502.8999999761581,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":2320.5999999046326,"responseStart":0,"secureConnectionStart":0}],"fetchStart":0,"domainLookupStart":0,"domainLookupEnd":0,"connectStart":0,"connectEnd":0,"requestStart":152,"responseStart":349,"responseEnd":382,"domLoading":369,"domInteractive":2344,"domContentLoadedEventStart":2344,"domContentLoadedEventEnd":2396,"domComplete":3295,"loadEventStart":3295,"loadEventEnd":3295,"userAgent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)","marks":[{"name":"bigPipe.sidebar-id.start","time":2321.5},{"name":"bigPipe.sidebar-id.end","time":2322.2999999523163},{"name":"bigPipe.activity-panel-pipe-id.start","time":2322.5},{"name":"bigPipe.activity-panel-pipe-id.end","time":2324.0999999046326},{"name":"activityTabFullyLoaded","time":2413.6999999284744}],"measures":[],"correlationId":"9054f3f8c4d91a","effectiveType":"4g","downlink":9,"rtt":0,"serverDuration":104,"dbReadsTimeInMs":13,"dbConnsTimeInMs":22,"applicationHash":"9d11dbea5f4be3d4cc21f03a88dd11d8c8687422","experiments":[]}}
> I checked your patch, the main idea is ignore the first part without a
> begining GTID event in the first relay log, right?
The way the code works is, it scans over the earlier GTIDs in the relay log
file. If the GTID was already applied, it sets the GTID_SKIP_TRANSACTION (or
GTID_SKIP_STANDALONE) flag, so that the associated events are skipped.
But the code starts scanning at the beginning of a relay log, which can be
in the middle of an event group that was already applied. So it does not see
the GTID for the first event group, thus the gtid_skip_flag needs to be
explicitly initialised to GTID_SKIP_TRANSACTION.
We can be sure that any GTID from an earlier relay log was already applied,
so that skipping it is correct. This is ensured by the inuse_relaylog
mechanism. See struct inuse_relaylog in sql/rpl_rli.h.
An alternative might be to keep track of the position of the first GTID (or
other event outside of an event group), and then start scanning from that
point. Then we need to handle also the case where a relay log file contains
no GTID at all because both the start and the end of a big transaction is
outside that file. I am not sure that is any simpler or better.
> I'm not sure if this logic always be right under parallels replication.
Yes. This code which handles restarting the SQL thread is really
complicated, and I was always very unhappy about it. There have been a
number of bugs related to it. I would certainly welcome any suggestions for
a simpler solution.
I think the better overall approach would be that the SQL thread was able to
correctly start in the relay log file from a GTID position, using a
mechanism similar to how this works when mysql_binlog_send() runs on the
master. Then, if relay log recovery was implemented on the slave similar to
binlog recovery on the master, we would also be able to re-use the relay log
after a slave server restart, without having to re-fetch it from the
master. But that is a much bigger change, probably too much for 10.0.
(And it would still be necessary to handle partiel event groups at the start
of the relay log).
> Do you think keep the previous relay log file is a better way? I mean, if
> Event[0] be splitted into realy log File[0] and File[1], and GTID moved to
> Evnet[1], should we keep the File[0] until Evnet[1] be applied? Then
> parallels replication can always get the start postion in the exist relay
> log file.
I don't see how this will help. There might still be another earlier
Event[-1] in File[0] that was split and does not have its beginning GTID in
File[0]. No matter what, I think we need to be able to handle the
possibility that there is a partial event group at the start of a relay log
file without the starting GTID.
Thanks,