Thread 5 is blocked by the global read lock, but FTWRL never completes and remains waiting for the commit lock; the replication thread stays blocked until FTWRL is killed
Kristian Nielsen
added a comment - Current idea:
When FTWRL starts, it first checks all parallel replication worker
threads. It finds the most recent GTID started by any of them. It then sets
a flag to tell the threads not to start on any newer GTIDs, and then waits
for all earlier GTIDs to fully commit. It also sets a flag to tell START
SLAVE, STOP SLAVE, and the SQL thread to not start any new slave activity.
Once all worker threads have reached their designated point, FTWRL continues
to take the global read lock. Once that is obtained, it clears the flags and
signals worker threads and other slave code that they can proceed. At this
point, the lock is held, so no real activity will be possible until the lock
is released with UNLOCK TABLES.
This should hopefully fix the deadlock; at least I got Elena's test case to
pass with a preliminary patch along these lines.
Some care will probably be needed to guard against other deadlocks involving
concurrent START SLAVE / STOP SLAVE; hopefully I can get that solved.
Kristian Nielsen
added a comment (edited) - I have a fix for this now (that just removes the deadlock, without breaking
existing backup tools or otherwise changing user-visible behaviour).
It is in three parts. The actual fix is the second patch. The first patch is
just unrelated refactoring of the parallel replication code to make the
second patch cleaner. The third patch is a test case.
http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
http://lists.askmonty.org/pipermail/commits/2015-June/008060.html
http://lists.askmonty.org/pipermail/commits/2015-June/008061.html
Elena: You asked some days ago - this patch should be ready, if you want to
apply it in further testing.
EDIT: Updated with newest version of patches after fixing MDEV-8318.
Michael Widenius
added a comment - Review of MDEV-7818: Deadlock occurring with parallel replication and FTWRL
http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
In this patch you added:
+/*
+ Do not start parallel execution of this event group until all prior groups
+ have reached the commit phase that are not safe to run in parallel with.
+*/
+static bool
+do_gco_wait(rpl_group_info *rgi, group_commit_orderer *gco,
+            bool *did_enter_cond, PSI_stage_info *old_stage)
You did however remove this comment:
/*
  Register ourself to wait for the previous commit, if we need to do
  such registration and that previous commit has not already
  occured.

  Also do not start parallel execution of this event group until all
  prior groups have reached the commit phase that are not safe to run
  in parallel with.
*/
Why did you remove the first part of the comment?
Is that not true anymore?
In sql/rpl_parallel.h you added:
+ /*
+ The largest sub_id that has started its transaction. Protected by
+ LOCK_parallel_entry.
+
+ (Transactions can start out-of-order, so this value signifies that no
+ transactions with larger sub_id have started, but not necessarily that all
+ transactions with smaller sub_id have started).
+ */
+ uint64 largest_started_sub_id;
But this is not used anywhere.
http://lists.askmonty.org/pipermail/commits/2015-June/008060.html
+static void
+do_ftwrl_wait(rpl_group_info *rgi,
+              bool *did_enter_cond, PSI_stage_info *old_stage)
+{
+  THD *thd= rgi->thd;
+  rpl_parallel_entry *entry= rgi->parallel_entry;
+  uint64 sub_id= rgi->gtid_sub_id;
+
+  mysql_mutex_assert_owner(&entry->LOCK_parallel_entry);
Would be good to have a comment for the following test, like:
/*
  If ftwrl is active, wait until ftwrl is finished if we are a
  new transaction that started after the ftwrl command was given
*/
Note that if you would store MAX_INT in entry->pause_sub_id when it's not
active, then the following test would be much easier:
+  if (unlikely(entry->pause_sub_id > 0) && sub_id > entry->pause_sub_id)
+  {
+    thd->ENTER_COND(&entry->COND_parallel_entry, &entry->LOCK_parallel_entry,
+                    &stage_waiting_for_ftwrl, old_stage);
As you don't have an EXIT_COND() matching the above ENTER_COND, please
add a comment that EXIT_COND() will be called by the caller
(handle_rpl_parallel_thread())
<cut>
Wouldn't a better name for the following function be:
rpl_unpause_after_ftwrl(THD *thd) ?
+void
+rpl_unpause_for_ftwrl(THD *thd)
+{
+  uint32 i;
+  rpl_parallel_thread_pool *pool= &global_rpl_thread_pool;
+
+  DBUG_ASSERT(pool->busy);
+
+  for (i= 0; i < pool->count; ++i)
+  {
+    rpl_parallel_entry *e;
+    rpl_parallel_thread *rpt= pool->threads[i];
+
+    mysql_mutex_lock(&rpt->LOCK_rpl_thread);
+    if (!rpt->current_owner)
+    {
+      mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+      continue;
Can the above thread ever have e->pause_sub_id != 0 ?
For example if the thread did have a current_owner during pause
and not an owner now ?
+    }
+    e= rpt->current_entry;
+    mysql_mutex_lock(&e->LOCK_parallel_entry);
+    mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+    e->pause_sub_id= 0;
+    mysql_cond_broadcast(&e->COND_parallel_entry);
Don't you need to unlock e->LOCK_parallel_entry here?
+  }
+  mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
+  pool_mark_not_busy(pool);
+  mysql_cond_broadcast(&pool->COND_rpl_thread_pool);
+  mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
As pool_mark_not_busy() is already doing the broadcast, you can remove
it from above.
If we always have to take the above mutexes when calling pool_mark_not_busy(),
it would be better to move the lock and unlock inside pool_mark_not_busy().
<cut>
+rpl_pause_for_ftwrl(THD *thd)
+{
<cut>
+  for (i= 0; i < pool->count; ++i)
+  {
+    PSI_stage_info old_stage;
+    rpl_parallel_entry *e;
+    rpl_parallel_thread *rpt= pool->threads[i];
+
+    mysql_mutex_lock(&rpt->LOCK_rpl_thread);
+    if (!rpt->current_owner)
+    {
+      mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+      continue;
+    }
+    e= rpt->current_entry;
+    mysql_mutex_lock(&e->LOCK_parallel_entry);
+    mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+    ++e->need_sub_id_signal;
+    if (!e->pause_sub_id)
+      e->pause_sub_id= e->largest_started_sub_id;
Why the above test?
In which case can pause_sub_id be != 0 here ?
If it's != 0, what does it mean?
+    thd->ENTER_COND(&e->COND_parallel_entry, &e->LOCK_parallel_entry,
+                    &stage_waiting_for_ftwrl_threads_to_pause, &old_stage);
+    while (e->last_committed_sub_id < e->pause_sub_id && !err)
+    {
+      if (thd->check_killed())
+      {
+        thd->send_kill_message();
+        err= 1;
+        break;
+      }
+      mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
+    };
+    --e->need_sub_id_signal;
+    thd->EXIT_COND(&old_stage);
+    if (err)
+      break;
+  }
+
+  if (err)
+    rpl_unpause_for_ftwrl(thd);
+  return err;
+}
@@ -1106,7 +1302,14 @@ rpl_parallel_change_thread_count(rpl_parallel_thread_pool *pool,
*/
   for (i= 0; i < pool->count; ++i)
   {
-    rpl_parallel_thread *rpt= pool->get_thread(NULL, NULL);
+    rpl_parallel_thread *rpt;
+
+    mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
+    while ((rpt= pool->free_list) == NULL)
+      mysql_cond_wait(&pool->COND_rpl_thread_pool, &pool->LOCK_rpl_thread_pool);
+    pool->free_list= rpt->next;
+    mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
+    mysql_mutex_lock(&rpt->LOCK_rpl_thread);
I see you used the original code from get_thread(), but you don't reset
rpt->current_owner or rpt->current_entry anymore. Is that not needed?
@@ -1496,8 +1703,14 @@ rpl_parallel_thread_pool::get_thread(rpl_parallel_thread **owner,
   rpl_parallel_thread *rpt;
   mysql_mutex_lock(&LOCK_rpl_thread_pool);
-  while ((rpt= free_list) == NULL)
+  for (;;)
+  {
+    while (unlikely(busy))
+      mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);
+    if ((rpt= free_list) != NULL)
+      break;
     mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);
+  }
Why not use:
while (unlikely(busy) || !(rpt= free_list))
  mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);
--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -4259,6 +4259,17 @@ case SQLCOM_PREPARE:
break;
}
+  if (lex->type & REFRESH_READ_LOCK)
+  {
+    /*
+      We need to pause any parallel replication slave workers during FLUSH
+      TABLES WITH READ LOCK. Otherwise we might cause a deadlock, as
+      worker threads can run in arbitrary order but need to commit in a
+      specific given order.
+    */
+    if (rpl_pause_for_ftwrl(thd))
+      goto error;
+  }
/*
reload_acl_and_cache() will tell us if we are allowed to write to the
binlog or not.
@@ -4289,6 +4300,8 @@ case SQLCOM_PREPARE:
if (!res)
my_ok(thd);
}
+  if (lex->type & REFRESH_READ_LOCK)
+    rpl_unpause_for_ftwrl(thd);
break;
}
Why have the above code in sql_parse.cc, instead of in
reload_acl_and_cache() around the call to
'thd->global_read_lock.lock_global_read_lock(thd)'?
This would avoid calling rpl_pause_for_ftwrl(thd) if tables are locked
and in other possible error cases detected early.
On the other hand, you may also need to take care of the code just above
that calls
flush_tables_with_read_lock(thd, all_tables)
This happens when you give tables as an argument to FLUSH TABLES WITH READ LOCK.
I would assume that this can also cause a deadlock.
The easy fix is probably to add rpl_pause / rpl_unpause also to this
function, just before it calls lock_table_names().
------------
http://lists.askmonty.org/pipermail/commits/2015-June/008061.html
I didn't see a test doing FLUSH TABLES WITH READ LOCK twice in a row.
This should detect if there ever was any issue with
rpl_unpause_for_ftwrl() not being properly called after
rpl_pause...().
Kristian Nielsen
added a comment - Pushed to 10.0 and 10.1:
http://lists.askmonty.org/pipermail/commits/2015-November/008627.html
http://lists.askmonty.org/pipermail/commits/2015-November/008628.html
People: Kristian Nielsen, Guillaume Lefranc