[MDEV-7818] Deadlock occurring with parallel replication and FTWRL Created: 2015-03-23  Updated: 2016-12-01  Resolved: 2015-11-14

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0
Fix Version/s: 10.0.23, 10.1.9

Type: Bug Priority: Critical
Reporter: Guillaume Lefranc Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: parallelslave, verified

Issue Links:
Relates
relates to MDEV-8318 Assertion `!pool->busy' failed in poo... Closed
Sprint: 10.0.22, 10.1.9-1, 10.1.9-2

 Description   

Thread 5 is blocked by the global read lock, but FTWRL never completes and stays waiting for the commit lock; the replication threads stay blocked forever until FTWRL is killed.

MariaDB(db-02)[(none)]> show processlist\G
*************************** 1. row ***************************
      Id: 3
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 12612
   State: Waiting for prior transaction to start commit before starting next transaction
    Info: NULL
Progress: 0.000
*************************** 2. row ***************************
      Id: 4
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 12612
   State: Waiting for prior transaction to commit
    Info: NULL
Progress: 0.000
*************************** 3. row ***************************
      Id: 5
    User: system user
    Host: 
      db: crypto_data
 Command: Connect
    Time: 12612
   State: Waiting for global read lock
    Info: INSERT INTO `market_orders_ok` (`label`, `marketid`, `ordertype`, `price`, `quantity`, `total`) VALU
Progress: 0.000
*************************** 4. row ***************************
      Id: 6
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 12612
   State: Waiting for prior transaction to start commit before starting next transaction
    Info: NULL
Progress: 0.000
*************************** 5. row ***************************
      Id: 7
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 1651381
   State: Waiting for master to send event
    Info: NULL
Progress: 0.000
*************************** 6. row ***************************
      Id: 8
    User: system user
    Host: 
      db: NULL
 Command: Connect
    Time: 13263
   State: Waiting for room in worker thread event queue
    Info: NULL
Progress: 0.000
*************************** 7. row ***************************
      Id: 39
    User: monitoring
    Host: localhost
      db: NULL
 Command: Sleep
    Time: 59
   State: 
    Info: NULL
Progress: 0.000
*************************** 8. row ***************************
      Id: 1568378
    User: tanje6acfc0c1213cfb63
    Host: localhost
      db: NULL
 Command: Sleep
    Time: 3
   State: 
    Info: NULL
Progress: 0.000
*************************** 9. row ***************************
      Id: 27871929
    User: backup
    Host: localhost
      db: NULL
 Command: Query
    Time: 12612
   State: Waiting for commit lock
    Info: FLUSH TABLES WITH READ LOCK
Progress: 0.000
*************************** 10. row ***************************
      Id: 28082858
    User: tanje6acfc0c1213cfb63
    Host: localhost
      db: NULL
 Command: Query
    Time: 0
   State: init
    Info: show processlist
Progress: 0.000
10 rows in set (0.00 sec)



 Comments   
Comment by Elena Stepanova [ 2015-03-24 ]

I am able to reproduce it with this (very crude and not very reliable) test:

--source include/master-slave.inc
--source include/have_binlog_format_statement.inc
--source include/have_innodb.inc
 
create table t1 (i int) engine=InnoDB;
create table t2 (i int) engine=InnoDB;
create table t3 (i int) engine=InnoDB;
 
--sync_slave_with_master
install soname 'metadata_lock_info';
 
set global binlog_commit_wait_count=2, binlog_commit_wait_usec=2000000;
--source include/stop_slave.inc
set global slave_parallel_threads=4;
 
--connection master
set global binlog_commit_wait_count=2, binlog_commit_wait_usec=1000000;
begin;
insert into t3 values (1),(2),(3),(4);
insert into t3 select * from t3 /* 1 */;
insert into t3 select * from t3 /* 2 */;
insert into t3 select * from t3 /* 3 */;
insert into t3 select * from t3 /* 4 */;
insert into t3 select * from t3 /* 5 */;
insert into t3 select * from t3 /* 6 */;
insert into t3 select * from t3 /* 7 */;
insert into t3 select * from t3 /* 8 */;
insert into t3 select * from t3 /* 9 */;
insert into t3 select * from t3 /* 10 */;
insert into t3 select * from t3 /* 11 */;
insert into t3 select * from t3 /* 12 */;
insert into t1 values (1),(2),(3),(4);
insert into t1 select * from t1;
--send commit
 
--connection master1
insert into t2 values (1),(2);
 
--connection master
--reap
 
--connection slave
--source include/start_slave.inc
--sleep 1
--send FLUSH TABLES WITH READ LOCK
 
--connection slave1
--let $show_statement= SHOW PROCESSLIST
--let $field= State
--let $condition= 'Waiting for commit lock'
--source include/wait_show_condition.inc
 
SHOW PROCESSLIST;
select * from information_schema.metadata_lock_info;
 
--connection slave
--reap
UNLOCK TABLES;

Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost:37845	test	Sleep	20		NULL	0.000
5	root	localhost:37846	test	Sleep	21		NULL	0.000
8	root	localhost:37852	test	Query	0	Waiting for global read lock	FLUSH TABLES WITH READ LOCK	0.000
9	root	localhost:37853	test	Query	0	init	SHOW PROCESSLIST	0.000
10	system user		NULL	Connect	1	Waiting for master to send event	NULL	0.000
11	system user		NULL	Connect	1	Waiting for work from SQL thread	NULL	0.000
12	system user		NULL	Connect	1	Waiting for prior transaction to commit	NULL	0.000
13	system user		NULL	Connect	1	Waiting for work from SQL thread	NULL	0.000
14	system user		test	Connect	17	Copying to tmp table	insert into t3 select * from t3 /* 9 */	0.000
15	system user		NULL	Connect	1	Slave has read all relay log; waiting for the slave I/O thread to update it	NULL	0.000

select * from information_schema.metadata_lock_info;
THREAD_ID	LOCK_MODE	LOCK_DURATION	LOCK_TYPE	TABLE_SCHEMA	TABLE_NAME
14	MDL_INTENTION_EXCLUSIVE	MDL_STATEMENT	Global read lock		
12	MDL_INTENTION_EXCLUSIVE	MDL_EXPLICIT	Commit lock		
12	MDL_SHARED_WRITE	MDL_TRANSACTION	Table metadata lock	mysql	gtid_slave_pos
14	MDL_SHARED_WRITE	MDL_TRANSACTION	Table metadata lock	test	t3
12	MDL_SHARED_WRITE	MDL_TRANSACTION	Table metadata lock	test	t2

Comment by Kristian Nielsen [ 2015-03-24 ]

Ok, so I think this is what happens.

We have transactions T1 followed by T2.

T2 gets ready to commit. It acquires the commit metadata lock in
ha_commit_trans().

Then in the binlog T2 needs to wait for T1 to commit first.

FLUSH TABLES WITH READ LOCK runs, it needs to wait for the commit lock held by
T2.

T1 runs its INSERT statement, it gets blocked by the running FLUSH TABLES WITH
READ LOCK.

At this point, we have a deadlock.

So the question is what the correct behaviour should be here.

It is not safe to abort the commit of T2 with an error (and possibly retry it
later). Because T2 may be non-transactional and not safe to roll back.

One option seems to be to allow T1 to run despite the pending FTWRL. Because
arguably, T1 already owns the commit lock together with T2. This would require
that all places that a query (like INSERT) waits for FTWRL, the wait can be
skipped in the parallel replication case, and it also requires some way to
ensure that a later T3 will be blocked by FTWRL. I am not sure how to handle
these points at the moment.

Another option might be to let T2 release the commit lock when it goes to wait
for a prior commit. This would be in queue_for_group_commit() (and probably
ha_commit_one_phase() also needs similar attention?). And then re-acquire the
lock after the wait. Or, preferably, I guess, we would not acquire the commit
lock at all until after the wait for the prior commit. (Group commit might then
only have to acquire the lock once, in the leader.) I am at the moment not sure
what exactly is required to make this correct; it needs more knowledge of the
usage of the commit metadata lock...

Comment by Kristian Nielsen [ 2015-03-24 ]

A simpler, but less convenient, solution could be to forbid running FLUSH TABLES WITH READ LOCK while a parallel replication slave is running. The user would then need to STOP SLAVE; FLUSH TABLES WITH READ LOCK; START SLAVE. The net effect would be much the same, as the slave cannot do any work while FLUSH TABLES WITH READ LOCK is running anyway.

Comment by Guillaume Lefranc [ 2015-03-25 ]

I am concerned that this last solution would break most backup utilities when parallel replication is enabled.

If MariaDB ported backup locks over from Percona Server, that would no longer be a problem.

Comment by Kristian Nielsen [ 2015-03-27 ]

Agree, hopefully I can come up with a fix that "just works" and does not require any special user action.

Comment by Kristian Nielsen [ 2015-06-02 ]

Current idea:

When FTWRL starts, it first checks all parallel replication worker
threads. It finds the most recent GTID started by any of them. It then sets
a flag to tell the threads not to start on any newer GTIDs, and then waits
for all earlier GTIDs to fully commit. It also sets a flag to tell START
SLAVE, STOP SLAVE, and the SQL thread to not start any new slave activity.

Once all worker threads have reached their designated point, FTWRL continues
to take the global read lock. Once that is obtained, it clears the flags and
signals worker threads and other slave code that it can proceed. At this
point, the lock is held, so no real activity will be possible until the lock
is cleared with UNLOCK TABLES.

This should hopefully fix the deadlock; at least I got Elena's test case to
pass with a preliminary patch along these lines.

Some care will probably be needed to guard against other deadlocks against
concurrent START SLAVE / STOP SLAVE, hopefully I can get that solved.

Comment by Kristian Nielsen [ 2015-06-09 ]

I have a fix for this now (that just removes the deadlock, without breaking
existing backup tools or otherwise changing user-visible behaviour).

It is in three parts. The actual fix is the second patch. The first patch is
just unrelated refactoring of the parallel replication code to make the
second patch cleaner. The third patch is a test case.

http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
http://lists.askmonty.org/pipermail/commits/2015-June/008060.html
http://lists.askmonty.org/pipermail/commits/2015-June/008061.html

Elena: You asked some days ago - this patch should be ready, if you want to
apply it in further testing.

EDIT: Updated with newest version of patches after fixing MDEV-8318.

Comment by Elena Stepanova [ 2015-06-14 ]

See MDEV-8318. I made it a separate report so as not to pollute this one too much.

Comment by Michael Widenius [ 2015-10-22 ]

Review of MDEV-7818: Deadlock occurring with parallel replication and FTWRL

http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
In this patch you added:

+/*
+ Do not start parallel execution of this event group until all prior groups
+ have reached the commit phase that are not safe to run in parallel with.
+*/
+static bool
+do_gco_wait(rpl_group_info *rgi, group_commit_orderer *gco,
+ bool *did_enter_cond, PSI_stage_info *old_stage)

You did however remove this comment:

-  /*
-    Register ourself to wait for the previous commit, if we need to do
-    such registration and that previous commit has not already
-    occured.
-
-    Also do not start parallel execution of this event group until all
-    prior groups have reached the commit phase that are not safe to run
-    in parallel with.
-  */

Why did you remove the first part of the comment?
Is that not true anymore?

In sql/rpl_parallel.h you added:

+ /*
+ The largest sub_id that has started its transaction. Protected by
+ LOCK_parallel_entry.
+
+ (Transactions can start out-of-order, so this value signifies that no
+ transactions with larger sub_id have started, but not necessarily that all
+ transactions with smaller sub_id have started).
+ */
+ uint64 largest_started_sub_id;

But this is not used anywhere.

http://lists.askmonty.org/pipermail/commits/2015-June/008060.html

+static void
+do_ftwrl_wait(rpl_group_info *rgi,
+ bool *did_enter_cond, PSI_stage_info *old_stage)
+{
+ THD *thd= rgi->thd;
+ rpl_parallel_entry *entry= rgi->parallel_entry;
+ uint64 sub_id= rgi->gtid_sub_id;
+
+ mysql_mutex_assert_owner(&entry->LOCK_parallel_entry);

Would be good to have a comment for the following test, like:

/*
  If FTWRL is active, wait until FTWRL is finished if we are a
  new transaction that started after the FTWRL command was given.
*/

Note that if you stored MAX_INT in entry->pause_sub_id when it is not
active, then the following test would be much simpler:

+ if (unlikely(entry->pause_sub_id > 0) && sub_id > entry->pause_sub_id)
+ {
+ thd->ENTER_COND(&entry->COND_parallel_entry, &entry->LOCK_parallel_entry,
+ &stage_waiting_for_ftwrl, old_stage);

As you don't have an EXIT_COND() matching the above ENTER_COND, please
add a comment that EXIT_COND() will be called by the caller
(handle_rpl_parallel_thread())

<cut>

Wouldn't a better name for the following function be:
rpl_unpause_after_ftwrl(THD *thd) ?

+void
+rpl_unpause_for_ftwrl(THD *thd)
+{
+ uint32 i;
+ rpl_parallel_thread_pool *pool= &global_rpl_thread_pool;
+
+ DBUG_ASSERT(pool->busy);
+
+ for (i= 0; i < pool->count; ++i)
+ {
+ rpl_parallel_entry *e;
+ rpl_parallel_thread *rpt= pool->threads[i];
+
+ mysql_mutex_lock(&rpt->LOCK_rpl_thread);
+ if (!rpt->current_owner)
+ {
+   mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+   continue;
+ }

Can the above thread ever have e->pause_sub_id != 0? For example, if the
thread did have a current_owner during pause but does not have an owner now?

+ e= rpt->current_entry;
+ mysql_mutex_lock(&e->LOCK_parallel_entry);
+ mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+ e->pause_sub_id= 0;
+ mysql_cond_broadcast(&e->COND_parallel_entry);

Don't you need to unlock mysql_mutex_lock(&e->LOCK_parallel_entry) here ?

+ }

+ mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
+ pool_mark_not_busy(pool);
+ mysql_cond_broadcast(&pool->COND_rpl_thread_pool);
+ mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);

As pool_mark_not_busy() is already doing the broadcast, you can remove
it from above.

If we always have to take the above mutexes when calling pool_mark_not_busy()
it would be better if we move the lock and unlock inside pool_mark_not_busy()

<cut>
+rpl_pause_for_ftwrl(THD *thd)
+{
<cut>

+ for (i= 0; i < pool->count; ++i)
+ {
+ PSI_stage_info old_stage;
+ rpl_parallel_entry *e;
+ rpl_parallel_thread *rpt= pool->threads[i];
+
+ mysql_mutex_lock(&rpt->LOCK_rpl_thread);
+ if (!rpt->current_owner)
+ {
+   mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+   continue;
+ }

+ e= rpt->current_entry;
+ mysql_mutex_lock(&e->LOCK_parallel_entry);
+ mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
+ ++e->need_sub_id_signal;
+ if (!e->pause_sub_id)
+ e->pause_sub_id= e->largest_started_sub_id;

Why the above test?
In which case can pause_sub_id be != 0 here ?
If it's != 0, what does it mean?

+ thd->ENTER_COND(&e->COND_parallel_entry, &e->LOCK_parallel_entry,
+ &stage_waiting_for_ftwrl_threads_to_pause, &old_stage);
+ while (e->last_committed_sub_id < e->pause_sub_id && !err)
+ {
+ if (thd->check_killed())
+ {
+   thd->send_kill_message();
+   err= 1;
+   break;
+ }

+ mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
+ };
+ --e->need_sub_id_signal;
+ thd->EXIT_COND(&old_stage);
+ if (err)
+ break;
+ }
+
+ if (err)
+ rpl_unpause_for_ftwrl(thd);
+ return err;
+}

@@ -1106,7 +1302,14 @@ rpl_parallel_change_thread_count(rpl_parallel_thread_pool *pool,
*/
for (i= 0; i < pool->count; ++i)
{

- rpl_parallel_thread *rpt= pool->get_thread(NULL, NULL);
+ rpl_parallel_thread *rpt;
+
+ mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
+ while ((rpt= pool->free_list) == NULL)
+   mysql_cond_wait(&pool->COND_rpl_thread_pool, &pool->LOCK_rpl_thread_pool);
+ pool->free_list= rpt->next;
+ mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
+ mysql_mutex_lock(&rpt->LOCK_rpl_thread);

I see you used the original code from get_thread(), but you don't reset
rpt->current_owner or rpt->current_entry anymore. Is that not needed?

@@ -1496,8 +1703,14 @@ rpl_parallel_thread_pool::get_thread(rpl_parallel_thread **owner,
rpl_parallel_thread *rpt;

mysql_mutex_lock(&LOCK_rpl_thread_pool);

- while ((rpt= free_list) == NULL)
+ for (;;)
+ {
+   while (unlikely(busy))
+     mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);
+   if ((rpt= free_list) != NULL)
+     break;
    mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);
+ }

Why not use:

while (unlikely(busy) || !(rpt= free_list))
mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);

--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -4259,6 +4259,17 @@ case SQLCOM_PREPARE:
break;
}

+ if (lex->type & REFRESH_READ_LOCK)
+ {
+   /*
+     We need to pause any parallel replication slave workers during FLUSH
+     TABLES WITH READ LOCK. Otherwise we might cause a deadlock, as
+     worker threads can run in arbitrary order but need to commit in a
+     specific given order.
+   */
+   if (rpl_pause_for_ftwrl(thd))
+     goto error;
+ }

/*
reload_acl_and_cache() will tell us if we are allowed to write to the
binlog or not.
@@ -4289,6 +4300,8 @@ case SQLCOM_PREPARE:
if (!res)
my_ok(thd);
}
+ if (lex->type & REFRESH_READ_LOCK)
+ rpl_unpause_for_ftwrl(thd);

break;
}

Why have the above code in sql_parse.cc, instead of in
reload_acl_and_cache() around the call to

'thd->global_read_lock.lock_global_read_lock(thd)'

This would avoid calling rpl_pause_for_ftwrl(thd) if tables are locked,
and in other possible error cases detected early.

On the other hand, you may also need to take care of the code just above
that calls

flush_tables_with_read_lock(thd, all_tables)

This happens when you give tables as an argument to FLUSH TABLES WITH READ LOCK.

I would assume that this can also cause a deadlock.

The easy fix is probably to add rpl_pause / rpl_unpause also to this
function, just before it calls lock_table_names().

------------

http://lists.askmonty.org/pipermail/commits/2015-June/008061.html

I didn't see a test for doing FLUSH TABLES WITH READ LOCK twice in a row.

This should detect if there was ever any issue with
rpl_unpause_for_ftwrl() not being properly called after
rpl_pause...().

Comment by Kristian Nielsen [ 2015-11-14 ]

Pushed to 10.0 and 10.1:

http://lists.askmonty.org/pipermail/commits/2015-November/008627.html
http://lists.askmonty.org/pipermail/commits/2015-November/008628.html
