[MDEV-7818] Deadlock occurring with parallel replication and FTWRL Created: 2015-03-23 Updated: 2016-12-01 Resolved: 2015-11-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0 |
| Fix Version/s: | 10.0.23, 10.1.9 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Guillaume Lefranc | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | parallelslave, verified | ||
| Issue Links: |
|
||||||||
| Sprint: | 10.0.22, 10.1.9-1, 10.1.9-2 | ||||||||
| Description |
|
Thread 5 is locked by global read lock but FTWRL never completes and stays in waiting for commit lock; replication thread stays locked forever until FTWRL is killed
|
| Comments |
| Comment by Elena Stepanova [ 2015-03-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am able to reproduce it with this (very crude and not very reliable) test:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-03-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok, so I think this is what happens. We have transactions T1 followed by T2. T2 gets ready to commit. It acquires the commit metadata lock in Then in the binlog T2 needs to wait for T1 to commit first. FLUSH TABLES WITH READ LOCK runs, it needs to wait for the commit lock held by T1 runs its INSERT statement, it gets blocked by the running FLUSH TABLES WITH At this point, we have a deadlock. So the question is what the correct behaviour should be here. It is not safe to abort the commit of T2 with an error (and possibly retry it One option seems to be to allow T1 to run despite the pending FTWRL. Because Another option might be to let T2 release the commit lock when it goes to wait | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-03-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A simpler, but less convenient, solution could be to forbid running FLUSH TABLES WITH READ LOCK while a parallel replication slave is running. The user would then need to STOP SLAVE; FLUSH TABLES WITH READ LOCK; START SLAVE. The net effect would be much the same, as the slave cannot do any work while FLUSH TABLES WITH READ LOCK is running anyway. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Guillaume Lefranc [ 2015-03-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am concerned by the fact that this last solution would break most backup utilities when parallel replication is enabled. If MariaDB would port backup locks over from Percona then that would not be a problem anymore | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-03-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Agree, hopefully I can come up with a fix that "just works" and does not require any special user action. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-06-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Current idea: When FTWRL starts, it first checks all parallel replication worker Once all worker threads have reached their designated point, FTWLR continues This should hopefully fix the deadlock, at least I got the test case of Some care will probably be needed to guard against other deadlocks against | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-06-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have a fix for this now (that just removes the deadlock, without breaking It is in three parts. The actual fix is the second patch. The first patch is http://lists.askmonty.org/pipermail/commits/2015-June/008059.html Elena: You asked some days ago - this patch should be ready, if you want to EDIT: Updated with newest version of patches after fixing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-06-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
See | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2015-10-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Review of http://lists.askmonty.org/pipermail/commits/2015-June/008059.html +/* You did however remove this comment:
Why did you remove the first part of the comment? In sql/rpl_parallel.h you added: + /* But this is not used anywhere. http://lists.askmonty.org/pipermail/commits/2015-June/008060.html +static void Would be good to have a comment for the following test, like: /* Note that if you would store MAX_INT in entry->pause_sub_id when it's not + if (unlikely(entry->pause_sub_id > 0) && sub_id > entry->pause_sub_id) As you don't have an EXIT_COND() matching the above ENTER_COND, please <cut> Wouldn't a better name for the following function be: void + e= rpt->current_entry; Don't you need to unlock mysql_mutex_lock(&e->LOCK_parallel_entry) here ? + } + mysql_mutex_lock(&pool->LOCK_rpl_thread_pool); As pool_mark_not_busy() is already doing the broadcast, you can remove If we always have to take the above mutexes when calling pool_mark_not_busy() <cut> + for (i= 0; i < pool->count; ++i) + e= rpt->current_entry; Why the above test? + thd->ENTER_COND(&e->COND_parallel_entry, &e->LOCK_parallel_entry, + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry); @@ -1106,7 +1302,14 @@ rpl_parallel_change_thread_count(rpl_parallel_thread_pool *pool,
I see you used the original code from get_thread(), but you don't reset @@ -1496,8 +1703,14 @@ rpl_parallel_thread_pool::get_thread(rpl_parallel_thread **owner, mysql_mutex_lock(&LOCK_rpl_thread_pool);
Why not use: while (unlikely(busy) || !(rpt= free_list)) — a/sql/sql_parse.cc + if (lex->type & REFRESH_READ_LOCK) /* break; Why have the above code in sql_parse.cc, instead of in 'thd->global_read_lock.lock_global_read_lock(thd))' This would avoid calling rpl_pause_for_ftwrl(thd) if tables are locked On the other hand, you may also need to take care of the code just above flush_tables_with_read_lock(thd, all_tables) This happens when you give tables as a argument to FLUSH TABLES WITH READ LOCK. I would assume that this can also cause a dead lock. The easy fix is probably to add rpl_pause / rpl_unpause also to this ------------ http://lists.askmonty.org/pipermail/commits/2015-June/008061.html I didn't see test for doing FLUSH TABLES WITH READ LOCK, 2 in a row. This should detect if there ever was any issues with | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-11-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Pushed to 10.0 and 10.1: http://lists.askmonty.org/pipermail/commits/2015-November/008627.html |