Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Versions: 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Description
mtr --repeat=10 rpl.rpl_parallel2 sometimes hangs in 10.5.
It works in 10.4.
The problem in 10.5 started after a merge from 10.3 to 10.5 on July 2 that enabled the test in 10.5. The issue is probably caused by the slightly different implementation of FLUSH TABLES WITH READ LOCK in 10.5 compared to 10.4.
The hang happens in the reap of "FLUSH TABLES WITH READ LOCK" at line 157 in suite/rpl/t/rpl_parallel2.test.
Other things:
- The comments in ba02550166eb39c0375a6422ecaa4731421250b6 may be useful for finding and fixing the bug.
- The code for flush_tables_with_read_lock() is in sql_reload.cc. I would suggest comparing the 10.4 and 10.5 versions of the function to find out what is going on.
Attachments
Issue Links
- relates to MDEV-23381: rpl_parallel2 fails in 10.1 to 10.3 if slave_parallel_mode is changed to optimistic (Closed)
Activity
Prototype Patch
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index 94882230682..3ff36a63830 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -396,13 +396,14 @@ do_gco_wait(rpl_group_info *rgi, group_commit_orderer *gco,
 }
 
 
-static void
+static bool
 do_ftwrl_wait(rpl_group_info *rgi,
               bool *did_enter_cond, PSI_stage_info *old_stage)
 {
   THD *thd= rgi->thd;
   rpl_parallel_entry *entry= rgi->parallel_entry;
   uint64 sub_id= rgi->gtid_sub_id;
+  bool aborted= false;
   DBUG_ENTER("do_ftwrl_wait");
 
   mysql_mutex_assert_owner(&entry->LOCK_parallel_entry);
@@ -425,7 +426,10 @@ do_ftwrl_wait(rpl_group_info *rgi,
   do
   {
     if (entry->force_abort || rgi->worker_error)
+    {
+      aborted= true;
       break;
+    }
     if (unlikely(thd->check_killed()))
     {
       slave_output_error_info(rgi, thd);
@@ -444,7 +448,7 @@ do_ftwrl_wait(rpl_group_info *rgi,
   if (sub_id > entry->largest_started_sub_id)
     entry->largest_started_sub_id= sub_id;
 
-  DBUG_VOID_RETURN;
+  DBUG_RETURN(aborted);
 }
 
 
@@ -1224,7 +1228,7 @@ handle_rpl_parallel_thread(void *arg)
       rgi->worker_error= 1;
     }
     if (likely(!skip_event_group))
-      do_ftwrl_wait(rgi, &did_enter_cond, &old_stage);
+      skip_event_group= do_ftwrl_wait(rgi, &did_enter_cond, &old_stage);
 
     /*
       Register ourself to wait for the previous commit, if we need to do
|
More info on deadlock
So the issue is this: we issue FTWRL, the worker thread is waiting in do_ftwrl_wait(), and then from another connection we issue STOP SLAVE, which sets force_abort = 1. So we exit out of this loop in do_ftwrl_wait():
do
{
  if (entry->force_abort || rgi->worker_error)
    break;
  if (unlikely(thd->check_killed()))
  {
    slave_output_error_info(rgi, thd);
    signal_error_to_sql_driver_thread(thd, rgi, 1);
    break;
  }
  mysql_cond_wait(&entry->COND_parallel_entry, &entry->LOCK_parallel_entry);
} while (sub_id > entry->pause_sub_id);
and we get a deadlock later (the details are not important). So the issue is: after force_abort we should not be processing events. But we have the logic that, if an event is already in the middle of processing, STOP SLAVE will wait until it completes.
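To make the fix concrete, below is a minimal standalone sketch of the pattern the prototype patch introduces: the wait function reports whether it ended because of force_abort, and the caller skips the event group in that case. This is not the server code; std::mutex/std::condition_variable stand in for LOCK_parallel_entry/COND_parallel_entry, and the simplified Entry struct with a pause_for_ftwrl flag is illustrative only (the real code compares sub_id against entry->pause_sub_id).

// Sketch of the abortable FTWRL-style wait, modelled on the prototype
// patch above. Names here are illustrative stand-ins, not server code.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

struct Entry
{
  std::mutex lock;                 // stands in for LOCK_parallel_entry
  std::condition_variable cond;    // stands in for COND_parallel_entry
  bool pause_for_ftwrl= false;     // FTWRL in progress, workers must wait
  bool force_abort= false;         // set by STOP SLAVE
};

// Like the patched do_ftwrl_wait(): returns true if the wait ended
// because of force_abort, so the caller must skip the event group.
static bool do_ftwrl_wait(Entry *entry)
{
  std::unique_lock<std::mutex> guard(entry->lock);
  while (entry->pause_for_ftwrl)
  {
    if (entry->force_abort)
      return true;                 // aborted: do NOT process the event
    entry->cond.wait(guard);
  }
  return false;
}

int main()
{
  Entry entry;
  entry.pause_for_ftwrl= true;     // FTWRL has paused the workers

  std::thread worker([&entry]
  {
    bool skip_event_group= do_ftwrl_wait(&entry);
    // Before the patch the worker would go on to apply the event group
    // even when aborted, and could deadlock against FTWRL; now it skips.
    std::printf(skip_event_group ? "event group skipped\n"
                                 : "event group applied\n");
  });

  {                                // STOP SLAVE from another connection
    std::lock_guard<std::mutex> guard(entry.lock);
    entry.force_abort= true;
  }
  entry.cond.notify_all();
  worker.join();
  return 0;
}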
Hello Sachin,
Thank you for working on this issue. The changes look good.
Requesting review from monty.
To be able to do my part of this patch, I would need some more information:
- Sujatha, first: never use restricted comments except when it comes to information related to customers.
- Sujatha, second: when you start working on a patch, you should at once assign it to yourself. When wanting a review, you should put the patch into the review stage.
- Sujatha, in which versions does the bug exist? When I tested things with 10.4 I didn't see any problems, while in 10.5 it seems to always fail (on some platforms at least).
- Do we have a problem in 10.4 or not? If not in 10.4, what is the difference between 10.4 and 10.5 that causes 10.5 to fail?
And last, but not least: there is no patch attached to this Jira entry that I can review! I assume that the 'prototype' patch is not the final version (as it's still marked as prototype).
It fails in 10.4 also, but for that one has to change slave_parallel_mode to optimistic. In 10.1 to 10.3 there is a debug assert when I change slave_parallel_mode, so the patch in bb-10.5-23089 applies to 10.4 and 10.5.
If I change slave_parallel_mode in 10.1, 10.2, 10.3 I get the following assert:
Thread 1 (Thread 0x7f1802b68700 (LWP 24673)):
#0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
#1  0x0000555bd263e22f in my_write_core (sig=6) at mysys/stacktrace.c:477
#2  0x0000555bd1fecb79 in handle_fatal_signal (sig=6) at sql/signal_handler.cc:296
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f1808af58b1 in __GI_abort () at abort.c:79
#6  0x00007f1808ae542a in __assert_fail_base (fmt=0x7f1808c6ca38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x555bd2743518 "(mdl_request->type != MDL_INTENTION_EXCLUSIVE && mdl_request->type != MDL_EXCLUSIVE) || !(get_thd()->rgi_slave && get_thd()->rgi_slave->is_parallel_exec && lock->check_if_conflicting_replication_locks(this))", file=file@entry=0x555bd2742f8a "sql/mdl.cc", line=line@entry=2104, function=function@entry=0x555bd2743da0 <MDL_context::acquire_lock(MDL_request*, double)::__PRETTY_FUNCTION__> "bool MDL_context::acquire_lock(MDL_request*, double)") at assert.c:92
#7  0x00007f1808ae54a2 in __GI___assert_fail (assertion=0x555bd2743518 "(mdl_request->type != MDL_INTENTION_EXCLUSIVE && mdl_request->type != MDL_EXCLUSIVE) || !(get_thd()->rgi_slave && get_thd()->rgi_slave->is_parallel_exec && lock->check_if_conflicting_replication_locks(this))", file=0x555bd2742f8a "sql/mdl.cc", line=2104, function=0x555bd2743da0 <MDL_context::acquire_lock(MDL_request*, double)::__PRETTY_FUNCTION__> "bool MDL_context::acquire_lock(MDL_request*, double)") at assert.c:101
#8  0x0000555bd1ef3567 in MDL_context::acquire_lock (this=0x7f17ef051168, mdl_request=0x7f1802b66fa0, lock_wait_timeout=31536000) at sql/mdl.cc:2100
#9  0x0000555bd1d46649 in open_table (thd=0x7f17ef051070, table_list=0x7f1802b67590, ot_ctx=0x7f1802b672e0) at sql/sql_base.cc:2403
#10 0x0000555bd1d496d4 in open_and_process_table (thd=0x7f17ef051070, tables=0x7f1802b67590, counter=0x7f1802b67374, flags=0, prelocking_strategy=0x7f1802b673f8, has_prelocking_list=false, ot_ctx=0x7f1802b672e0) at sql/sql_base.cc:4168
#11 0x0000555bd1d4a4bd in open_tables (thd=0x7f17ef051070, options=..., start=0x7f1802b67358, counter=0x7f1802b67374, flags=0, prelocking_strategy=0x7f1802b673f8) at sql/sql_base.cc:4627
#12 0x0000555bd1d4bc22 in open_and_lock_tables (thd=0x7f17ef051070, options=..., tables=0x7f1802b67590, derived=false, flags=0, prelocking_strategy=0x7f1802b673f8) at sql/sql_base.cc:5386
#13 0x0000555bd1d149a1 in open_and_lock_tables (thd=0x7f17ef051070, tables=0x7f1802b67590, derived=false, flags=0) at sql/sql_base.h:547
#14 0x0000555bd1f4d7d2 in rpl_slave_state::record_gtid (this=0x7f1808047c00, thd=0x7f17ef051070, gtid=0x7f1802b67be0, sub_id=20, rgi=0x7f17f0c1a800, in_statement=false) at sql/rpl_gtid.cc:558
#15 0x0000555bd20ec9a8 in Xid_log_event::do_apply_event (this=0x7f17f0c7b670, rgi=0x7f17f0c1a800) at sql/log_event.cc:7703
#16 0x0000555bd1d068ad in Log_event::apply_event (this=0x7f17f0c7b670, rgi=0x7f17f0c1a800) at sql/log_event.h:1343
#17 0x0000555bd1cfc2f2 in apply_event_and_update_pos_apply (ev=0x7f17f0c7b670, thd=0x7f17ef051070, rgi=0x7f17f0c1a800, reason=0) at sql/slave.cc:3482
#18 0x0000555bd1cfc792 in apply_event_and_update_pos_for_parallel (ev=0x7f17f0c7b670, thd=0x7f17ef051070, rgi=0x7f17f0c1a800) at sql/slave.cc:3626
#19 0x0000555bd1f529d7 in rpt_handle_event (qev=0x7f17f0c6e270, rpt=0x7f17f0c7ece0) at sql/rpl_parallel.cc:50
#20 0x0000555bd1f5584f in handle_rpl_parallel_thread (arg=0x7f17f0c7ece0) at sql/rpl_parallel.cc:1274
#21 0x0000555bd2305f93 in pfs_spawn_thread (arg=0x7f17f0c58270) at storage/perfschema/pfs.cc:1868
#22 0x00007f18095d46db in start_thread (arg=0x7f1802b68700) at pthread_create.c:463
#23 0x00007f1808bd6a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thanks for the information; I will do the review of the patch tomorrow morning.
Removing versions 10.1 to 10.3, since they fail with a different error. I have created MDEV-23381 for it.
It fails in 10.5 because the default slave_parallel_mode is optimistic in 10.5 and conservative in 10.4. If I change the mode in 10.4 it fails, and vice versa.