Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Versions: 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Description
mtr --repeat=10 rpl.rpl_parallel2 sometimes hangs in 10.5.
It works in 10.4.
The problem in 10.5 started after a merge from 10.3 to 10.5 on July 2 that enabled the test in 10.5. The issue is probably caused by the slightly different implementation of FLUSH TABLES WITH READ LOCK in 10.5 compared to 10.4.
The hang happens in the reap of "FLUSH TABLES WITH READ LOCK" at line 157 in suite/rpl/t/rpl_parallel2.test.
Other things:
- The comments in ba02550166eb39c0375a6422ecaa4731421250b6 may be useful for finding and fixing the bug.
- The code for flush_tables_with_read_lock() is in sql_reload.cc. I would suggest comparing the 10.4 and 10.5 versions of the function to find out what is going on.
Attachments
Issue Links
- relates to MDEV-23381: rpl_parallel2 fails in 10.1 to 10.3 if slave_parallel_mode is changed to optimistic (Closed)
Activity
Prototype Patch
diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index 94882230682..3ff36a63830 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -396,13 +396,14 @@ do_gco_wait(rpl_group_info *rgi, group_commit_orderer *gco,
 }
 
 
-static void
+static bool
 do_ftwrl_wait(rpl_group_info *rgi,
               bool *did_enter_cond, PSI_stage_info *old_stage)
 {
   THD *thd= rgi->thd;
   rpl_parallel_entry *entry= rgi->parallel_entry;
   uint64 sub_id= rgi->gtid_sub_id;
+  bool aborted= false;
   DBUG_ENTER("do_ftwrl_wait");
 
   mysql_mutex_assert_owner(&entry->LOCK_parallel_entry);
@@ -425,7 +426,10 @@ do_ftwrl_wait(rpl_group_info *rgi,
   do
   {
     if (entry->force_abort || rgi->worker_error)
+    {
+      aborted= true;
       break;
+    }
     if (unlikely(thd->check_killed()))
     {
       slave_output_error_info(rgi, thd);
@@ -444,7 +448,7 @@ do_ftwrl_wait(rpl_group_info *rgi,
   if (sub_id > entry->largest_started_sub_id)
     entry->largest_started_sub_id= sub_id;
 
-  DBUG_VOID_RETURN;
+  DBUG_RETURN(aborted);
 }
 
 
@@ -1224,7 +1228,7 @@ handle_rpl_parallel_thread(void *arg)
       rgi->worker_error= 1;
     }
     if (likely(!skip_event_group))
-      do_ftwrl_wait(rgi, &did_enter_cond, &old_stage);
+      skip_event_group= do_ftwrl_wait(rgi, &did_enter_cond, &old_stage);
 
     /*
       Register ourself to wait for the previous commit, if we need to do
|
More info on deadlock
So the issue is this: we issue FTWRL, the worker thread is waiting in do_ftwrl_wait(), and then from another connection we issue STOP SLAVE, which sets force_abort = 1. So we exit out of this loop in do_ftwrl_wait():
do
{
  if (entry->force_abort || rgi->worker_error)
    break;
  if (unlikely(thd->check_killed()))
  {
    slave_output_error_info(rgi, thd);
    signal_error_to_sql_driver_thread(thd, rgi, 1);
    break;
  }
  mysql_cond_wait(&entry->COND_parallel_entry, &entry->LOCK_parallel_entry);
} while (sub_id > entry->pause_sub_id);
and we get a deadlock later (the details are not important). So the issue is: after force_abort we should not be processing events. But we have the logic that, if an event is already in the middle of processing, STOP SLAVE will wait until it completes.
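To make the fix concrete, below is a minimal standalone sketch of the pattern the prototype patch introduces: the wait function reports whether it ended because of force_abort, and the caller skips the event group in that case. This is not the server code; std::mutex/std::condition_variable stand in for LOCK_parallel_entry/COND_parallel_entry, and the simplified Entry struct with a pause_for_ftwrl flag is illustrative only (the real code compares sub_id against entry->pause_sub_id).

// Sketch of the abortable FTWRL-style wait, modelled on the prototype
// patch above. Names here are illustrative stand-ins, not server code.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

struct Entry
{
  std::mutex lock;                 // stands in for LOCK_parallel_entry
  std::condition_variable cond;    // stands in for COND_parallel_entry
  bool pause_for_ftwrl= false;     // FTWRL in progress, workers must wait
  bool force_abort= false;         // set by STOP SLAVE
};

// Like the patched do_ftwrl_wait(): returns true if the wait ended
// because of force_abort, so the caller must skip the event group.
static bool do_ftwrl_wait(Entry *entry)
{
  std::unique_lock<std::mutex> guard(entry->lock);
  while (entry->pause_for_ftwrl)
  {
    if (entry->force_abort)
      return true;                 // aborted: do NOT process the event
    entry->cond.wait(guard);
  }
  return false;
}

int main()
{
  Entry entry;
  entry.pause_for_ftwrl= true;     // FTWRL has paused the workers

  std::thread worker([&entry]
  {
    bool skip_event_group= do_ftwrl_wait(&entry);
    // Before the patch the worker would go on to apply the event group
    // even when aborted, and could deadlock against FTWRL; now it skips.
    std::printf(skip_event_group ? "event group skipped\n"
                                 : "event group applied\n");
  });

  {                                // STOP SLAVE from another connection
    std::lock_guard<std::mutex> guard(entry.lock);
    entry.force_abort= true;
  }
  entry.cond.notify_all();
  worker.join();
  return 0;
}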
Hello Sachin,
Thank you for working on this issue. The changes look good.
Requesting review from monty.
To be able to do my part of this patch, I would need some more information:
- Sujatha, first: never use restricted comments except when it comes to information related to customers.
- Sujatha, second: when you start working on a patch, you should at once assign it to yourself. When wanting a review, you should put the patch into the review stage.
- Sujatha, in which versions does the bug exist? When I tested things with 10.4 I didn't see any problems, while in 10.5 it seems to always fail (on some platforms at least).
- Do we have a problem in 10.4 or not? If not in 10.4, what is the difference between 10.4 and 10.5 that causes 10.5 to fail?
And last, but not least: there is no patch attached to this Jira entry that I can review! I assume that the 'prototype' patch is not the final version (as it's still marked as prototype).
It fails in 10.4 also, but for that one has to change slave_parallel_mode to optimistic. In 10.1 to 10.3 there is a debug assert when I change slave_parallel_mode, so the patch in bb-10.5-23089 applies to 10.4 and 10.5.
If I change slave_parallel_mode in 10.1, 10.2, 10.3 I get the following assert:
Thread 1 (Thread 0x7f1802b68700 (LWP 24673)):
#0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
#1  0x0000555bd263e22f in my_write_core (sig=6) at mysys/stacktrace.c:477
#2  0x0000555bd1fecb79 in handle_fatal_signal (sig=6) at sql/signal_handler.cc:296
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f1808af58b1 in __GI_abort () at abort.c:79
#6  0x00007f1808ae542a in __assert_fail_base (fmt=0x7f1808c6ca38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x555bd2743518 "(mdl_request->type != MDL_INTENTION_EXCLUSIVE && mdl_request->type != MDL_EXCLUSIVE) || !(get_thd()->rgi_slave && get_thd()->rgi_slave->is_parallel_exec && lock->check_if_conflicting_replication_locks(this))", file=file@entry=0x555bd2742f8a "sql/mdl.cc", line=line@entry=2104, function=function@entry=0x555bd2743da0 <MDL_context::acquire_lock(MDL_request*, double)::__PRETTY_FUNCTION__> "bool MDL_context::acquire_lock(MDL_request*, double)") at assert.c:92
#7  0x00007f1808ae54a2 in __GI___assert_fail (assertion=0x555bd2743518 "(mdl_request->type != MDL_INTENTION_EXCLUSIVE && mdl_request->type != MDL_EXCLUSIVE) || !(get_thd()->rgi_slave && get_thd()->rgi_slave->is_parallel_exec && lock->check_if_conflicting_replication_locks(this))", file=0x555bd2742f8a "sql/mdl.cc", line=2104, function=0x555bd2743da0 <MDL_context::acquire_lock(MDL_request*, double)::__PRETTY_FUNCTION__> "bool MDL_context::acquire_lock(MDL_request*, double)") at assert.c:101
#8  0x0000555bd1ef3567 in MDL_context::acquire_lock (this=0x7f17ef051168, mdl_request=0x7f1802b66fa0, lock_wait_timeout=31536000) at sql/mdl.cc:2100
#9  0x0000555bd1d46649 in open_table (thd=0x7f17ef051070, table_list=0x7f1802b67590, ot_ctx=0x7f1802b672e0) at sql/sql_base.cc:2403
#10 0x0000555bd1d496d4 in open_and_process_table (thd=0x7f17ef051070, tables=0x7f1802b67590, counter=0x7f1802b67374, flags=0, prelocking_strategy=0x7f1802b673f8, has_prelocking_list=false, ot_ctx=0x7f1802b672e0) at sql/sql_base.cc:4168
#11 0x0000555bd1d4a4bd in open_tables (thd=0x7f17ef051070, options=..., start=0x7f1802b67358, counter=0x7f1802b67374, flags=0, prelocking_strategy=0x7f1802b673f8) at sql/sql_base.cc:4627
#12 0x0000555bd1d4bc22 in open_and_lock_tables (thd=0x7f17ef051070, options=..., tables=0x7f1802b67590, derived=false, flags=0, prelocking_strategy=0x7f1802b673f8) at sql/sql_base.cc:5386
#13 0x0000555bd1d149a1 in open_and_lock_tables (thd=0x7f17ef051070, tables=0x7f1802b67590, derived=false, flags=0) at sql/sql_base.h:547
#14 0x0000555bd1f4d7d2 in rpl_slave_state::record_gtid (this=0x7f1808047c00, thd=0x7f17ef051070, gtid=0x7f1802b67be0, sub_id=20, rgi=0x7f17f0c1a800, in_statement=false) at sql/rpl_gtid.cc:558
#15 0x0000555bd20ec9a8 in Xid_log_event::do_apply_event (this=0x7f17f0c7b670, rgi=0x7f17f0c1a800) at sql/log_event.cc:7703
#16 0x0000555bd1d068ad in Log_event::apply_event (this=0x7f17f0c7b670, rgi=0x7f17f0c1a800) at sql/log_event.h:1343
#17 0x0000555bd1cfc2f2 in apply_event_and_update_pos_apply (ev=0x7f17f0c7b670, thd=0x7f17ef051070, rgi=0x7f17f0c1a800, reason=0) at sql/slave.cc:3482
#18 0x0000555bd1cfc792 in apply_event_and_update_pos_for_parallel (ev=0x7f17f0c7b670, thd=0x7f17ef051070, rgi=0x7f17f0c1a800) at sql/slave.cc:3626
#19 0x0000555bd1f529d7 in rpt_handle_event (qev=0x7f17f0c6e270, rpt=0x7f17f0c7ece0) at sql/rpl_parallel.cc:50
#20 0x0000555bd1f5584f in handle_rpl_parallel_thread (arg=0x7f17f0c7ece0) at sql/rpl_parallel.cc:1274
#21 0x0000555bd2305f93 in pfs_spawn_thread (arg=0x7f17f0c58270) at storage/perfschema/pfs.cc:1868
#22 0x00007f18095d46db in start_thread (arg=0x7f1802b68700) at pthread_create.c:463
#23 0x00007f1808bd6a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thanks for the information; I will do the review of the patch tomorrow morning.
Removing versions 10.1 to 10.3, since they fail with a different error. I have created MDEV-23381 for it.
It fails in 10.5 because the default slave_parallel_mode is optimistic in 10.5 and conservative in 10.4. If I change the mode in 10.4 it fails, and vice versa.