[MDEV-25336] Parallel replication causes failed assert while restarting Created: 2021-04-05  Updated: 2021-05-18  Resolved: 2021-05-14

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2, 10.3
Fix Version/s: 10.2.39, 10.3.30, 10.4.20, 10.5.11

Type: Bug
Priority: Critical
Reporter: Sachin Setiya (Inactive)
Assignee: Sachin Setiya (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment: Arch Linux, Ubuntu



 Description   

Test Case

--source include/master-slave.inc
 
--connection slave
--source include/stop_slave.inc
--let $old_parallel= `select @@GLOBAL.slave_parallel_threads`
SET GLOBAL slave_parallel_threads=8;
--source include/start_slave.inc
 
--let $rpl_server_no= 2
--source include/rpl_restart_server.inc
 
--connection slave
--eval SET GLOBAL slave_parallel_threads= $old_parallel
--source include/start_slave.inc
--source include/rpl_end.inc

The test does not fail in 10.5.

Failure

rpl.tmp 'mix'                            [ fail ]  Found warnings/errors in server log file!
        Test ended at 2021-04-05 09:19:24
line
mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
Attempting backtrace. You can use the following information to find out
^ Found warnings in /home/sachin/10.3/mysql-test/var/log/mysqld.2.err
ok



 Comments   
Comment by Sachin Setiya (Inactive) [ 2021-04-05 ]

Under rr the failure shows a different call stack:

(rr) bt
#0  0x0000000070000002 in ?? ()
#1  0x00007f4fce94c473 in _raw_syscall () at src/preload/raw_syscall.S:120
#2  0x00007f4fce94a477 in traced_raw_syscall (call=0x7f4f9cdfafa0) at src/preload/syscallbuf.c:274
#3  syscall_hook_internal (call=0x7f4f9cdfafa0) at src/preload/syscallbuf.c:3330
#4  syscall_hook (call=0x7f4f9cdfafa0) at src/preload/syscallbuf.c:3364
#5  0x00007f4fce947330 in _syscall_hook_trampoline () at src/preload/syscall_hook.S:313
#6  0x00007f4fce94738f in __morestack () at src/preload/syscall_hook.S:458
#7  0x00007f4fce947396 in _syscall_hook_trampoline_48_3d_01_f0_ff_ff () at src/preload/syscall_hook.S:472
#8  0x00007f4fcdf21201 in kill () from /usr/lib/libc.so.6
#9  0x0000555c5d145f23 in handle_fatal_signal (sig=6) at sql/signal_handler.cc:367
#10 <signal handler called>
#11 0x0000555c5d9d8f4f in my_timer_cycles () at mysys/my_rdtsc.c:170
#12 0x0000555c5d94fc97 in end_mutex_wait_v1 (locker=0x7f4f9d5fad00, rc=0) at storage/perfschema/pfs.cc:3488
#13 0x0000555c5d3a2de3 in PolicyMutex<TTASEventMutex<GenericPolicy> >::pfs_end (this=0x555c5e2a5658 <srv_sys+152>, locker=0x7f4f9d5fad00, ret=0) at storage/innobase/include/ib0mutex.h:738
#14 0x0000555c5d3a0e61 in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x555c5e2a5658 <srv_sys+152>, n_spins=30, n_delay=4, name=0x555c5dcf91a8 "storage/innobase/srv/srv0srv.cc", line=944) at storage/innobase/include/ib0mutex.h:596
#15 0x0000555c5d591b28 in srv_release_threads (type=SRV_WORKER, n=3) at storage/innobase/srv/srv0srv.cc:944
#16 0x0000555c5d5963c7 in srv_purge_coordinator_thread (arg=0x0) at storage/innobase/srv/srv0srv.cc:2797
#17 0x00007f4fce90a299 in start_thread () from /usr/lib/libpthread.so.0
#18 0x00007f4fcdfe3053 in clone () from /usr/lib/libc.so.6

Comment by Sachin Setiya (Inactive) [ 2021-04-06 ]

The issue is as follows.

The kill-server thread calls close_connections(), which contains this loop:

  /*
    Force remaining threads to die by closing the connection to the client
    This will ensure that threads that are waiting for a command from the
    client on a blocking read call are aborted.
  */
 
  for (;;)
  {
    mysql_mutex_lock(&LOCK_thread_count); // For unlink from list
    if (!(tmp=threads.get()))
    {
      mysql_mutex_unlock(&LOCK_thread_count);
      break;
    }

When we call threads.get(), it unlinks the element from the linked list before returning it:

  inline struct ilink *get()
  {
    struct ilink *first_link=first;
    if (first_link == &last)
      return 0;
    first_link->unlink();			// Unlink from list
    return first_link;
  }
 
  inline void unlink()
  {
    /* Extra tests because element doesn't have to be linked */
    if (prev) *prev= next;
    if (next) next->prev=prev;
    prev=0 ; next=0;
  }
 

But handle_rpl_parallel_thread cleans up its THD as follows, and unlink_not_visible_thd() asserts that the THD is still linked:

  THD_CHECK_SENTRY(thd);
  unlink_not_visible_thd(thd);
  delete thd;
 
inline void unlink_not_visible_thd(THD *thd)
{
  thd->assert_linked();
  mysql_mutex_lock(&LOCK_thread_count);
  thd->unlink();
  mysql_mutex_unlock(&LOCK_thread_count);
}
 
  inline void assert_linked()
  {
    DBUG_ASSERT(prev != 0 && next != 0);
  }
 

So if threads.get() in close_connections() runs before the worker threads have had time to clean up, the worker's THD is already unlinked when it reaches assert_linked(), and we get this assertion failure.
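
The ordering described above can be replayed on a single thread. The following is only a minimal sketch, not the server code: Link and List are hypothetical, simplified stand-ins for ilink and the global threads list, and the two helper functions are invented for illustration. It assumes the list head uses a sentinel node, as the quoted get() suggests, and it replaces the real DBUG_ASSERT with a boolean check so that the sketch does not abort.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the server's intrusive list node.
struct Link {
  Link **prev = nullptr;  // address of the pointer that points at this node
  Link *next = nullptr;

  // Same logic as the unlink() quoted above: safe on an unlinked node.
  void unlink() {
    if (prev) *prev = next;
    if (next) next->prev = prev;
    prev = nullptr;
    next = nullptr;
  }

  // The condition that assert_linked() checks.
  bool is_linked() const { return prev != nullptr && next != nullptr; }
};

// Hypothetical, simplified stand-in for the global "threads" list.
struct List {
  Link *first;
  Link last;  // sentinel marking the end of the list

  List() : first(&last) { last.prev = &first; }

  void push_front(Link *a) {
    first->prev = &a->next;
    a->next = first;
    a->prev = &first;
    first = a;
  }

  // Pops the first element and unlinks it, like the get() quoted above.
  Link *get() {
    Link *first_link = first;
    if (first_link == &last) return nullptr;
    first_link->unlink();
    return first_link;
  }
};

// Shutdown pops the worker first; the worker checks its links afterwards.
bool worker_still_linked_after_shutdown_pop() {
  List threads;
  Link worker;
  threads.push_front(&worker);
  threads.get();              // what close_connections() does
  return worker.is_linked();  // false: this is where the assert would fire
}

// Calling unlink() without the assertion tolerates the same ordering.
bool tolerant_unlink_leaves_list_empty() {
  List threads;
  Link worker;
  threads.push_front(&worker);
  threads.get();
  worker.unlink();  // no-op on an already unlinked node
  return threads.first == &threads.last;
}
```

In this sketch worker_still_linked_after_shutdown_pop() returns false, which corresponds to the failed assertion, while tolerant_unlink_leaves_list_empty() returns true because unlink() guards against null prev and next pointers.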

Comment by Sachin Setiya (Inactive) [ 2021-04-13 ]

It does not fail in 10.1. In 10.1 we call

  thd->unlink();

directly, without asserting that the THD is still linked.

Comment by Sachin Setiya (Inactive) [ 2021-04-14 ]

MDEV-20821 and MDEV-22370 need to be backported; that will fix the issue.

Comment by Sachin Setiya (Inactive) [ 2021-04-14 ]

Patch branch bb-10.2-sachin

Comment by Andrei Elkin [ 2021-04-21 ]

Asked questions and suggested TODOs.

Comment by Sachin Setiya (Inactive) [ 2021-04-29 ]

Patch updated bb-10.2-sachin

Comment by Andrei Elkin [ 2021-05-07 ]

The patch looks good! Thanks.
