  1. MariaDB Server
  2. MDEV-25336

Parallel replication causes failed assert while restarting

Details

    Description

      Test Case

      --source include/master-slave.inc
       
      --connection slave
      --source include/stop_slave.inc
      --let $old_parallel= `select @@GLOBAL.slave_parallel_threads`
      SET GLOBAL slave_parallel_threads=8;
      --source include/start_slave.inc
       
      --let $rpl_server_no= 2
      --source include/rpl_restart_server.inc
       
      --connection slave
      --eval SET GLOBAL slave_parallel_threads= $old_parallel
      --source include/start_slave.inc
      --source include/rpl_end.inc
      
      

      It does not fail in 10.5

      Failure

      rpl.tmp 'mix'                            [ fail ]  Found warnings/errors in server log file!
              Test ended at 2021-04-05 09:19:24
      line
      mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
      mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
      mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
      mysqld: sql/sql_list.h:642: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
      Attempting backtrace. You can use the following information to find out
      ^ Found warnings in /home/sachin/10.3/mysql-test/var/log/mysqld.2.err
      ok
      
      

        Activity

          With rr it gives a different call stack:

          (rr) bt
          #0  0x0000000070000002 in ?? ()
          #1  0x00007f4fce94c473 in _raw_syscall () at src/preload/raw_syscall.S:120
          #2  0x00007f4fce94a477 in traced_raw_syscall (call=0x7f4f9cdfafa0) at src/preload/syscallbuf.c:274
          #3  syscall_hook_internal (call=0x7f4f9cdfafa0) at src/preload/syscallbuf.c:3330
          #4  syscall_hook (call=0x7f4f9cdfafa0) at src/preload/syscallbuf.c:3364
          #5  0x00007f4fce947330 in _syscall_hook_trampoline () at src/preload/syscall_hook.S:313
          #6  0x00007f4fce94738f in __morestack () at src/preload/syscall_hook.S:458
          #7  0x00007f4fce947396 in _syscall_hook_trampoline_48_3d_01_f0_ff_ff () at src/preload/syscall_hook.S:472
          #8  0x00007f4fcdf21201 in kill () from /usr/lib/libc.so.6
          #9  0x0000555c5d145f23 in handle_fatal_signal (sig=6) at sql/signal_handler.cc:367
          #10 <signal handler called>
          #11 0x0000555c5d9d8f4f in my_timer_cycles () at mysys/my_rdtsc.c:170
          #12 0x0000555c5d94fc97 in end_mutex_wait_v1 (locker=0x7f4f9d5fad00, rc=0) at storage/perfschema/pfs.cc:3488
          #13 0x0000555c5d3a2de3 in PolicyMutex<TTASEventMutex<GenericPolicy> >::pfs_end (this=0x555c5e2a5658 <srv_sys+152>, locker=0x7f4f9d5fad00, ret=0) at storage/innobase/include/ib0mutex.h:738
          #14 0x0000555c5d3a0e61 in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x555c5e2a5658 <srv_sys+152>, n_spins=30, n_delay=4, name=0x555c5dcf91a8 "storage/innobase/srv/srv0srv.cc", line=944) at storage/innobase/include/ib0mutex.h:596
          #15 0x0000555c5d591b28 in srv_release_threads (type=SRV_WORKER, n=3) at storage/innobase/srv/srv0srv.cc:944
          #16 0x0000555c5d5963c7 in srv_purge_coordinator_thread (arg=0x0) at storage/innobase/srv/srv0srv.cc:2797
          #17 0x00007f4fce90a299 in start_thread () from /usr/lib/libpthread.so.0
          #18 0x00007f4fcdfe3053 in clone () from /usr/lib/libc.so.6
          
          

          (Comment above added by Sachin Setiya, sachin.setiya.007.)

          So the issue is this:

          The kill-server thread calls close_connections(), which contains this loop:

            /*
              Force remaining threads to die by closing the connection to the client
              This will ensure that threads that are waiting for a command from the
              client on a blocking read call are aborted.
            */
           
            for (;;)
            {
              mysql_mutex_lock(&LOCK_thread_count); // For unlink from list
              if (!(tmp=threads.get()))
              {
                mysql_mutex_unlock(&LOCK_thread_count);
                break;
              }
          
          

          Calling threads.get() unlinks the element from the linked list before returning it:

            inline struct ilink *get()
            {
              struct ilink *first_link=first;
              if (first_link == &last)
                return 0;
              first_link->unlink();			// Unlink from list
              return first_link;
            }
           
            inline void unlink()
            {
              /* Extra tests because element doesn't have to be linked */
              if (prev) *prev= next;
              if (next) next->prev=prev;
              prev=0 ; next=0;
            }
           
          
          

          But handle_rpl_parallel_thread, when it cleans up, calls

            THD_CHECK_SENTRY(thd);
            unlink_not_visible_thd(thd);
            delete thd;
           
          inline void unlink_not_visible_thd(THD *thd)
          {
            thd->assert_linked();
            mysql_mutex_lock(&LOCK_thread_count);
            thd->unlink();
            mysql_mutex_unlock(&LOCK_thread_count);
          }
           
            inline void assert_linked()
            {
              DBUG_ASSERT(prev != 0 && next != 0);
            }
           
          
          

          So if threads.get() in close_connections() runs before a worker thread has had time to clean up, the worker's THD is already unlinked when the worker reaches assert_linked(), and we get this assertion failure.
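
The ordering problem described above can be replayed in a single thread. The following is a minimal sketch, not the server code: `ilink_sketch`, `ilist_sketch`, `push_front`, `is_linked` and `worker_cleanup_would_assert` are hypothetical names, and only `get()`, `unlink()` and the asserted condition are modelled on the excerpts quoted in this comment. Locking is left out because the sketch runs the two steps back to back instead of racing them.

```cpp
#include <cassert>

// Simplified stand-in for the intrusive list element quoted above.
struct ilink_sketch
{
  ilink_sketch **prev= nullptr;   // address of the pointer that points at us
  ilink_sketch *next= nullptr;

  void unlink()
  {
    /* Extra tests because element doesn't have to be linked */
    if (prev) *prev= next;
    if (next) next->prev= prev;
    prev= nullptr; next= nullptr;
  }

  // Same condition as the DBUG_ASSERT in assert_linked().
  bool is_linked() const { return prev != nullptr && next != nullptr; }
};

// Simplified stand-in for the global `threads` list (assumption: a head
// pointer plus a sentinel element, as the `first == &last` test suggests).
struct ilist_sketch
{
  ilink_sketch *first;
  ilink_sketch last;              // sentinel marking the end of the list

  ilist_sketch() { first= &last; last.prev= &first; }

  void push_front(ilink_sketch *a)
  {
    first->prev= &a->next;
    a->next= first; a->prev= &first; first= a;
  }

  ilink_sketch *get()
  {
    ilink_sketch *first_link= first;
    if (first_link == &last)
      return nullptr;
    first_link->unlink();         // Unlink from list
    return first_link;
  }
};

// Replays the bad ordering: shutdown drains the list first, then the worker
// reaches the check that unlink_not_visible_thd() starts with.
// Returns true if that check would fail.
bool worker_cleanup_would_assert()
{
  ilist_sketch threads;
  ilink_sketch worker_thd;
  threads.push_front(&worker_thd);

  // close_connections(): get() pops and unlinks every element.
  while (threads.get()) {}

  // handle_rpl_parallel_thread(): assert_linked() requires both pointers set.
  return !worker_thd.is_linked();
}
```

Calling `worker_cleanup_would_assert()` from a small `main` returns true, which corresponds to the `prev != 0 && next != 0` assertion seen in the error log.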

          (Comment above added by Sachin Setiya, sachin.setiya.007.)

          It does not fail in 10.1. In 10.1 the worker cleanup calls

            thd->unlink();
          

          directly, without first asserting that the THD is still linked.
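
As a rough illustration of why the plain unlink() does not trip anything: with the unlink() logic quoted earlier, a second call on an already-unlinked element does nothing, because both pointers are null. This is a sketch under that assumption only; `link_node` and `second_unlink_is_noop` are hypothetical names, and the list is wired by hand instead of using the server's list class.

```cpp
#include <cassert>

// Hypothetical minimal node with the same unlink() logic as quoted above.
struct link_node
{
  link_node **prev= nullptr;
  link_node *next= nullptr;

  void unlink()
  {
    if (prev) *prev= next;
    if (next) next->prev= prev;
    prev= nullptr; next= nullptr;
  }
};

// Links one node between a head pointer and a sentinel, then unlinks it twice.
// Returns true if the second unlink() left both the list and the node unchanged.
bool second_unlink_is_noop()
{
  link_node sentinel;
  link_node thd;
  link_node *head= &thd;
  thd.prev= &head; thd.next= &sentinel;
  sentinel.prev= &thd.next;

  thd.unlink();                   // what threads.get() does during shutdown
  bool empty_after_first= (head == &sentinel && sentinel.prev == &head);

  thd.unlink();                   // what the 10.1 worker cleanup does afterwards
  return empty_after_first &&
         head == &sentinel && sentinel.prev == &head &&
         thd.prev == nullptr && thd.next == nullptr;
}
```

The sketch only shows that no assertion is involved on this path; it says nothing about whether other parts of the 10.1 shutdown sequence are safe.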

          (Comment above added by Sachin Setiya, sachin.setiya.007.)
          sachin.setiya.007 Sachin Setiya (Inactive) added a comment (edited) -

          MDEV-20821 and MDEV-22370 need to be backported; that will fix the issue.


          Patch branch: bb-10.2-sachin

          (Comment above added by Sachin Setiya, sachin.setiya.007.)
          Elkin Andrei Elkin added a comment -

          Asked questions and suggested to-dos.


          Patch updated: bb-10.2-sachin

          (Comment above added by Sachin Setiya, sachin.setiya.007.)
          Elkin Andrei Elkin added a comment -

          The patch looks good! Thanks.


          People

            sachin.setiya.007 Sachin Setiya (Inactive)
            sachin.setiya.007 Sachin Setiya (Inactive)