MariaDB Server / MDEV-22827

InnoDB: Failing assertion: purge_sys->n_stop == 0 in srv_purge_coordinator_suspend

Details

    Description

      10.2 e9dbbf112

      2020-06-07 18:38:29 139910667286272 [Note] InnoDB: FTS optimize thread exiting.
      2020-06-07 18:38:29 0x7f3f7cefa700  InnoDB: Assertion failure in file /home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc line 2787
      InnoDB: Failing assertion: purge_sys->n_stop == 0
       
      Thread 20 received signal SIGABRT, Aborted.
      [Switching to Thread 50568.50589]
      __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
      51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
      (rr) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
      #1  0x00007f3fd52b8801 in __GI_abort () at abort.c:79
      #2  0x000055bda96095df in ut_dbg_assertion_failed (expr=0x55bda9bb19b8 "purge_sys->n_stop == 0", file=0x55bda9bb0c48 "/home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc", 
          line=2787) at /home/mariadb/purge/10.2/storage/innobase/ut/ut0dbg.cc:60
      #3  0x000055bda95a7e51 in srv_purge_coordinator_suspend (slot=0x55bdaa305bc8 <srv_sys+328>, rseg_history_len=4) at /home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc:2787
      #4  0x000055bda95a8149 in srv_purge_coordinator_thread (arg=0x0) at /home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc:2864
      #5  0x00007f3fd5faf6db in start_thread (arg=0x7f3f7cefa700) at pthread_create.c:463
      #6  0x00007f3fd539988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      It appears to be a very fresh regression, but without a proper test case I can't guarantee that with 100% certainty.
      So far it has been observed only on 10.2; the first revision on which it was observed was 1bd5b75c73.

          Activity

            According to the rr replay trace, purge_sys->n_stop was incremented to 1 by row_quiesce_table_start() while FLUSH TABLES…FOR EXPORT was being executed. Shortly thereafter, shutdown was initiated. (In MDEV-16159, this field was renamed to purge_sys.m_paused.)
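The sequence seen in the trace can be pictured with a small model of the stop counter. This is a sketch under stated assumptions, not the InnoDB code: `PurgeModel`, `stop()`, `run()` and `replay_export_then_shutdown()` are hypothetical stand-ins for purge_sys->n_stop and for the pause taken by row_quiesce_table_start(). It assumes the pause is released only later (for example by UNLOCK TABLES after the export), so if shutdown begins in between, the counter is still 1.

```cpp
#include <cassert>

// Hypothetical model of the purge pause counter (stands in for purge_sys->n_stop).
enum class PurgeState { Run, Stop };

struct PurgeModel {
    unsigned n_stop = 0;                // number of outstanding stop requests
    PurgeState state = PurgeState::Run;

    // Modeled on the quiesce step of FLUSH TABLES...FOR EXPORT:
    // ask purge to pause. Requests can nest, so this is a counter.
    void stop() {
        ++n_stop;
        state = PurgeState::Stop;
    }

    // Modeled on releasing the export (assumed: a later UNLOCK TABLES).
    // Purge resumes only when the last stop request is released.
    void run() {
        assert(n_stop > 0);
        if (--n_stop == 0) {
            state = PurgeState::Run;
        }
    }
};

// Replays the order of events reported from the rr trace: a table is
// quiesced for export, and shutdown starts before the pause is released.
PurgeModel replay_export_then_shutdown() {
    PurgeModel purge;
    purge.stop();  // FLUSH TABLES...FOR EXPORT
    // Shutdown is initiated here; run() has not been called yet.
    return purge;
}
```

In this model, the state handed to the shutdown path has n_stop == 1 and purge in the stopped state; only a matching run() brings the counter back to 0.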

            That said, I believe that this is a regression caused by MDEV-22769, which introduced srv_shutdown_state = SRV_SHUTDOWN_INITIATED. I first created the patch against 10.5 and then backported it to 10.2; I should have been more careful when backporting. The code preceding the assertion obviously should have checked for srv_shutdown_state <= SRV_SHUTDOWN_INITIATED instead of srv_shutdown_state == SRV_SHUTDOWN_NONE:

            		stop = (srv_shutdown_state == SRV_SHUTDOWN_NONE
            			&& purge_sys->state == PURGE_STATE_STOP);
             
            		if (!stop) {
            			ut_a(purge_sys->n_stop == 0);
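To make the backporting mistake concrete, here is a minimal model of the two versions of the `stop` condition. The enums are simplified stand-ins, not the real InnoDB types: `SHUTDOWN_LATER_PHASE` is a hypothetical placeholder for the later shutdown phases, and the ordering NONE < INITIATED < later phases is assumed from the `<=` comparison suggested above. With shutdown INITIATED, purge stopped and n_stop == 1, the backported condition is false, so the coordinator takes the branch that asserts n_stop == 0; the suggested condition keeps treating purge as stopped.

```cpp
#include <cassert>

// Simplified stand-in for srv_shutdown_state (assumed ordering; the real
// enum has more phases, represented here by one hypothetical placeholder).
enum ShutdownState { SHUTDOWN_NONE = 0, SHUTDOWN_INITIATED, SHUTDOWN_LATER_PHASE };
enum PurgeMode { PURGE_RUN, PURGE_STOP };

// Condition as backported to 10.2: once shutdown is merely initiated, a
// pending stop request no longer counts as "stopped".
bool stop_as_backported(ShutdownState shutdown, PurgeMode purge) {
    return shutdown == SHUTDOWN_NONE && purge == PURGE_STOP;
}

// Condition suggested in the comment: keep honoring the stop request while
// shutdown is still in the INITIATED phase.
bool stop_suggested(ShutdownState shutdown, PurgeMode purge) {
    return shutdown <= SHUTDOWN_INITIATED && purge == PURGE_STOP;
}

// True if the check ut_a(n_stop == 0) in the !stop branch would fail:
// the coordinator is not in the "stop" branch, yet a stop request is pending.
bool assertion_would_fail(bool stop, unsigned n_stop) {
    return !stop && n_stop != 0;
}
```

Before shutdown (SHUTDOWN_NONE) both conditions agree; they differ only in the INITIATED window, which is exactly where the reported crash happened.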
            

            The assertion does not exist in 10.3 or 10.4, but the logic seems to be wrong there as well. In MDEV-16264 this check was moved to purge_coordinator_timer_callback() and the reference to srv_shutdown_state was removed. That explains why the change was not needed in 10.5.

            Comment by Marko Mäkelä (marko).

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
