  1. MariaDB Server
  2. MDEV-13472

rpl.rpl_semi_sync_wait_point crashes because of thd_destructor_proxy

Details

    Description

      rpl.rpl_semi_sync_wait_point crashes because thd_destructor_proxy kills the InnoDB
      service threads before all slave threads have ended.

      What happens is that the proxy detects that no transactions are active and starts
      srv_shutdown_bg_undo_sources(), but fails to take into account that new transactions
      can still start, especially from the slave threads but also from other threads. In addition,
      no mutex is held when checking for active transactions, so the check is not safe.
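
The race described above is a check-then-act problem: the check for "no active transactions" and the decision to shut down are not made atomically. The following is a minimal C++ sketch of one way to close that window, assuming a single mutex guards both the transaction counter and the shutdown flag. The class TrxGate and its member names are hypothetical and do not exist in the MariaDB source; this is an illustration, not the fix that was applied.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical gate: not part of InnoDB, names invented for illustration.
class TrxGate {
public:
    // Try to start a transaction. Fails once shutdown has begun,
    // so no new transaction can slip in after the idle check.
    bool try_begin() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (shutting_down_) return false;
        ++active_;
        return true;
    }

    // Mark a previously started transaction as finished.
    void end() {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(active_ > 0);
        --active_;
    }

    // Check "no active transactions" and close the gate in one
    // critical section. Returns false if transactions are still active.
    bool try_begin_shutdown() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (active_ != 0) return false;
        shutting_down_ = true;
        return true;
    }

private:
    std::mutex mutex_;
    unsigned active_ = 0;
    bool shutting_down_ = false;
};
```

Because try_begin() and try_begin_shutdown() take the same mutex, a transaction either starts before the idle check (and blocks shutdown) or is refused after it.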

      The suggestion is to mark the InnoDB server threads and, in close_connection, first shut down all other threads, including events, and only then inform the destructor proxy and the other InnoDB threads that they can safely shut down.
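
The ordering proposed above can be sketched as a two-phase shutdown: every thread not marked as an InnoDB thread is stopped first, and the marked threads are notified last. The C++ below only computes that order for a list of thread descriptors. ThreadKind, ThreadInfo and shutdown_order() are hypothetical names introduced for this illustration, and the thread names in the usage note are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical classification of server threads.
enum class ThreadKind { User, Slave, Event, InnodbService };

struct ThreadInfo {
    std::string name;
    ThreadKind kind;
};

// Returns thread names in the order they should be shut down:
// all non-InnoDB threads first, marked InnoDB threads last.
// Relative order within each group is preserved.
std::vector<std::string> shutdown_order(std::vector<ThreadInfo> threads) {
    std::stable_partition(
        threads.begin(), threads.end(),
        [](const ThreadInfo& t) { return t.kind != ThreadKind::InnodbService; });

    std::vector<std::string> order;
    order.reserve(threads.size());
    for (const ThreadInfo& t : threads) {
        order.push_back(t.name);
    }
    assert(order.size() == threads.size());
    return order;
}
```

For an input of purge (InnoDB), slave_sql (slave), destructor_proxy (InnoDB) and event_scheduler (event), the function yields slave_sql, event_scheduler, purge, destructor_proxy, so the slave and event threads have ended before any InnoDB thread is told to stop.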

          Activity

            There is a much simpler solution: relax the failing InnoDB debug assertion that I made too strict.

            diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
            index c046c8b7b52..0f7b36266bc 100644
            --- a/storage/innobase/trx/trx0purge.cc
            +++ b/storage/innobase/trx/trx0purge.cc
            @@ -293,14 +293,16 @@ trx_purge_add_update_undo_to_history(
             
             	After the purge thread has been given permission to exit,
             	in fast shutdown, we may roll back transactions (trx->undo_no==0)
            -	in THD::cleanup() invoked from unlink_thd(). */
            +	in THD::cleanup() invoked from unlink_thd(), and we may also
            +	continue to execute user transactions. */
             	ut_ad(srv_undo_sources
             	      || ((srv_startup_is_before_trx_rollback_phase
             		   || trx_rollback_or_clean_is_active)
             		  && purge_sys->state == PURGE_STATE_INIT)
             	      || (srv_force_recovery >= SRV_FORCE_NO_BACKGROUND
             		  && purge_sys->state == PURGE_STATE_DISABLED)
            -	      || (trx->undo_no == 0 && srv_fast_shutdown));
            +	      || ((trx->undo_no == 0 || trx->in_mysql_trx_list)
            +		  && srv_fast_shutdown));
             
             	/* Add the log as the first in the history list */
             	flst_add_first(rseg_header + TRX_RSEG_HISTORY,
            

            I am sorry that this did not occur to me until now. It takes time to ‘populate the cache’ of my brain after a long vacation.

            Marko Mäkelä (marko) added the comment above.
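
To make the effect of the relaxed assertion easier to see, the condition before and after the patch can be written as two standalone predicates. This is a sketch only: ShutdownState and TrxView are hypothetical structs that stand in for the InnoDB globals and trx_t fields named in the diff, and the value 2 for SRV_FORCE_NO_BACKGROUND is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the globals and fields used in the diff.
enum PurgeState { PURGE_STATE_INIT, PURGE_STATE_RUN, PURGE_STATE_DISABLED };

struct ShutdownState {
    bool undo_sources;           // srv_undo_sources
    bool before_rollback_phase;  // srv_startup_is_before_trx_rollback_phase
    bool rollback_active;        // trx_rollback_or_clean_is_active
    PurgeState purge_state;      // purge_sys->state
    unsigned force_recovery;     // srv_force_recovery
    unsigned fast_shutdown;      // srv_fast_shutdown
};

struct TrxView {
    std::uint64_t undo_no;   // trx->undo_no
    bool in_mysql_trx_list;  // trx->in_mysql_trx_list
};

constexpr unsigned FORCE_NO_BACKGROUND = 2;  // assumed SRV_FORCE_NO_BACKGROUND

// Parts of the condition that the patch leaves unchanged.
static bool common_part(const ShutdownState& s) {
    return s.undo_sources
        || ((s.before_rollback_phase || s.rollback_active)
            && s.purge_state == PURGE_STATE_INIT)
        || (s.force_recovery >= FORCE_NO_BACKGROUND
            && s.purge_state == PURGE_STATE_DISABLED);
}

// Condition before the patch.
bool old_condition(const ShutdownState& s, const TrxView& t) {
    return common_part(s) || (t.undo_no == 0 && s.fast_shutdown != 0);
}

// Condition after the patch: user transactions are also accepted
// during fast shutdown.
bool relaxed_condition(const ShutdownState& s, const TrxView& t) {
    return common_part(s)
        || ((t.undo_no == 0 || t.in_mysql_trx_list) && s.fast_shutdown != 0);
}

// Mirrors the role of ut_ad(): aborts in debug builds if violated.
void check_before_history_add(const ShutdownState& s, const TrxView& t) {
    assert(relaxed_condition(s, t));
}
```

With the undo sources already shut down, purge running, srv_fast_shutdown nonzero and a user transaction that has written undo records (undo_no > 0, in_mysql_trx_list set), old_condition() is false while relaxed_condition() is true. With srv_fast_shutdown equal to 0 both remain false, so the relaxation does not cover that case.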

            As serg pointed out and I noted in my tentative fix, the above assertion relaxation may be insufficient: for innodb_fast_shutdown=0 we may need the solution that monty proposed.

            I would strongly advise against making innodb_fast_shutdown=2 any slower.
            It is perfectly OK to make innodb_fast_shutdown=0 as slow as it needs to be, but not the fast or crash-like-super-fast shutdown.

            Marko Mäkelä (marko) added the comment above.

            The thd_destructor_proxy() was introduced for MDEV-5800 (indexed virtual columns) to ensure proper shutdown. In MDEV-13039 (innodb_fast_shutdown=0 may fail to purge all undo log), the predicate srv_purge_should_exit() was changed. That fix may have introduced this bug.

            Marko Mäkelä (marko) added the comment above.

            People

              serg Sergei Golubchik
              monty Michael Widenius