[MDEV-13472] rpl.rpl_semi_sync_wait_point crashes because of thd_destructor_proxy - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.3(EOL)
Fix Version/s: 10.2.8
Component/s: Storage Engine - InnoDB
Labels: None
Environment: bb-10.3-monty

Description

rpl.rpl_semi_sync_wait_point crashes because thd_destructor_proxy kills the InnoDB
service threads before all slave threads have ended.

What happens is that the proxy detects that no transactions are active and starts
srv_shutdown_bg_undo_sources(), but it fails to take into account that new transactions
can still start, especially from the slave but also from other threads. In addition, no
mutex is held when checking for active transactions, so the check is not safe.

The suggestion is to mark the InnoDB server threads and, in close_connection, first shut down all other threads, including events, and only then inform the destructor proxy and the other InnoDB threads that they can safely be shut down.
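The two problems described above (new work arriving after the "no active transactions" check, and the check being made without a mutex) can be illustrated with a minimal sketch. This is not MariaDB code: `ShutdownGate` and all of its members are hypothetical names invented for illustration. It assumes that every non-service thread registers itself before doing transactional work.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical illustration only; none of these names exist in MariaDB.
// Non-service threads (clients, slave threads, events) call enter()/leave().
// The shutdown path calls try_close() until it returns true, and only then
// tells the InnoDB service threads that they may exit.
class ShutdownGate {
public:
    // Register a worker. Returns false once shutdown has begun, so that
    // no new work can start after the "is anything active?" check.
    bool enter() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (closing_) return false;
        ++active_;
        return true;
    }

    // Unregister a worker that previously succeeded in enter().
    void leave() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(active_ > 0);
        --active_;
    }

    // Refuse new workers and report whether all existing ones are gone.
    // The flag update and the counter check happen under the same mutex,
    // so a worker cannot slip in between them.
    bool try_close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closing_ = true;
        return active_ == 0;
    }

private:
    std::mutex mutex_;
    bool closing_ = false;
    unsigned active_ = 0;
};
```

In this sketch the shutdown path would retry `try_close()` (or wait on a condition variable, omitted here for brevity) until it returns true, and signal the service threads afterwards. How the actual fix orders these steps is not shown in this ticket.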

Issue Links

relates to

MDEV-5800 indexes on virtual (not materialized) columns (Closed)

MDEV-13039 innodb_fast_shutdown=0 may fail to purge all undo logs (Closed)

MDEV-14080 InnoDB shutdown sometimes hangs (Closed)

Activity

Marko Mäkelä added a comment - 2017-08-08 16:52

There is a much simpler solution: relax the failing InnoDB debug assertion that I made too strict.

diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index c046c8b7b52..0f7b36266bc 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -293,14 +293,16 @@ trx_purge_add_update_undo_to_history(
 	After the purge thread has been given permission to exit,
 	in fast shutdown, we may roll back transactions (trx->undo_no==0)
-	in THD::cleanup() invoked from unlink_thd(). */
+	in THD::cleanup() invoked from unlink_thd(), and we may also
+	continue to execute user transactions. */
 	ut_ad(srv_undo_sources
 	      || ((srv_startup_is_before_trx_rollback_phase
 		   || trx_rollback_or_clean_is_active)
 		  && purge_sys->state == PURGE_STATE_INIT)
 	      || (srv_force_recovery >= SRV_FORCE_NO_BACKGROUND
 		  && purge_sys->state == PURGE_STATE_DISABLED)
-	      || (trx->undo_no == 0 && srv_fast_shutdown));
+	      || ((trx->undo_no == 0 || trx->in_mysql_trx_list)
+		  && srv_fast_shutdown));
 	/* Add the log as the first in the history list */
 	flst_add_first(rseg_header + TRX_RSEG_HISTORY,

I am sorry that this did not occur to me until now. It takes time to ‘populate the cache’ of my brain after a long vacation.
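To make the effect of the relaxed assertion concrete, the following sketch models the old and new conditions as plain boolean functions. It is a simplification, not the InnoDB code: `UndoHistoryState` and its field names are hypothetical, the first three disjuncts of the `ut_ad()` are collapsed into single flags, and `srv_fast_shutdown` is modeled as a boolean that is true when the setting is nonzero.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the state read by the ut_ad()
// in trx_purge_add_update_undo_to_history(). Field names are illustrative.
struct UndoHistoryState {
    bool undo_sources;             // models srv_undo_sources
    bool startup_and_purge_init;   // models the 2nd disjunct as a whole
    bool forced_and_purge_off;     // models the 3rd disjunct as a whole
    bool undo_no_is_zero;          // models trx->undo_no == 0
    bool in_mysql_trx_list;        // models trx->in_mysql_trx_list
    bool fast_shutdown;            // models srv_fast_shutdown != 0
};

// Condition before the patch: during fast shutdown, only transactions
// without undo records were expected to reach this point.
bool old_condition(const UndoHistoryState& s) {
    return s.undo_sources || s.startup_and_purge_init ||
           s.forced_and_purge_off ||
           (s.undo_no_is_zero && s.fast_shutdown);
}

// Condition after the patch: user transactions that are still running
// (present in the MySQL transaction list) are tolerated as well.
bool relaxed_condition(const UndoHistoryState& s) {
    return s.undo_sources || s.startup_and_purge_init ||
           s.forced_and_purge_off ||
           ((s.undo_no_is_zero || s.in_mysql_trx_list) && s.fast_shutdown);
}
```

For example, a user transaction that has written undo records and commits during fast shutdown, after the undo sources have been stopped, fails `old_condition` but passes `relaxed_condition`. With `fast_shutdown` false, neither version passes for that transaction, which matches the later remark that the relaxation may be insufficient for innodb_fast_shutdown=0.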

Marko Mäkelä added a comment - 2017-08-08 17:15

As serg pointed out and I noted in my tentative fix, the above assertion relaxation may be insufficient: for innodb_fast_shutdown=0 we may need the solution that monty proposed.

I would strongly advise against making innodb_fast_shutdown=2 any slower.
It is perfectly OK to make innodb_fast_shutdown=0 as slow as it needs to be, but not the fast or crash-like-super-fast shutdown.

Marko Mäkelä added a comment - 2017-08-11 19:16

The thd_destructor_proxy() was introduced for MDEV-5800 (indexed virtual columns) to ensure proper shutdown. In MDEV-13039 (innodb_fast_shutdown=0 may fail to purge all undo log), the predicate srv_purge_should_exit() was changed. That fix may have introduced this bug.

People

Assignee: Sergei Golubchik

Reporter: Michael Widenius

Votes: 0

Watchers: 4

Dates

Created: 2017-08-08 14:54

Updated: 2017-11-30 16:01

Resolved: 2017-08-14 19:53

MariaDB Server