[MDEV-13472] rpl.rpl_semi_sync_wait_point crashes because of thd_destructor_proxy Created: 2017-08-08  Updated: 2017-11-30  Resolved: 2017-08-14

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3
Fix Version/s: 10.2.8

Type: Bug Priority: Critical
Reporter: Michael Widenius Assignee: Sergei Golubchik
Resolution: Fixed Votes: 0
Labels: None
Environment:

bb-10.3-monty


Issue Links:
Relates
relates to MDEV-5800 indexes on virtual (not materialized)... Closed
relates to MDEV-13039 innodb_fast_shutdown=0 may fail to pu... Closed
relates to MDEV-14080 InnoDB shutdown sometimes hangs Closed

 Description   

rpl.rpl_semi_sync_wait_point crashes because thd_destructor_proxy kills the InnoDB
service threads before all slave threads have ended.

What happens is that the proxy detects that no transactions are active and starts
srv_shutdown_bg_undo_sources(), but fails to take into account that new transactions
can still start, especially by slave threads but also by other threads. In addition, no
mutex is held when checking for active transactions, so the check is not safe.

The suggestion is to mark InnoDB server threads and, in close_connection, first shut down all other threads, including events, and only then inform the destructor proxy and the other InnoDB threads that they can now safely be shut down.
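
The ordering suggested above can be sketched as a small two-step coordinator. This is only an illustration, not MariaDB code: ShutdownCoordinator and all of its members are hypothetical names. It assumes that every thread that can start a transaction registers through one mutex-protected counter, which is exactly the piece the description says is missing.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical coordinator; none of these names exist in MariaDB or InnoDB.
class ShutdownCoordinator {
public:
    // A user, slave, or event thread calls this before starting a transaction.
    // The check and the counter update happen under one mutex, so a
    // transaction cannot slip in after shutdown has observed "idle".
    bool try_begin_transaction() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (shutting_down_) return false;
        ++active_transactions_;
        return true;
    }

    void end_transaction() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (--active_transactions_ == 0) idle_.notify_all();
    }

    // Step 1: refuse new transactions and wait for the running ones.
    // Step 2: only then allow the background threads to exit.
    void shutdown() {
        std::unique_lock<std::mutex> lock(mutex_);
        shutting_down_ = true;
        idle_.wait(lock, [this] { return active_transactions_ == 0; });
        background_may_exit_ = true;
        background_.notify_all();
    }

    // A background (purge-like) thread blocks here until step 2.
    void wait_for_exit_permission() {
        std::unique_lock<std::mutex> lock(mutex_);
        background_.wait(lock, [this] { return background_may_exit_; });
    }

    bool background_may_exit() {
        std::lock_guard<std::mutex> guard(mutex_);
        return background_may_exit_;
    }

private:
    std::mutex mutex_;
    std::condition_variable idle_;        // signalled when the counter hits 0
    std::condition_variable background_;  // signalled when step 2 is reached
    int active_transactions_ = 0;
    bool shutting_down_ = false;
    bool background_may_exit_ = false;
};
```

In this sketch a background thread would call wait_for_exit_permission() instead of deciding on its own that the server is idle, and a transaction that arrives after shutdown() has started is refused rather than racing with the teardown.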



 Comments   
Comment by Marko Mäkelä [ 2017-08-08 ]

There is a much simpler solution: relax the failing InnoDB debug assertion that I made too strict.

diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index c046c8b7b52..0f7b36266bc 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -293,14 +293,16 @@ trx_purge_add_update_undo_to_history(
 
 	After the purge thread has been given permission to exit,
 	in fast shutdown, we may roll back transactions (trx->undo_no==0)
-	in THD::cleanup() invoked from unlink_thd(). */
+	in THD::cleanup() invoked from unlink_thd(), and we may also
+	continue to execute user transactions. */
 	ut_ad(srv_undo_sources
 	      || ((srv_startup_is_before_trx_rollback_phase
 		   || trx_rollback_or_clean_is_active)
 		  && purge_sys->state == PURGE_STATE_INIT)
 	      || (srv_force_recovery >= SRV_FORCE_NO_BACKGROUND
 		  && purge_sys->state == PURGE_STATE_DISABLED)
-	      || (trx->undo_no == 0 && srv_fast_shutdown));
+	      || ((trx->undo_no == 0 || trx->in_mysql_trx_list)
+		  && srv_fast_shutdown));
 
 	/* Add the log as the first in the history list */
 	flst_add_first(rseg_header + TRX_RSEG_HISTORY,
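
To make the effect of the relaxation easier to see, the condition before and after the patch can be restated as two standalone predicates. This is a sketch under stated assumptions: AssertInputs, common_clauses, old_condition and relaxed_condition are hypothetical names, the struct fields merely mirror the globals and trx fields used in the diff, and the numeric value of SRV_FORCE_NO_BACKGROUND is assumed for illustration.

```cpp
#include <cstdint>

// Simplified stand-ins for the state read by the assertion in the diff.
enum PurgeState { PURGE_STATE_INIT, PURGE_STATE_RUN, PURGE_STATE_DISABLED };

// Assumed value for this sketch; the real definition lives in the InnoDB headers.
const int SRV_FORCE_NO_BACKGROUND = 2;

struct AssertInputs {
    bool srv_undo_sources;
    bool before_rollback_phase;     // srv_startup_is_before_trx_rollback_phase
    bool rollback_or_clean_active;  // trx_rollback_or_clean_is_active
    PurgeState purge_state;         // purge_sys->state
    int srv_force_recovery;
    std::uint64_t undo_no;          // trx->undo_no
    bool in_mysql_trx_list;         // trx->in_mysql_trx_list
    unsigned srv_fast_shutdown;
};

// The three clauses that the patch leaves unchanged.
static bool common_clauses(const AssertInputs& in) {
    return in.srv_undo_sources
        || ((in.before_rollback_phase || in.rollback_or_clean_active)
            && in.purge_state == PURGE_STATE_INIT)
        || (in.srv_force_recovery >= SRV_FORCE_NO_BACKGROUND
            && in.purge_state == PURGE_STATE_DISABLED);
}

// Condition before the patch: only empty transactions pass during fast shutdown.
bool old_condition(const AssertInputs& in) {
    return common_clauses(in)
        || (in.undo_no == 0 && in.srv_fast_shutdown);
}

// Condition after the patch: user transactions also pass during fast shutdown.
bool relaxed_condition(const AssertInputs& in) {
    return common_clauses(in)
        || ((in.undo_no == 0 || in.in_mysql_trx_list) && in.srv_fast_shutdown);
}
```

For a user transaction that has written undo records (undo_no != 0, in_mysql_trx_list set) after the undo sources were shut down, old_condition is false and relaxed_condition is true when srv_fast_shutdown is nonzero. With srv_fast_shutdown equal to 0, both remain false.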

I am sorry that this did not occur to me until now. It takes time to ‘populate the cache’ of my brain after a long vacation.

Comment by Marko Mäkelä [ 2017-08-08 ]

As serg pointed out and I noted in my tentative fix, the above assertion relaxation may be insufficient: for innodb_fast_shutdown=0 we may need the solution that monty proposed.

I would strongly advise against making innodb_fast_shutdown=2 any slower.
It is perfectly OK to make innodb_fast_shutdown=0 as slow as it needs to be, but not the fast or crash-like-super-fast shutdown.

Comment by Marko Mäkelä [ 2017-08-11 ]

The thd_destructor_proxy() was introduced for MDEV-5800 (indexed virtual columns) to ensure proper shutdown. In MDEV-13039 (innodb_fast_shutdown=0 may fail to purge all undo log), the predicate srv_purge_should_exit() was changed. That fix may have introduced this bug.
