[MDEV-22827] InnoDB: Failing assertion: purge_sys->n_stop == 0 in srv_purge_coordinator_suspend Created: 2020-06-07  Updated: 2020-08-11  Resolved: 2020-06-08

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2
Fix Version/s: 10.2.33, 10.3.24, 10.4.14

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: not-10.5, regression, rr-profile

Issue Links:
Problem/Incident
is caused by MDEV-22769 Long semaphore wait upon DROP DATABAS... Closed

 Description   

10.2 e9dbbf112

2020-06-07 18:38:29 139910667286272 [Note] InnoDB: FTS optimize thread exiting.
2020-06-07 18:38:29 0x7f3f7cefa700  InnoDB: Assertion failure in file /home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc line 2787
InnoDB: Failing assertion: purge_sys->n_stop == 0
 
Thread 20 received signal SIGABRT, Aborted.
[Switching to Thread 50568.50589]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f3fd52b8801 in __GI_abort () at abort.c:79
#2  0x000055bda96095df in ut_dbg_assertion_failed (expr=0x55bda9bb19b8 "purge_sys->n_stop == 0", file=0x55bda9bb0c48 "/home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc", 
    line=2787) at /home/mariadb/purge/10.2/storage/innobase/ut/ut0dbg.cc:60
#3  0x000055bda95a7e51 in srv_purge_coordinator_suspend (slot=0x55bdaa305bc8 <srv_sys+328>, rseg_history_len=4) at /home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc:2787
#4  0x000055bda95a8149 in srv_purge_coordinator_thread (arg=0x0) at /home/mariadb/purge/10.2/storage/innobase/srv/srv0srv.cc:2864
#5  0x00007f3fd5faf6db in start_thread (arg=0x7f3f7cefa700) at pthread_create.c:463
#6  0x00007f3fd539988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It appears to be a very recent regression, but without a proper test case I can't confirm that with certainty.
So far it has been observed on 10.2; the first revision where it was seen is 1bd5b75c73.



 Comments   
Comment by Marko Mäkelä [ 2020-06-08 ]

According to the rr replay trace, purge_sys->n_stop was incremented to 1 by row_quiesce_table_start() while executing FLUSH TABLES…FOR EXPORT. Shortly thereafter, shutdown was initiated. In MDEV-16159, the field was renamed to purge_sys.m_paused.

That said, I believe that this is a regression that was caused by MDEV-22769’s introduction of srv_shutdown_state = SRV_SHUTDOWN_INITIATED. I first created the patch against 10.5 and then backported it to 10.2. I should have been more careful when backporting. The preceding code obviously should have checked for srv_shutdown_state <= SRV_SHUTDOWN_INITIATED:

		stop = (srv_shutdown_state == SRV_SHUTDOWN_NONE
			&& purge_sys->state == PURGE_STATE_STOP);
 
		if (!stop) {
			ut_a(purge_sys->n_stop == 0);
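
The effect of the two conditions can be sketched with a small standalone model. This is only an illustration, not the actual patch: the types below are simplified stand-ins, the ordering of the enum values (SRV_SHUTDOWN_NONE before SRV_SHUTDOWN_INITIATED before later phases) is assumed, and the names purge_model, stop_before, stop_after and assertion_would_fire are hypothetical.

```cpp
#include <cassert>

// Simplified stand-ins for the InnoDB types; the value ordering is an assumption.
enum srv_shutdown_t {
	SRV_SHUTDOWN_NONE = 0,
	SRV_SHUTDOWN_INITIATED,
	SRV_SHUTDOWN_CLEANUP
};

enum purge_state_t { PURGE_STATE_RUN, PURGE_STATE_STOP };

// Hypothetical model of the two purge_sys fields involved.
struct purge_model {
	purge_state_t	state;
	unsigned	n_stop;
};

// Condition as quoted in the comment (after the MDEV-22769 backport).
inline bool stop_before(srv_shutdown_t shutdown, const purge_model& p)
{
	return shutdown == SRV_SHUTDOWN_NONE && p.state == PURGE_STATE_STOP;
}

// Condition the comment says should have been used instead.
inline bool stop_after(srv_shutdown_t shutdown, const purge_model& p)
{
	return shutdown <= SRV_SHUTDOWN_INITIATED && p.state == PURGE_STATE_STOP;
}

// True when the code would reach ut_a(purge_sys->n_stop == 0) and fail it.
inline bool assertion_would_fire(bool stop, const purge_model& p)
{
	return !stop && p.n_stop != 0;
}
```

With a purge that was stopped by FLUSH TABLES…FOR EXPORT (state PURGE_STATE_STOP, n_stop 1) and shutdown just initiated, stop_before() yields false and the modelled assertion fires, while stop_after() yields true and the assertion is not reached.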

The assertion does not exist in 10.3 or 10.4, but the logic seems to be wrong there as well. In MDEV-16264 this check was moved to purge_coordinator_timer_callback() and the reference to srv_shutdown_state was removed. That explains why the change was not needed in 10.5.
