[MDEV-21674] purge_sys.stop() no longer waits for purge workers to complete Created: 2020-02-06  Updated: 2020-02-07  Resolved: 2020-02-07

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.0
Fix Version/s: 10.5.1

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Attachments: 10.3-Test-Remove-FlushObserver-and-innodb_log_optimize_dd.patch, 10.5-Test-Remove-innodb_log_optimize_ddl-and-FlushObserve.patch
Issue Links:
Blocks
blocks MDEV-12353 Efficient InnoDB redo log record format Closed
Problem/Incident
is caused by MDEV-16264 Implement a common work queue for Inn... Closed

 Description   

As noted at the start of 10.5-Test-Remove-innodb_log_optimize_ddl-and-FlushObserve.patch (which is a minimal port of something that we want to do in MDEV-12353), purge_sys.stop() was broken in MDEV-16264. It no longer waits for the purge worker tasks to finish processing the current event.

There likely is a race condition, but it is not prominent until the FlushObserver is removed, as the patch does. For the record, I also ported the change to 10.3 and did not observe any crash in the two tests that exercise FLUSH TABLES…FOR EXPORT.
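To make the broken contract concrete: a minimal, self-contained sketch (not MariaDB code; all names here are invented for illustration) of what "stop waits for the purge worker tasks" means. stop() must not return while any worker is still inside a task, which is the wait that was lost in MDEV-16264:

```cpp
// Hypothetical sketch of the stop()/worker handshake that purge_sys.stop()
// is expected to honour. The names purge_sys_sketch, worker_execute() etc.
// are assumptions for illustration, not the real MariaDB identifiers.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct purge_sys_sketch {
  std::mutex m;
  std::condition_variable cv;
  bool paused = false;     // set by stop()
  int active_workers = 0;  // workers currently executing a task

  // A worker executes one batch unless paused; returns false once paused.
  bool worker_execute() {
    {
      std::unique_lock<std::mutex> lk(m);
      if (paused) return false;  // refuse new work while stopped
      ++active_workers;
    }
    // ... pretend to purge some undo log records ...
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    {
      std::lock_guard<std::mutex> lk(m);
      if (--active_workers == 0) cv.notify_all();
    }
    return true;
  }

  // stop() must block until every in-flight worker task has finished;
  // this is the wait that purge_sys.stop() no longer performs.
  void stop() {
    std::unique_lock<std::mutex> lk(m);
    paused = true;
    cv.wait(lk, [this] { return active_workers == 0; });
  }
};

// Returns the number of workers still inside a task after stop() returned;
// with a correct stop() this must be 0.
int run_demo() {
  purge_sys_sketch p;
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; i++)
    workers.emplace_back([&p] { while (p.worker_execute()) {} });
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  p.stop();
  int still_running;
  {
    std::lock_guard<std::mutex> lk(p.m);
    still_running = p.active_workers;
  }
  for (auto& t : workers) t.join();
  return still_running;
}
```

Without the cv.wait() in stop(), a worker could still be flushing or touching pages after the caller believes purge is stopped, which matches the buf_flush_dirty_pages assertion below.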

The crash usually manifests as the following assertion failure:

10.5ish

2020-02-06 10:01:33 4 [Note] InnoDB: Sync to disk of `test`.`t1` started.
2020-02-06 10:01:33 4 [Note] InnoDB: Stopping purge
mysqld: /mariadb/10.5-MDEV-12353bis/storage/innobase/buf/buf0lru.cc:657: void buf_flush_dirty_pages(buf_pool_t *, ulint, bool, ulint): Assertion `first || buf_pool_get_dirty_pages_count(buf_pool, id) == 0' failed.
#7  0x000055bbae5a916a in buf_flush_dirty_pages (buf_pool=0x55bbb1c269d0, id=5, flush=true, first=0) at /mariadb/10.5-MDEV-12353bis/storage/innobase/buf/buf0lru.cc:656
#8  buf_LRU_flush_or_remove_pages (id=5, flush=true, first=0) at /mariadb/10.5-MDEV-12353bis/storage/innobase/buf/buf0lru.cc:670
#9  0x000055bbae491d14 in row_quiesce_table_start (table=<optimized out>, trx=0x7fd726c62138) at /mariadb/10.5-MDEV-12353bis/storage/innobase/row/row0quiesce.cc:538



 Comments   
Comment by Marko Mäkelä [ 2020-02-06 ]

If you cannot reproduce the problem by applying 10.5-Test-Remove-innodb_log_optimize_ddl-and-FlushObserve.patch to the problematic commit and running the command below, the following debug patch might be helpful:

./mtr --parallel=auto --repeat=10 encryption.innodb-discard-import{,,,,}  innodb.innodb-wl5522{,,,}

diff --git a/storage/innobase/srv/srv0srv.cc b/storage/innobase/srv/srv0srv.cc
index c4e20c973a0..1e00e1f3fbd 100644
--- a/storage/innobase/srv/srv0srv.cc
+++ b/storage/innobase/srv/srv0srv.cc
@@ -2190,7 +2190,7 @@ void purge_worker_callback(void*)
 	ut_ad(srv_force_recovery < SRV_FORCE_NO_BACKGROUND);
 	void* ctx;
 	THD* thd = acquire_thd(&ctx);
-	while (srv_task_execute()){}
+	while (srv_task_execute()) { ut_ad(purge_sys.running()); }
 	release_thd(thd,ctx);
 }
 

Note: I am not yet sure whether that assertion is valid. If it turns out to be, I think it should be part of the fix.
Also, while fixing this, I would suggest declaring and defining

bool purge_sys_t::running() const

(with the const qualifier).

Generated at Thu Feb 08 09:08:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.