[MDEV-24054] Assertion `in_LRU_list' failed in buf_page_t::ready_for_flush Created: 2020-10-29  Updated: 2020-10-30  Resolved: 2020-10-30

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.4
Fix Version/s: 10.5.7

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: regression

Issue Links:
Problem/Incident
is caused by MDEV-15053 Reduce buf_pool_t::mutex contention Closed
Relates
relates to MDEV-23909 innodb_flush_neighbors=2 is treated l... Closed

 Description   

10.5 6d3356c1

mysqld: /data/src/10.5/storage/innobase/include/buf0buf.h:2208: bool buf_page_t::ready_for_flush() const: Assertion `in_LRU_list' failed.
201030  9:23:51 [ERROR] mysqld got signal 6 ;
 
#7  0x00007f330f779f36 in __GI___assert_fail (assertion=0x55cb13596309 "in_LRU_list", file=0x55cb13595a98 "/data/src/10.5/storage/innobase/include/buf0buf.h", line=2208, function=0x55cb135962e0 "bool buf_page_t::ready_for_flush() const") at assert.c:101
#8  0x000055cb12ee7150 in buf_page_t::ready_for_flush (this=0x55cb13b8e330 <buf_pool+5424>) at /data/src/10.5/storage/innobase/include/buf0buf.h:2208
#9  0x000055cb12ee1054 in buf_flush_try_neighbors (space=0x7f32b80bfcc8, page_id=..., contiguous=false, lru=false, n_flushed=214, n_to_flush=376) at /data/src/10.5/storage/innobase/buf/buf0flu.cc:1109
#10 0x000055cb12ee258c in buf_do_flush_list_batch (max_n=376, lsn=18446744073709551615) at /data/src/10.5/storage/innobase/buf/buf0flu.cc:1426
#11 0x000055cb12ee2d0f in buf_flush_lists (max_n=376, lsn=18446744073709551615) at /data/src/10.5/storage/innobase/buf/buf0flu.cc:1540
#12 0x000055cb12ee59a7 in buf_flush_page_cleaner () at /data/src/10.5/storage/innobase/buf/buf0flu.cc:2125
#13 0x00007f330fc91609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f330f865293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Below is the test I use to reproduce it. However, success seems to depend heavily on the machine: I can reproduce it fairly easily on two very different machines, but not at all on two others, even after dozens of runs.
Unfortunately, the two machines where I can reproduce it cannot run rr, so I have no information yet on whether it can even be reproduced under rr.

git clone https://github.com/MariaDB/randgen --branch mdev24054 rqg-mdev24054
cd rqg-mdev24054
perl ./runall-trials.pl --basedir1=/data/bld/10.5 --grammar=conf/mariadb/oltp-transactional.yy --gendata-advanced --duration=350 --seed=1603918722 --reporters=Backtrace,ErrorLog,Deadlock --partitions --views --engine=InnoDB --filter=conf/mariadb/10.4-combo-filter.ff --redefine=conf/mariadb/bulk_insert.yy --redefine=conf/mariadb/alter_table.yy --redefine=conf/mariadb/sp.yy --redefine=conf/mariadb/modules/admin.yy --redefine=conf/mariadb/modules/alter_table_columns.yy --redefine=conf/mariadb/modules/alter_table_indexes.yy --redefine=conf/mariadb/modules/foreign_keys.yy --redefine=conf/mariadb/modules/locks.yy --mysqld=--server-id=111 --mysqld=--log_output=FILE --mysqld=--max-statement-time=20 --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --threads=2 --redefine=conf/mariadb/modules/dynamic_variables.yy --redefine=conf/mariadb/sequences.yy --redefine=conf/mariadb/versioning.yy --redefine=conf/mariadb/modules/optimizer_trace.yy --redefine=conf/mariadb/modules/parser_changes.yy --validators=TransformerNoComparator --transformers=ExecuteAsDeleteReturning,ExecuteAsInsertSelect,ExecuteAsUpdateDelete,EnableOptimizations,ExecuteAsInsertReturning --vardir1=/dev/shm/var_mdev24054 --trials=10 --output="buf_page_t::ready_for_flush"

Remember to set the correct basedir path.



 Comments   
Comment by Marko Mäkelä [ 2020-10-30 ]

It should be very hard to reproduce this bug. I got the crucial hint from the stack trace from the core dump:

10.5 f6549e95448e5cce52336361f5a59d1edcdac46a

#8  0x0000556310587fec in buf_page_t::ready_for_flush (
    this=0x55631141cd30 <buf_pool+5424>)
    at /home/elenst/src/10.5/storage/innobase/include/buf0buf.h:2208

This actually tells us that this == &buf_pool.watch[32], that is, a sentinel page rather than a regular buffer page descriptor. Looking at the code, the following should fix it:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index cc2f4c36fa4..ac6c45deeab 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1106,7 +1106,8 @@ static ulint buf_flush_try_neighbors(fil_space_t *space,
       because the flushed blocks are soon freed */
       if (!lru || id == page_id || bpage->is_old())
       {
-        if (bpage->ready_for_flush() && buf_flush_page(bpage, lru, space))
+        if (!buf_pool.watch_is_sentinel(*bpage) &&
+            bpage->ready_for_flush() && buf_flush_page(bpage, lru, space))
         {
           ++count;
           continue;

This appears to be a regression caused by MDEV-15053 already; however, the improved concurrency from MDEV-23399 and MDEV-23855 may have made it easier to hit. The scenario involves both delete buffering and neighbour flushing. There is a similar check in buf_flush_check_neighbor(), but it looks like elenst got extremely lucky: buf_pool.watch_set() was invoked between that initial check and this problematic spot.

Comment by Marko Mäkelä [ 2020-10-30 ]

This bug should have been much harder to hit before MDEV-23909 was fixed.

Generated at Thu Feb 08 09:27:05 UTC 2024 using Jira 8.20.16.