[MDEV-18286] Assertion `pagecache->cnt_for_resize_op == 0' failed in check_pagecache_is_cleaned_up on server shutdown Created: 2019-01-17  Updated: 2020-05-23  Resolved: 2020-04-26

Status: Closed
Project: MariaDB Server
Component/s: Locking, Storage Engine - Aria
Affects Version/s: 10.4, 10.5
Fix Version/s: 10.4.13

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None


 Description   

Note: The description was updated much later than the comments were added. The comments discuss the initial report, which only had the stack trace, without a test case or any other indication how the problem could be reproduced.

10.4 88cf6f1c7f with the injection below

2020-04-25  4:55:50 0 [Note] /data/src/10.4-bug/sql/mysqld (initiated by: root[root] @ localhost []): Normal shutdown
2020-04-25  4:55:50 0 [Note] Event Scheduler: Purging the queue. 0 events
mysqld: /data/src/10.4-bug/storage/maria/ma_pagecache.c:1133: check_pagecache_is_cleaned_up: Assertion `pagecache->cnt_for_resize_op == 0' failed.
200425  4:55:50 [ERROR] mysqld got signal 6 ;
 
#6  0x00007f355a45840f in __assert_fail_base (fmt=0x7f355a5baee0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55a54b2bb2a8 "pagecache->cnt_for_resize_op == 0", file=0x55a54b2baf78 "/data/src/10.4-bug/storage/maria/ma_pagecache.c", line=1133, function=<optimized out>) at assert.c:92
#7  0x00007f355a466102 in __GI___assert_fail (assertion=0x55a54b2bb2a8 "pagecache->cnt_for_resize_op == 0", file=0x55a54b2baf78 "/data/src/10.4-bug/storage/maria/ma_pagecache.c", line=1133, function=0x55a54b2bd2e0 <__PRETTY_FUNCTION__.18326> "check_pagecache_is_cleaned_up") at assert.c:101
#8  0x000055a54ac6a97f in check_pagecache_is_cleaned_up (pagecache=0x55a54c287d00 <maria_pagecache_var>) at /data/src/10.4-bug/storage/maria/ma_pagecache.c:1133
#9  0x000055a54ac6aba0 in end_pagecache (pagecache=0x55a54c287d00 <maria_pagecache_var>, cleanup=1 '\001') at /data/src/10.4-bug/storage/maria/ma_pagecache.c:1175
#10 0x000055a54ac862ba in maria_end () at /data/src/10.4-bug/storage/maria/ma_init.c:110
#11 0x000055a54acd2157 in maria_panic (flag=HA_PANIC_CLOSE) at /data/src/10.4-bug/storage/maria/ma_panic.c:135
#12 0x000055a54ac478c6 in maria_hton_panic (hton=0x55a54dd19590, flag=HA_PANIC_CLOSE) at /data/src/10.4-bug/storage/maria/ha_maria.cc:3347
#13 0x000055a54a5a8fe9 in ha_finalize_handlerton (plugin=0x55a54dd0ba88) at /data/src/10.4-bug/sql/handler.cc:496
#14 0x000055a54a271157 in plugin_deinitialize (plugin=0x55a54dd0ba88, ref_check=true) at /data/src/10.4-bug/sql/sql_plugin.cc:1241
#15 0x000055a54a2716c1 in reap_plugins () at /data/src/10.4-bug/sql/sql_plugin.cc:1317
#16 0x000055a54a273ae0 in plugin_shutdown () at /data/src/10.4-bug/sql/sql_plugin.cc:1986
#17 0x000055a54a115e20 in clean_up (print_message=true) at /data/src/10.4-bug/sql/mysqld.cc:1988
#18 0x000055a54a11e4ed in mysqld_main (argc=130, argv=0x55a54dbd98d0) at /data/src/10.4-bug/sql/mysqld.cc:5922
#19 0x000055a54a112275 in main (argc=6, argv=0x7fffa3e800b8) at /data/src/10.4-bug/sql/main.cc:25

Debug injection for 10.4, e.g. 88cf6f1c7f

diff --git a/storage/maria/ma_pagecache.c b/storage/maria/ma_pagecache.c
index f486d8c704f..9f02355b44f 100644
--- a/storage/maria/ma_pagecache.c
+++ b/storage/maria/ma_pagecache.c
@@ -4416,6 +4416,7 @@ static int flush_cached_blocks(PAGECACHE *pagecache,
       @todo change argument of functions to be File.
     */
     pagecache_pthread_mutex_unlock(&pagecache->cache_lock);
+    DBUG_EXECUTE_IF("sleep_mutex",{while(pagecache->cache_lock.m_mutex.thread == 0) my_sleep(100000);};);
     error= pagecache_fwrite(pagecache, &block->hash_link->file,
                             block->buffer,
                             block->hash_link->pageno,

Even with the injection above, the test case is still non-deterministic, run with --repeat=N if it doesn't fail right away. Currently it fails for me roughly every other time.

Test case to be run with the injection above

--source include/restart_mysqld.inc
 
CREATE TABLE t1 (pk INT, a VARCHAR(8), PRIMARY KEY(pk)) ENGINE=Aria TRANSACTIONAL=1;
 
--connect (con1,localhost,root,,test)
BACKUP STAGE START;
SET debug_dbug= '+d,sleep_mutex';
--send
  BACKUP STAGE BLOCK_COMMIT;
 
--connection default
INSERT INTO t1 VALUES (1,'a');
INSERT INTO t1 VALUES (2,'b');
INSERT INTO t1 VALUES (3,'c');
INSERT INTO t1 VALUES (4,'c');
INSERT INTO t1 VALUES (5,'c');
INSERT INTO t1 VALUES (6,'c');
INSERT INTO t1 VALUES (7,'c');
INSERT INTO t1 VALUES (8,'c');
INSERT INTO t1 VALUES (9,'c');
 
--connection con1
--reap
 
--sleep 1
--source include/restart_mysqld.inc
 
DROP TABLE t1;



 Comments   
Comment by Alice Sherepa [ 2019-02-13 ]

10.4 7dfbb66fcb529faad01, the same assertion during shutdown

2019-02-12  3:25:33 0 [Note] InnoDB: Shutdown completed; log sequence number 874694; transaction id 1500
2019-02-12  3:25:33 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
mysqld: /10.4/storage/maria/ma_pagecache.c:1133: check_pagecache_is_cleaned_up: Assertion `pagecache->cnt_for_resize_op == 0' failed.
 
maria/ma_pagecache.c:1135(check_pagecache_is_cleaned_up)[0x562110d3796a]
maria/ma_pagecache.c:1178(end_pagecache)[0x562110d37b4b]
maria/ma_init.c:111(maria_end)[0x562110d513bf]
maria/ma_panic.c:136(maria_panic)[0x562110d9a563]
maria/ha_maria.cc:3384(maria_hton_panic(handlerton*, ha_panic_function))[0x562110d1610e]
sql/handler.cc:471(ha_finalize_handlerton(st_plugin_int*))[0x562110ae5d6b]
sql/sql_plugin.cc:1240(plugin_deinitialize(st_plugin_int*, bool))[0x5621107ce37b]
sql/sql_plugin.cc:1315(reap_plugins())[0x5621107ce86f]
sql/sql_plugin.cc:1983(plugin_shutdown())[0x5621107d0a41]
sql/mysqld.cc:1980(clean_up(bool))[0x56211068034e]
sql/mysqld.cc:5879(mysqld_main(int, char**))[0x562110688587]
sql/main.cc:26(main)[0x56211067c8d0]
csu/libc-start.c:325(__libc_start_main)[0x7f8e2bef9830]

Comment by Sergey Vojtovich [ 2019-03-06 ]

This doesn't look like a race condition. But there seem to be a possibility for the reference counter leak.

pagecache_read(), pagecache_delete(), pagecache_write_part() restart on make_lock_and_pin() failure. On every restart they increment cnt_for_resize_op again, without corresponding decrement.

I attempted to simulate lock failure from make_lock_and_pin(), but it doesn't seem to be covered by our test suite and most probably needs some concurrency.

elenst, could you check if the following patch increases failure possibility in your tests?

diff --git a/storage/maria/ma_pagecache.c b/storage/maria/ma_pagecache.c
index d10595f..f763310 100644
--- a/storage/maria/ma_pagecache.c
+++ b/storage/maria/ma_pagecache.c
@@ -2466,6 +2466,8 @@ static my_bool get_rdlock(PAGECACHE *pagecache,
                       block, file.file, block->hash_link->file.file,
                       (ulong) pageno, (ulong) block->hash_link->pageno));
   PCBLOCK_INFO(block);
+  if (rand() % 2)
+    DBUG_RETURN(1);
   while (block->wlocks && !pthread_equal(block->write_locker, locker))
   {
     /* Lock failed we will wait */

Comment by Matthias Leich [ 2019-04-11 ]

10.4 71848585f81690babb70cba345719a5d178b00d7 2019-04-10T11:19:38+03:00
with the patch from above applied and compiled with debug leads to
Logging: 10.4/mysql-test/mysql-test-run.pl  --mem 1st
vardir: 10.4/bld_debug/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '10.4/bld_debug/mysql-test/var'...
 - symlinking 'var' to '/dev/shm/var_auto_2u03'
Checking supported features...
MariaDB Version 10.4.5-MariaDB-debug
 - SSL connections supported
 - binaries are debug compiled
Collecting tests...
Installing system database...
mysql-test-run: *** ERROR: Error executing mysqld --bootstrap
Could not install system database from 10.4/bld_debug/mysql-test/var/log/bootstrap.sql
The 10.4/bld_debug/mysql-test/var/log/bootstrap.log file contains:
10.4/bld_debug/sql/mysqld --no-defaults --disable-getopt-prefix-matching --bootstrap --basedir=10.4 --datadir=10.4/bld_debug/mysql-test/var/install.db --plugin-dir=10.4/bld_debug/mysql-test/var/plugins --default-storage-engine=myisam --loose-skip-plugin-feedback --loose-skip-plugin-innodb --loose-skip-plugin-innodb-buffer-page --loose-skip-plugin-innodb-buffer-page-lru --loose-skip-plugin-innodb-buffer-pool-stats --loose-skip-plugin-innodb-cmp --loose-skip-plugin-innodb-cmp-per-index --loose-skip-plugin-innodb-cmp-per-index-reset --loose-skip-plugin-innodb-cmp-reset --loose-skip-plugin-innodb-cmpmem --loose-skip-plugin-innodb-cmpmem-reset --loose-skip-plugin-innodb-ft-being-deleted --loose-skip-plugin-innodb-ft-config --loose-skip-plugin-innodb-ft-default-stopword --loose-skip-plugin-innodb-ft-deleted --loose-skip-plugin-innodb-ft-index-cache --loose-skip-plugin-innodb-ft-index-table --loose-skip-plugin-innodb-lock-waits --loose-skip-plugin-innodb-locks --loose-skip-plugin-innodb-metrics --loose-skip-plugin-innodb-mutexes --loose-skip-plugin-innodb-sys-columns --loose-skip-plugin-innodb-sys-datafiles --loose-skip-plugin-innodb-sys-fields --loose-skip-plugin-innodb-sys-foreign --loose-skip-plugin-innodb-sys-foreign-cols --loose-skip-plugin-innodb-sys-indexes --loose-skip-plugin-innodb-sys-semaphore-waits --loose-skip-plugin-innodb-sys-tables --loose-skip-plugin-innodb-sys-tablespaces --loose-skip-plugin-innodb-sys-tablestats --loose-skip-plugin-innodb-sys-virtual --loose-skip-plugin-innodb-tablespaces-encryption --loose-skip-plugin-innodb-tablespaces-scrubbing --loose-skip-plugin-innodb-trx --loose-skip-plugin-partition --loose-skip-plugin-sequence --loose-skip-plugin-unix-socket --loose-skip-plugin-user-variables --loose-innodb --loose-innodb-log-file-size=5M --disable-sync-frm --tmpdir=10.4/bld_debug/mysql-test/var/tmp/ --core-file --console --lc-messages-dir=10.4/bld_debug/sql/share/ --character-sets-dir=10.4/sql/share/charsets
2019-04-11 14:26:12 0 [Note] 10.4/bld_debug/sql/mysqld (mysqld 10.4.5-MariaDB-debug) starting as ...
2019-04-11 14:26:13 0 [Note] Plugin 'unix_socket' is disabled.
mysqld: 10.4/storage/maria/ma_pagecache.c:2667: make_lock_and_pin: Assertion `block->hash_link->requests > 0' failed.
190411 14:26:13 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.4.5-MariaDB-debug
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467808 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x55e7e09fc5b8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fff187172f8 thread_stack 0x49000
10.4/bld_debug/sql/mysqld(my_print_stacktrace+0x40)[0x55e7ddf52fe9]
mysys/stacktrace.c:269(my_print_stacktrace)[0x55e7dd786a92]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11670)[0x7f1e8f630670]
linux/raise.c:58(__GI_raise)[0x7f1e8e7e177f]
stdlib/abort.c:91(__GI_abort)[0x7f1e8e7e337a]
assert/assert.c:92(__assert_fail_base)[0x7f1e8e7d9b47]
/lib/x86_64-linux-gnu/libc.so.6(+0x2dbf2)[0x7f1e8e7d9bf2]
maria/ma_pagecache.c:2668(make_lock_and_pin)[0x55e7dddae1d7]
maria/ma_pagecache.c:4400(flush_cached_blocks)[0x55e7dddb260d]
maria/ma_pagecache.c:4718(flush_pagecache_blocks_int)[0x55e7dddb320c]
maria/ma_pagecache.c:4833(flush_pagecache_blocks_with_filter)[0x55e7dddb363c]
maria/ma_extra.c:650(_ma_flush_table_files)[0x55e7dddcef0b]
maria/ma_check.c:2310(protect_against_repair_crash)[0x55e7dde17d8d]
maria/ma_check.c:2404(initialize_variables_for_repair)[0x55e7dde1819a]
maria/ma_check.c:3702(maria_repair_by_sort)[0x55e7dde1c07a]
maria/ha_maria.cc:1724(ha_maria::repair(THD*, st_handler_check_param*, bool))[0x55e7ddd83e25]
maria/ha_maria.cc:2082(ha_maria::enable_indexes(unsigned int))[0x55e7ddd84cf2]
maria/ha_maria.cc:2291(ha_maria::end_bulk_insert())[0x55e7ddd854f4]
sql/handler.h:3287(handler::ha_end_bulk_insert())[0x55e7dd41523f]
sql/sql_insert.cc:3932(select_insert::prepare_eof())[0x55e7dd411d3d]
sql/sql_insert.cc:4029(select_insert::send_eof())[0x55e7dd412456]
sql/sql_select.cc:19630(do_select(JOIN*, Procedure*))[0x55e7dd4cb348]
sql/sql_select.cc:4388(JOIN::exec_inner())[0x55e7dd4a2269]
sql/sql_select.cc:4171(JOIN::exec())[0x55e7dd4a1416]
sql/sql_select.cc:4604(mysql_select(THD*, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*))[0x55e7dd4a2abb]
sql/sql_select.cc:424(handle_select(THD*, LEX*, select_result*, unsigned long))[0x55e7dd4930de]
sql/sql_parse.cc:4897(mysql_execute_command(THD*))[0x55e7dd45524c]
sql/sql_parse.cc:8154(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x55e7dd460841]
sql/sql_parse.cc:1100(bootstrap(st_mysql_file*))[0x55e7dd449aaa]
sql/mysqld.cc:5794(mysqld_main(int, char**))[0x55e7dd3232fd]
sql/main.cc:26(main)[0x55e7dd3179d0]
csu/libc-start.c:325(__libc_start_main)[0x7f1e8e7cc3f1]
10.4/bld_debug/sql/mysqld(_start+0x2a)[0x55e7dd3178aa]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x55e7e1857f60): INSERT INTO proxies_priv SELECT * FROM tmp_proxies_priv WHERE @had_proxies_priv_table=0;
Connection ID (thread ID): 6
Status: NOT_KILLED
 
Previous attempts without the patch from above were leading to
1. mysql-test-run.pl  --mem 1st      passes
2. I made more than 100 runs of the RQG test mentioned in the bug report above but all failed quite early
    showing already reported bugs.
Sergey Vojtovich, could you provide another patch?
 
 

Comment by Michael Widenius [ 2020-04-25 ]

The reason for the crash is that the counter-of-pinned-pages in the MariaDB pagecache goes wrong.
This only affects debug builds, as in these we do an assert on shutdown to check if the counter is not 0 (some page was left pinned).

The bug was that in 2 places in the page cache, when not succeeding to pin a page and a retry was made, the
counter-of-pinned-pages counter was not properly adjusted.

In the given test case, BLOCK_COMMIT flushed all Aria files. If a block was flushed at the same time the insert tried to access it, the insert would retry to get the block and that would cause the counter to go wrong.

Comment by Michael Widenius [ 2020-04-26 ]

Fix pushed into 10.4
Will push a fix to 10.5 a bit later as it require a bit more code.

Generated at Thu Feb 08 08:42:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.