[MDEV-22031] Assertion `bpage->in_page_hash' failed in buf_pool_watch_set upon normal upgrade from 10.1 Created: 2020-03-24  Updated: 2020-03-25  Resolved: 2020-03-25

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5
Fix Version/s: 10.5.2

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Attachments: File data-10.1.44.tar.gz    
Issue Links:
Relates
relates to MDEV-15053 Reduce buf_pool_t::mutex contention Closed

 Description   

Upon upgrade from 10.1, InnoDB sporadically fails to start up with

10.5 d23c3e03

mysqld: /data/src/10.5/storage/innobase/buf/buf0buf.cc:2723: buf_page_t* buf_pool_watch_set(page_id_t, rw_lock_t**): Assertion `bpage->in_page_hash' failed.
200325  0:36:27 [ERROR] mysqld got signal 6 ;
 
#3  0x00007fc65dbaff12 in __GI___assert_fail (assertion=0x5578057cbd16 "bpage->in_page_hash", file=0x5578057cc350 "/data/src/10.5/storage/innobase/buf/buf0buf.cc", line=2723, 
    function=0x5578057d0e20 <buf_pool_watch_set(page_id_t, rw_lock_t**)::__PRETTY_FUNCTION__> "buf_page_t* buf_pool_watch_set(page_id_t, rw_lock_t**)") at assert.c:101
#4  0x00005578050e2376 in buf_pool_watch_set (page_id=..., hash_lock=0x7fc6395f9ba0) at /data/src/10.5/storage/innobase/buf/buf0buf.cc:2723
#5  0x00005578050e491d in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=15, file=0x557805765b48 "/data/src/10.5/storage/innobase/row/row0row.cc", line=1300, mtr=0x7fc6395fb0f0, err=0x7fc6395f9e30, 
    allow_ibuf_merge=true) at /data/src/10.5/storage/innobase/buf/buf0buf.cc:3349
#6  0x00005578050a2ecf in btr_cur_search_to_nth_level_func (index=0x7fc62c00e988, level=0, tuple=0x7fc62c00a6f8, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x7fc6395fae50, ahi_latch=0x0, 
    file=0x557805765b48 "/data/src/10.5/storage/innobase/row/row0row.cc", line=1300, mtr=0x7fc6395fb0f0, autoinc=0) at /data/src/10.5/storage/innobase/btr/btr0cur.cc:1606
#7  0x0000557804fb4e4e in btr_pcur_open_low (index=0x7fc62c00e988, level=0, tuple=0x7fc62c00a6f8, mode=PAGE_CUR_LE, latch_mode=8194, cursor=0x7fc6395fae50, file=0x557805765b48 "/data/src/10.5/storage/innobase/row/row0row.cc", 
    line=1300, autoinc=0, mtr=0x7fc6395fb0f0) at /data/src/10.5/storage/innobase/include/btr0pcur.ic:441
#8  0x0000557804fb8fa4 in row_search_index_entry (index=0x7fc62c00e988, entry=0x7fc62c00a6f8, mode=8194, pcur=0x7fc6395fae50, mtr=0x7fc6395fb0f0) at /data/src/10.5/storage/innobase/row/row0row.cc:1300
#9  0x0000557804fac8c0 in row_purge_remove_sec_if_poss_leaf (node=0x557809e9e1e8, index=0x7fc62c00e988, entry=0x7fc62c00a6f8) at /data/src/10.5/storage/innobase/row/row0purge.cc:457
#10 0x0000557804face18 in row_purge_remove_sec_if_poss (node=0x557809e9e1e8, index=0x7fc62c00e988, entry=0x7fc62c00a6f8) at /data/src/10.5/storage/innobase/row/row0purge.cc:569
#11 0x0000557804fad029 in row_purge_del_mark (node=0x557809e9e1e8) at /data/src/10.5/storage/innobase/row/row0purge.cc:639
#12 0x0000557804faea4e in row_purge_record_func (node=0x557809e9e1e8, undo_rec=0x557809ea0170 "", thr=0x557809e9e130, updated_extern=false) at /data/src/10.5/storage/innobase/row/row0purge.cc:1048
#13 0x0000557804faed17 in row_purge (node=0x557809e9e1e8, undo_rec=0x557809ea0170 "", thr=0x557809e9e130) at /data/src/10.5/storage/innobase/row/row0purge.cc:1110
#14 0x0000557804faee95 in row_purge_step (thr=0x557809e9e130) at /data/src/10.5/storage/innobase/row/row0purge.cc:1159
#15 0x0000557804f188bd in que_thr_step (thr=0x557809e9e130) at /data/src/10.5/storage/innobase/que/que0que.cc:1038
#16 0x0000557804f18b54 in que_run_threads_low (thr=0x557809e9e130) at /data/src/10.5/storage/innobase/que/que0que.cc:1100
#17 0x0000557804f18da4 in que_run_threads (thr=0x557809e9e130) at /data/src/10.5/storage/innobase/que/que0que.cc:1140
#18 0x0000557804ff6519 in srv_task_execute () at /data/src/10.5/storage/innobase/srv/srv0srv.cc:2055
#19 0x0000557804ff6ca1 in purge_worker_callback () at /data/src/10.5/storage/innobase/srv/srv0srv.cc:2205
#20 0x000055780522bbca in tpool::task_group::execute (this=0x557806a503a0 <purge_task_group>, t=0x557806a50160 <purge_worker_task>) at /data/src/10.5/tpool/task_group.cc:55
#21 0x000055780522bf80 in tpool::task::execute (this=0x557806a50160 <purge_worker_task>) at /data/src/10.5/tpool/task.cc:47
#22 0x00005578052256a0 in tpool::thread_pool_generic::worker_main (this=0x557807c89b70, thread_var=0x557807c99700) at /data/src/10.5/tpool/tpool_generic.cc:518
#23 0x000055780522b9db in std::__invoke_impl<void, void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&) (__f=
    @0x7fc650001d68: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x557805225608 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, 
    __t=<unknown type in /data/bld/10.5-daily/bin/mariadbd, CU 0x7fb5ae4, DIE 0x7fc725c>, __args#0=<unknown type in /data/bld/10.5-daily/bin/mariadbd, CU 0x7fb5ae4, DIE 0x7fc7268>) at /usr/include/c++/6/functional:227
#24 0x000055780522b94d in std::__invoke<void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&) (__fn=
    @0x7fc650001d68: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x557805225608 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, 
    __args#0=<unknown type in /data/bld/10.5-daily/bin/mariadbd, CU 0x7fb5ae4, DIE 0x7fc725c>, __args#1=<unknown type in /data/bld/10.5-daily/bin/mariadbd, CU 0x7fb5ae4, DIE 0x7fc7268>) at /usr/include/c++/6/functional:251
#25 0x000055780522b8e1 in std::_Mem_fn_base<void (tpool::thread_pool_generic::*)(tpool::worker_data*), true>::operator()<tpool::thread_pool_generic*, tpool::worker_data*>(tpool::thread_pool_generic*&&, tpool::worker_data*&&) const (this=0x7fc650001d68, __args#0=<unknown type in /data/bld/10.5-daily/bin/mariadbd, CU 0x7fb5ae4, DIE 0x7fc725c>, __args#1=<unknown type in /data/bld/10.5-daily/bin/mariadbd, CU 0x7fb5ae4, DIE 0x7fc7268>)
    at /usr/include/c++/6/functional:604
#26 0x000055780522b893 in std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)>::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (
    this=0x7fc650001d58) at /usr/include/c++/6/functional:1391
#27 0x000055780522b791 in std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)>::operator()() (this=0x7fc650001d58)
    at /usr/include/c++/6/functional:1380
#28 0x000055780522b770 in std::thread::_State_impl<std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)> >::_M_run() (this=0x7fc650001d50)
    at /usr/include/c++/6/thread:197
#29 0x00007fc65e4f7e6f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#30 0x00007fc65fb384a4 in start_thread (arg=0x7fc6395fc700) at pthread_create.c:456
#31 0x00007fc65dc6cd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

The failure can be reproduced by starting the server on the attached datadir repeatedly in a loop. The data was created on a 10.1.44 release server by running a DDL/DML flow and shutting down the server normally at the end. The general log and error log are also in the attached archive.
Startup options:

--innodb-page-size=4K --innodb-compression-algorithm=none --loose-innodb_log_compressed_pages=on

The failure could not be reproduced on 10.4 in a reasonable time.



 Comments   
Comment by Marko Mäkelä [ 2020-03-25 ]

I am unable to reproduce the reported crash locally. I tried a debug build, an ASAN build, and a release build. The rr replay that was shared indicates that in buf_pool_watch_set(), bpage == buf_pool.watch. This is almost certainly a race condition caused by my recent change to buf_pool_watch_unset(), which was motivated by MDEV-15053. I will revert that change.

Comment by Marko Mäkelä [ 2020-03-25 ]

The problem is that the revised buf_pool_watch_unset() protects the debug field buf_page_t::in_page_hash only by the hash bucket rw-lock, no longer by buf_pool.mutex. Basically, the failing assertion is bogus: buf_pool_watch_set() is checking the consistency of a block that it is not going to allocate, while buf_pool_watch_unset() is in the process of freeing that block. The other assertion in that branch is fine, because it is protected by buf_pool.mutex in both threads:

diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index 2ff61702203..08dc4789d52 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -2720,7 +2720,6 @@ buf_pool_watch_set(
 
                        return(NULL);
                case BUF_BLOCK_ZIP_PAGE:
-                       ut_ad(bpage->in_page_hash);
                        ut_ad(bpage->buf_fix_count > 0);
                        break;
                default:

The general assumption seems to be that buf_page_t::in_page_hash is protected by buf_pool.mutex. For now, I will revert the change to buf_pool_watch_unset(), and in MDEV-15053 I plan to relax the condition or remove the debug field altogether.

Generated at Thu Feb 08 09:11:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.