Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33613

InnoDB may still hang when temporarily running out of buffer pool

Details

    Description

      The test encryption.innochecksum very often hangs, timing out in 900 seconds (15 minutes). I found it while working on MDEV-29445, but I see that the test fails also in other branches. The following (refining something that was modified in MDEV-33053) would fix the hang for me:

      diff --git a/storage/innobase/buf/buf0lru.cc b/storage/innobase/buf/buf0lru.cc
      index 5ecaa513f41..fa833d96476 100644
      --- a/storage/innobase/buf/buf0lru.cc
      +++ b/storage/innobase/buf/buf0lru.cc
      @@ -449,24 +449,18 @@ buf_block_t *buf_LRU_get_free_block(bool have_mutex)
       			goto got_block;
       		}
       		const bool wake = buf_pool.need_LRU_eviction();
      -		mysql_mutex_unlock(&buf_pool.mutex);
       		mysql_mutex_lock(&buf_pool.flush_list_mutex);
      -		const auto n_flush = buf_pool.n_flush();
       		if (wake && !buf_pool.page_cleaner_active()) {
       			buf_pool.page_cleaner_wakeup(true);
       		}
       		mysql_mutex_unlock(&buf_pool.flush_list_mutex);
      -		mysql_mutex_lock(&buf_pool.mutex);
      -		if (!n_flush) {
      -			goto not_found;
      -		}
      -		if (!buf_pool.try_LRU_scan) {
      +		if (wake) {
       			my_cond_wait(&buf_pool.done_free,
       				     &buf_pool.mutex.m_mutex);
       		}
       	}
       
      -not_found:
      +	IF_DBUG(not_found:,);
       	if (n_iterations > 1) {
       		MONITOR_INC( MONITOR_LRU_GET_FREE_WAITS );
       	}
      

      It looks like this hang was introduced in MDEV-26827. The above fix would depend on the fix of MDEV-33053.

      Attachments

        Issue Links

          Activity

            While the test encryption.innochecksum occasionally hangs on 10.5 as well, it cannot be caused by MDEV-26827, which is 10.6 only.

            marko Marko Mäkelä added a comment - While the test encryption.innochecksum occasionally hangs on 10.5 as well, it cannot be caused by MDEV-26827 , which is 10.6 only.

            I am done with the review and have given my comments. I would need to re-check once after the comments are addressed.

            It would be good if the stack trace for the hang is updated here.

            debarun Debarun Banerjee added a comment - I am done with the review and have given my comments. I would need to re-check once after the comments are addressed. It would be good if the stack trace for the hang is updated here.

            Here is an example of a hang:

            bb-11.5-bar-MDEV-25829 1e0e0e27b862fc0907f9c9599e467edfc4c9b3c4

                                  encryption.innochecksum '8k,ctr,innodb,release,strict_full_crc32' w6 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2024-03-14 01:04:32
            line
            Attempting backtrace. You can use the following information to find out
            ^ Found warnings in /home/buildbot/s390x-rhel-9/build/mysql-test/var/6/log/mysqld.1.err
            …
            Thread 6 (Thread 0x3ff7bfff840 (LWP 31032)):
            #0  0x000003ff8731ab26 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
            No symbol table info available.
            #1  0x000003ff8731d774 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
            No symbol table info available.
            #2  0x000002aa04576862 in buf_flush_page_cleaner () at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0flu.cc:2460
                    soft_lsn_limit = <optimized out>
                    oldest_lsn = <optimized out>
                    pct_lwm = <optimized out>
                    n_flushed = <optimized out>
                    n = <optimized out>
                    dirty_blocks = <optimized out>
                    dirty_pct = <optimized out>
                    last_pages = <optimized out>
                    abstime = {tv_sec = 1710378205, tv_nsec = 429323000}
                    lsn_limit = 0
                    last_activity_count = <optimized out>
            #3  0x000003ff876f298e in execute_native_thread_routine () from /lib64/libstdc++.so.6
            No symbol table info available.
            #4  0x000003ff8731e2e6 in start_thread () from /lib64/libc.so.6
            No symbol table info available.
            #5  0x000003ff872afdbe in thread_start () from /lib64/libc.so.6
            No symbol table info available.
            …
            Thread 3 (Thread 0x3ff2b7fe840 (LWP 31199)):
            #0  0x000003ff8731ab26 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
            No symbol table info available.
            #1  0x000003ff8731d774 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
            No symbol table info available.
            #2  0x000002aa0457a1aa in buf_LRU_get_free_block (get=get@entry=have_mutex) at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0lru.cc:481
                    wake = true
                    n_flush = 0
                    retry = <optimized out>
                    n_iterations = 0
                    flush_failures = 0
                    block = <optimized out>
            #3  0x000002aa0457af94 in buf_read_page (page_id=page_id@entry={m_id = 51539607582}, zip_size=zip_size@entry=0, chain=@0x2aa06b52a10: {first = 0x0}) at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0rea.cc:429
                    space = 0x3ff541c7068
                    block = 0x0
                    err = <optimized out>
            #4  0x000002aa03df85f2 in buf_page_get_gen (page_id=page_id@entry={m_id = 51539607582}, zip_size=zip_size@entry=0, rw_latch=4396028032552, rw_latch@entry=729795592, guess=0x0, guess@entry=0x3ff2b7fcb68, mode=4395911718976, mode@entry=10, mtr=0x3ff2b7fd358, err=0x3ff2b7fcb0c) at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0buf.cc:2677
                    local_err = <optimized out>
                    loop = <optimized out>
                    retries = <optimized out>
                    chain = @0x2aa0707af60: {first = 0x38800000001}
                    hash_lock = @0x2aa04fd91c0: {lk = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 26054432}, <No data fields>}}}
                    block = 0x0
                    state = <optimized out>
            #5  0x000002aa04555d66 in btr_cur_t::search_leaf (this=this@entry=0x2aa06c2ab80, tuple=tuple@entry=0x3ff1c028368, mode=mode@entry=PAGE_CUR_LE, latch_mode=729795360, mtr=0xaa, mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/btr/btr0cur.cc:1142
                    same_key_root = <optimized out>
                    search_loop = <optimized out>
                    guess = 0x3ff2b7fcb68
                    lock_intention = <optimized out>
                    detected_same_key_root = true
                    heap = 0x0
                    offsets_ = {300, 2, 32774, 4, 8, 2, 1718, 59160, 0, 12, 0, 0, 0, 1023, 32919, 49152, 0, 0, 0, 8192, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 34, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1023, 34722, 46448, 0, 1023, 11135, 51968, 240, 29806, 60917, 2424, 0, 0, 0, 8162, 0, 1023, 34610, 56812, 0, 1023, 11135, 52000, 0, 0, 32, 30286, 0, 0, 0, 170, 0, 1023, 11135, 52232, 0, 1023, 34811, 45312, 0, 1023, 11135, 52072, 0, 682, 1277, 37312, 0, 1023, 11135, 53252, 0, 682, 1087, 27786, 0, 1023, 11135, 52072, 0, 0, 0, 34, 0, 1023, 7168, 2256, 0, 0, 256, 64562, 0, 0, 0, 5, 0, 0, 0, 22, 0, 1023, 11135, 52600, 0, 1023, 11135, 55241, 0, 1023, 11135, 55256, 0 <repeats 11 times>, 1, 0, 682, 1730, 43849, 0, 1023, 11135, 52176, 57034, 19424, 30044, 27368, 0, 1023, 11135, 52896, 0, 1023, 11135, 52888, 0, 1023, 11135, 52896, 0, 0, 0, 22, 0, 1023, 11135, 52600, 0, 1023, 11135, 52264, 0, 682, 1277, 37312, 0, 0, 0, 14, 0, 682, 986, 63592, 0, 1023, 11135, 52264, 0, 682, 1277, 37312, 0...}
                    offsets = 0x20764e
                    offsets2_ = {300, 0, 0, 0, 0, 11263, 44297, 21140, 0, 0, 0, 34, 0, 0, 0, 16, 0, 1023, 32919, 49152, 0, 1023, 11135, 52552, 0, 682, 1277, 37312, 0, 682, 1718, 59160, 0, 682, 1110, 33984, 0, 1023, 11135, 52552, 0, 0, 0, 8192, 0, 682, 1284, 14896, 0, 0, 0, 8192, 0, 1023, 11135, 52848, 0, 1023, 0, 0, 0, 682, 1718, 59496, 0, 1023, 11135, 59456, 0, 0, 0, 0, 0, 1, 0, 57, 0, 1, 0, 57, 0, 1, 0, 57, 0, 682, 1718, 60024, 0, 1023, 11135, 52688, 0, 682, 1277, 37312, 0, 0, 0, 0, 0, 0, 0, 107, 0, 1023, 32913, 36352, 0, 0, 0, 8192, 0, 1023, 11135, 52912, 0, 0, 0, 0, 0, 1023, 11135, 52752, 0, 7, 0, 6, 0, 1023, 33532, 30208, 0, 0, 0, 0, 0, 682, 1630, 21248, 0, 1023, 34440, 4608, 0, 0, 0, 0, 0, 1023, 11135, 52928, 0, 1023, 11135, 52928, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1023, 11135, 52896, 0, 1023, 11135, 52896, 45057, 3072, 11823, 29797, 57034, 19424, 45691, 62183, 25600, 1023, 11135, 55816, 0, 682, 1284, 2324, 0 <repeats 13 times>, 1023, 32768...}
                    offsets2 = 0x22
                    latch_by_caller = <optimized out>
                    info = <optimized out>
                    ahi_enabled = <optimized out>
                    savepoint = 0
                    node_ptr_max_size = 4396027131248
                    compress_limit = 0
                    rw_latch = (RW_NO_LATCH | unknown: 0x2b7fcc00)
                    err = DB_SUCCESS
                    zip_size = <optimized out>
                    page_id = {m_id = 51539607582}
                    page_mode = 3992258936
                    height = 2929251422656
                    block_savepoint = 2
                    block = <optimized out>
                    not_first_access = <optimized out>
                    page_level = <optimized out>
            #6  0x000002aa04500e18 in btr_pcur_open (mtr=0x3ff2b7fd358, cursor=0x2aa06c2ab80, latch_mode=<optimized out>, mode=PAGE_CUR_LE, tuple=0x3ff1c028368) at /home/buildbot/s390x-rhel-9/build/storage/innobase/include/btr0pcur.h:431
            No locals.
            #7  row_search_on_row_ref (pcur=pcur@entry=0x2aa06c2ab80, mode=<optimized out>, table=<optimized out>, ref=0x3ff1c028368, mtr=mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0row.cc:1211
                    index = 0x3ff541c7290
            #8  0x000002aa044fb59a in row_purge_reposition_pcur (mode=mode@entry=BTR_MODIFY_LEAF, node=node@entry=0x2aa06c2aaf8, mtr=mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:83
            No locals.
            #9  0x000002aa044fbd30 in row_purge_reset_trx_id (node=node@entry=0x2aa06c2aaf8, mtr=mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:604
            No locals.
            #10 0x000002aa044fcc9c in row_purge_record_func (node=node@entry=0x2aa06c2aaf8, undo_rec=0x3ff2b7fd378 "", undo_rec@entry=0x3ff8097d078 "", thr=thr@entry=0x3ff2b7fda08, updated_extern=<optimized out>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:1094
                    mtr = {m_last = 0x0, m_last_offset = 0, m_log_mode = 0, m_modifications = 0, m_made_dirty = 0, m_latch_ex = 0, m_trim_pages = 0, m_crc = 2067013353, m_memo = {<small_vector_base> = {BeginX = 0x3ff2b7fd378, Size = 2, Capacity = 16}, small = {{object = 0x3ff541c7390, type = MTR_MEMO_S_LOCK}, {object = 0x3ff8081a160, type = MTR_MEMO_PAGE_S_FIX}, {object = 0x3ff80a1a5c0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x2aa04fd91c0, type = 0}, {object = 0x2aa0449d592 <mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+210>, type = 1023}, {object = 0x6, type = 1023}, {object = 0x15df, type = 682}, {object = 0x3ff2b7fda08, type = 0}, {object = 0x6, type = 1023}, {object = 0x1, type = 15758446}, {object = 0x3ff2b7fd5b7, type = 0}, {object = 0x3ff2b7fd448, type = 0}, {object = 0x3ff2b7fe840, type = 682}, {object = 0x3ff2b7fd9f8, type = 682}, {object = 0x3ff2b7fd400, type = 682}, {object = 0x10000003c, type = 682}}}, m_log = {m_heap = 0x0, m_list = {<ilist<mtr_buf_t::block_t, void>> = {sentinel_ = {next = 0x3ff2b7fd4a0, prev = 0x3ff2b7fd4a0}}, size_ = 1}, m_size = 0, m_first_block = {<ilist_node<void>> = {next = 0x3ff2b7fd480, prev = 0x3ff2b7fd480}, m_data = "F\f\035\212E\006\000\262\000\200\303\000\006\000\000<", '\000' <repeats 15 times>, "\002\000\000\003\377+\177\324\340\000\000\000\000\000\000\000\020\000\000\003\377\200a\210`\000\000\000\001\000\000\000\002\000\000\003\377+\177test\000\377+\177\324\320\000\000\000\000\000\000\000\000\000\000\002\252\006\245\354\340\000\000\000\000\000\000\000\001\336\312K\340u]\326\350\000\000\000\000\000\000\001\347\000\000\000\000\000#\342\333\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\000\336\312K\340\034\243Z\305\000\000\002\252\006e\377H\000\000\000\000\000\000\000\200\000\000\002\252\006\245\356\000\001\312K\340\034\243F\305\000\000\000\000\000\000\000\000\000\000\003\377+\177\325\220"..., m_used = 0}}, m_user_space = 0x0, m_commit_lsn = 0, m_freed_space = 0x0, m_freed_pages = 0x0}
                    purged = true
            #11 0x000002aa044fd2da in row_purge (thr=<optimized out>, undo_rec=<optimized out>, node=0x2aa06c2aaf8) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:1138
                    purged = <optimized out>
                    updated_extern = false
            #12 row_purge_step (thr=thr@entry=0x2aa06c2aa58) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:1201
                    purge_rec = {undo_rec = <optimized out>, roll_ptr = <optimized out>}
                    node = 0x2aa06c2aaf8
            #13 0x000002aa044c5292 in que_thr_step (thr=0x2aa06c2aa58) at /home/buildbot/s390x-rhel-9/build/storage/innobase/que/que0que.cc:588
                    node = 0x2aa06c2aaf8
                    old_thr = 0x2aa06c2aa58
                    trx = 0x3ff80d17200
                    type = <optimized out>
                    node = <optimized out>
                    old_thr = <optimized out>
                    trx = <optimized out>
                    type = <optimized out>
            #14 que_run_threads_low (thr=<optimized out>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/que/que0que.cc:644
                    next_thr = <optimized out>
                    trx = <optimized out>
            #15 que_run_threads (thr=0x2aa06c2aa58) at /home/buildbot/s390x-rhel-9/build/storage/innobase/que/que0que.cc:664
                    loop = <optimized out>
                    trx = 0x3ff80d17200
            #16 0x000002aa03df2dd0 in trx_purge (n_tasks=n_tasks@entry=4, history_size=<optimized out>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/trx/trx0purge.cc:1484
                    no_throttle = <optimized out>
                    thd = 0x2aa06fc52a8
                    head = {trx_no = <optimized out>, undo_no = <optimized out>}
                    n_pages = <optimized out>
                    thr = <optimized out>
            #17 0x000002aa0451c67c in purge_coordinator_state::do_purge (this=this@entry=0x2aa05980320 <purge_state>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1425
                    n_pages_handled = <optimized out>
                    n_threads = <optimized out>
            #18 0x000002aa0451c29a in purge_coordinator_callback () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1503
                    ctx = 0x3ff1c000b60
                    thd = 0x2aa06fc52a8
            #19 0x000002aa045ff348 in tpool::task_group::execute (this=0x2aa05980628 <purge_coordinator_task_group>, t=0x2aa05980338 <purge_coordinator_task>) at /home/buildbot/s390x-rhel-9/build/tpool/task_group.cc:73
                    lk = {_M_device = <optimized out>, _M_owns = false}
            #20 0x000002aa045fd904 in tpool::thread_pool_generic::worker_main (this=0x2aa06a5ece0, thread_var=0x2aa06a5fe20) at /home/buildbot/s390x-rhel-9/build/tpool/tpool_generic.cc:583
                    task = 0x2aa05980338 <purge_coordinator_task>
            #21 0x000003ff876f298e in execute_native_thread_routine () from /lib64/libstdc++.so.6
            No symbol table info available.
            #22 0x000003ff8731e2e6 in start_thread () from /lib64/libc.so.6
            No symbol table info available.
            #23 0x000003ff872afdbe in thread_start () from /lib64/libc.so.6
            No symbol table info available.
            …
            Thread 1 (Thread 0x3ff881ebca0 (LWP 31001)):
            #0  0x000003ff87320126 in __pthread_kill_implementation () from /lib64/libc.so.6
            No symbol table info available.
            #1  0x000002aa0417fae6 in handle_fatal_signal (sig=<optimized out>) at /home/buildbot/s390x-rhel-9/build/sql/signal_handler.cc:357
                    curr_time = 1710378264
                    tm = {tm_sec = 24, tm_min = 4, tm_hour = 1, tm_mday = 14, tm_mon = 2, tm_year = 124, tm_wday = 4, tm_yday = 73, tm_isdst = 0, tm_gmtoff = 0, tm_zone = 0x2aa066b4ef0 "UTC"}
                    thd = 0x0
                    print_invalid_query_pointer = 89
            #2  <signal handler called>
            No symbol table info available.
            #3  0x000003ff8731ab26 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
            No symbol table info available.
            #4  0x000003ff8731d774 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
            No symbol table info available.
            #5  0x000003ff876eb8f8 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
            No symbol table info available.
            #6  0x000002aa045ff682 in tpool::waitable_task::wait (this=this@entry=0x2aa05980338 <purge_coordinator_task>, lk=@0x3ffd1776330: {_M_device = 0x2aa05980358 <purge_coordinator_task+32>, _M_owns = true}) at /home/buildbot/s390x-rhel-9/build/tpool/task.cc:63
            No locals.
            #7  0x000002aa045ff794 in tpool::waitable_task::disable (this=this@entry=0x2aa05980338 <purge_coordinator_task>) at /home/buildbot/s390x-rhel-9/build/tpool/task.cc:80
                    lk = {_M_device = 0x2aa05980358 <purge_coordinator_task+32>, _M_owns = true}
            #8  0x000002aa0451bfb2 in srv_shutdown_purge_tasks () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1515
                    lk = {_M_device = 0xdeca4be0719c7ee8, _M_owns = false}
                    lk = {_M_device = <optimized out>, _M_owns = <optimized out>}
            #9  srv_purge_shutdown () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1580
                    history_size = 6
            #10 0x000002aa0451d13c in innodb_preshutdown () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0start.cc:1993
                    first_time = false
            #11 innodb_preshutdown () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0start.cc:1979
                    first_time = false
            #12 0x000002aa0417fe42 in plugin_pre_shutdown (plugin=<optimized out>) at /home/buildbot/s390x-rhel-9/build/sql/handler.cc:1113
                    hton = <optimized out>
            #13 0x000002aa03f1f624 in plugin_foreach_with_mask (thd=thd@entry=0x0, func=func@entry=0x2aa0417fe20 <plugin_pre_shutdown(THD*, plugin_ref, void*)>, type=type@entry=1, state_mask=state_mask@entry=10, arg=arg@entry=0x0) at /home/buildbot/s390x-rhel-9/build/sql/sql_plugin.cc:2559
                    idx = 8
                    total = 1146
                    plugin = <optimized out>
                    plugins = 0x3ffd17765e0
                    res = 0 '\000'
            #14 0x000002aa04183cac in ha_pre_shutdown () at /home/buildbot/s390x-rhel-9/build/sql/handler.cc:1120
            No locals.
            #15 0x000002aa03e0add8 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /home/buildbot/s390x-rhel-9/build/sql/mysqld.cc:6001
                    please_close_stdin = <optimized out>
                    ho_error = <optimized out>
                    new_thread_stack_size = <optimized out>
                    user = <optimized out>
            #16 0x000003ff872b08b2 in __libc_start_call_main () from /lib64/libc.so.6
            No symbol table info available.
            #17 0x000003ff872b0990 in __libc_start_main_impl () from /lib64/libc.so.6
            No symbol table info available.
            #18 0x000002aa03dfd1c0 in _start () at /home/buildbot/s390x-rhel-9/build/sql/my_decimal.h:147
            No symbol table info available.
            

            I only included InnoDB threads above. We can see that both buf_LRU_get_free_block() and buf_flush_page_cleaner() are in an indefinite (untimed) wait. The caller of buf_LRU_get_free_block() is a purge worker task, which Thread 1 is waiting for. At one point during the development of MDEV-29445 I was able to reproduce this scenario deterministically on my local system.

            marko Marko Mäkelä added a comment - Here is an example of a hang : bb-11.5-bar-MDEV-25829 1e0e0e27b862fc0907f9c9599e467edfc4c9b3c4 encryption.innochecksum '8k,ctr,innodb,release,strict_full_crc32' w6 [ fail ] Found warnings/errors in server log file! Test ended at 2024-03-14 01:04:32 line Attempting backtrace. You can use the following information to find out ^ Found warnings in /home/buildbot/s390x-rhel-9/build/mysql-test/var/6/log/mysqld.1.err … Thread 6 (Thread 0x3ff7bfff840 (LWP 31032)): #0 0x000003ff8731ab26 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6 No symbol table info available. #1 0x000003ff8731d774 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6 No symbol table info available. #2 0x000002aa04576862 in buf_flush_page_cleaner () at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0flu.cc:2460 soft_lsn_limit = <optimized out> oldest_lsn = <optimized out> pct_lwm = <optimized out> n_flushed = <optimized out> n = <optimized out> dirty_blocks = <optimized out> dirty_pct = <optimized out> last_pages = <optimized out> abstime = {tv_sec = 1710378205, tv_nsec = 429323000} lsn_limit = 0 last_activity_count = <optimized out> #3 0x000003ff876f298e in execute_native_thread_routine () from /lib64/libstdc++.so.6 No symbol table info available. #4 0x000003ff8731e2e6 in start_thread () from /lib64/libc.so.6 No symbol table info available. #5 0x000003ff872afdbe in thread_start () from /lib64/libc.so.6 No symbol table info available. … Thread 3 (Thread 0x3ff2b7fe840 (LWP 31199)): #0 0x000003ff8731ab26 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6 No symbol table info available. #1 0x000003ff8731d774 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6 No symbol table info available. #2 0x000002aa0457a1aa in buf_LRU_get_free_block (get=get@entry=have_mutex) at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0lru.cc:481 wake = true n_flush = 0 retry = <optimized out> n_iterations = 0 flush_failures = 0 block = <optimized out> #3 0x000002aa0457af94 in buf_read_page (page_id=page_id@entry={m_id = 51539607582}, zip_size=zip_size@entry=0, chain=@0x2aa06b52a10: {first = 0x0}) at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0rea.cc:429 space = 0x3ff541c7068 block = 0x0 err = <optimized out> #4 0x000002aa03df85f2 in buf_page_get_gen (page_id=page_id@entry={m_id = 51539607582}, zip_size=zip_size@entry=0, rw_latch=4396028032552, rw_latch@entry=729795592, guess=0x0, guess@entry=0x3ff2b7fcb68, mode=4395911718976, mode@entry=10, mtr=0x3ff2b7fd358, err=0x3ff2b7fcb0c) at /home/buildbot/s390x-rhel-9/build/storage/innobase/buf/buf0buf.cc:2677 local_err = <optimized out> loop = <optimized out> retries = <optimized out> chain = @0x2aa0707af60: {first = 0x38800000001} hash_lock = @0x2aa04fd91c0: {lk = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 26054432}, <No data fields>}}} block = 0x0 state = <optimized out> #5 0x000002aa04555d66 in btr_cur_t::search_leaf (this=this@entry=0x2aa06c2ab80, tuple=tuple@entry=0x3ff1c028368, mode=mode@entry=PAGE_CUR_LE, latch_mode=729795360, mtr=0xaa, mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/btr/btr0cur.cc:1142 same_key_root = <optimized out> search_loop = <optimized out> guess = 0x3ff2b7fcb68 lock_intention = <optimized out> detected_same_key_root = true heap = 0x0 offsets_ = {300, 2, 32774, 4, 8, 2, 1718, 59160, 0, 12, 0, 0, 0, 1023, 32919, 49152, 0, 0, 0, 8192, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 34, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1023, 34722, 46448, 0, 1023, 11135, 51968, 240, 29806, 60917, 2424, 0, 0, 0, 8162, 0, 1023, 34610, 56812, 0, 1023, 11135, 52000, 0, 0, 32, 30286, 0, 0, 0, 170, 0, 1023, 11135, 52232, 0, 1023, 34811, 45312, 0, 1023, 11135, 52072, 0, 682, 1277, 37312, 0, 1023, 11135, 53252, 0, 682, 1087, 27786, 0, 1023, 11135, 52072, 0, 0, 0, 34, 0, 1023, 7168, 2256, 0, 0, 256, 64562, 0, 0, 0, 5, 0, 0, 0, 22, 0, 1023, 11135, 52600, 0, 1023, 11135, 55241, 0, 1023, 11135, 55256, 0 <repeats 11 times>, 1, 0, 682, 1730, 43849, 0, 1023, 11135, 52176, 57034, 19424, 30044, 27368, 0, 1023, 11135, 52896, 0, 1023, 11135, 52888, 0, 1023, 11135, 52896, 0, 0, 0, 22, 0, 1023, 11135, 52600, 0, 1023, 11135, 52264, 0, 682, 1277, 37312, 0, 0, 0, 14, 0, 682, 986, 63592, 0, 1023, 11135, 52264, 0, 682, 1277, 37312, 0...} offsets = 0x20764e offsets2_ = {300, 0, 0, 0, 0, 11263, 44297, 21140, 0, 0, 0, 34, 0, 0, 0, 16, 0, 1023, 32919, 49152, 0, 1023, 11135, 52552, 0, 682, 1277, 37312, 0, 682, 1718, 59160, 0, 682, 1110, 33984, 0, 1023, 11135, 52552, 0, 0, 0, 8192, 0, 682, 1284, 14896, 0, 0, 0, 8192, 0, 1023, 11135, 52848, 0, 1023, 0, 0, 0, 682, 1718, 59496, 0, 1023, 11135, 59456, 0, 0, 0, 0, 0, 1, 0, 57, 0, 1, 0, 57, 0, 1, 0, 57, 0, 682, 1718, 60024, 0, 1023, 11135, 52688, 0, 682, 1277, 37312, 0, 0, 0, 0, 0, 0, 0, 107, 0, 1023, 32913, 36352, 0, 0, 0, 8192, 0, 1023, 11135, 52912, 0, 0, 0, 0, 0, 1023, 11135, 52752, 0, 7, 0, 6, 0, 1023, 33532, 30208, 0, 0, 0, 0, 0, 682, 1630, 21248, 0, 1023, 34440, 4608, 0, 0, 0, 0, 0, 1023, 11135, 52928, 0, 1023, 11135, 52928, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1023, 11135, 52896, 0, 1023, 11135, 52896, 45057, 3072, 11823, 29797, 57034, 19424, 45691, 62183, 25600, 1023, 11135, 55816, 0, 682, 1284, 2324, 0 <repeats 13 times>, 1023, 32768...} offsets2 = 0x22 latch_by_caller = <optimized out> info = <optimized out> ahi_enabled = <optimized out> savepoint = 0 node_ptr_max_size = 4396027131248 compress_limit = 0 rw_latch = (RW_NO_LATCH | unknown: 0x2b7fcc00) err = DB_SUCCESS zip_size = <optimized out> page_id = {m_id = 51539607582} page_mode = 3992258936 height = 2929251422656 block_savepoint = 2 block = <optimized out> not_first_access = <optimized out> page_level = <optimized out> #6 0x000002aa04500e18 in btr_pcur_open (mtr=0x3ff2b7fd358, cursor=0x2aa06c2ab80, latch_mode=<optimized out>, mode=PAGE_CUR_LE, tuple=0x3ff1c028368) at /home/buildbot/s390x-rhel-9/build/storage/innobase/include/btr0pcur.h:431 No locals. #7 row_search_on_row_ref (pcur=pcur@entry=0x2aa06c2ab80, mode=<optimized out>, table=<optimized out>, ref=0x3ff1c028368, mtr=mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0row.cc:1211 index = 0x3ff541c7290 #8 0x000002aa044fb59a in row_purge_reposition_pcur (mode=mode@entry=BTR_MODIFY_LEAF, node=node@entry=0x2aa06c2aaf8, mtr=mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:83 No locals. #9 0x000002aa044fbd30 in row_purge_reset_trx_id (node=node@entry=0x2aa06c2aaf8, mtr=mtr@entry=0x3ff2b7fd358) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:604 No locals. #10 0x000002aa044fcc9c in row_purge_record_func (node=node@entry=0x2aa06c2aaf8, undo_rec=0x3ff2b7fd378 "", undo_rec@entry=0x3ff8097d078 "", thr=thr@entry=0x3ff2b7fda08, updated_extern=<optimized out>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:1094 mtr = {m_last = 0x0, m_last_offset = 0, m_log_mode = 0, m_modifications = 0, m_made_dirty = 0, m_latch_ex = 0, m_trim_pages = 0, m_crc = 2067013353, m_memo = {<small_vector_base> = {BeginX = 0x3ff2b7fd378, Size = 2, Capacity = 16}, small = {{object = 0x3ff541c7390, type = MTR_MEMO_S_LOCK}, {object = 0x3ff8081a160, type = MTR_MEMO_PAGE_S_FIX}, {object = 0x3ff80a1a5c0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x2aa04fd91c0, type = 0}, {object = 0x2aa0449d592 <mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+210>, type = 1023}, {object = 0x6, type = 1023}, {object = 0x15df, type = 682}, {object = 0x3ff2b7fda08, type = 0}, {object = 0x6, type = 1023}, {object = 0x1, type = 15758446}, {object = 0x3ff2b7fd5b7, type = 0}, {object = 0x3ff2b7fd448, type = 0}, {object = 0x3ff2b7fe840, type = 682}, {object = 0x3ff2b7fd9f8, type = 682}, {object = 0x3ff2b7fd400, type = 682}, {object = 0x10000003c, type = 682}}}, m_log = {m_heap = 0x0, m_list = {<ilist<mtr_buf_t::block_t, void>> = {sentinel_ = {next = 0x3ff2b7fd4a0, prev = 0x3ff2b7fd4a0}}, size_ = 1}, m_size = 0, m_first_block = {<ilist_node<void>> = {next = 0x3ff2b7fd480, prev = 0x3ff2b7fd480}, m_data = "F\f\035\212E\006\000\262\000\200\303\000\006\000\000<", '\000' <repeats 15 times>, "\002\000\000\003\377+\177\324\340\000\000\000\000\000\000\000\020\000\000\003\377\200a\210`\000\000\000\001\000\000\000\002\000\000\003\377+\177test\000\377+\177\324\320\000\000\000\000\000\000\000\000\000\000\002\252\006\245\354\340\000\000\000\000\000\000\000\001\336\312K\340u]\326\350\000\000\000\000\000\000\001\347\000\000\000\000\000#\342\333\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\000\336\312K\340\034\243Z\305\000\000\002\252\006e\377H\000\000\000\000\000\000\000\200\000\000\002\252\006\245\356\000\001\312K\340\034\243F\305\000\000\000\000\000\000\000\000\000\000\003\377+\177\325\220"..., m_used = 0}}, m_user_space = 0x0, m_commit_lsn = 0, m_freed_space = 0x0, m_freed_pages = 0x0} purged = true #11 0x000002aa044fd2da in row_purge (thr=<optimized out>, undo_rec=<optimized out>, node=0x2aa06c2aaf8) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:1138 purged = <optimized out> updated_extern = false #12 row_purge_step (thr=thr@entry=0x2aa06c2aa58) at /home/buildbot/s390x-rhel-9/build/storage/innobase/row/row0purge.cc:1201 purge_rec = {undo_rec = <optimized out>, roll_ptr = <optimized out>} node = 0x2aa06c2aaf8 #13 0x000002aa044c5292 in que_thr_step (thr=0x2aa06c2aa58) at /home/buildbot/s390x-rhel-9/build/storage/innobase/que/que0que.cc:588 node = 0x2aa06c2aaf8 old_thr = 0x2aa06c2aa58 trx = 0x3ff80d17200 type = <optimized out> node = <optimized out> old_thr = <optimized out> trx = <optimized out> type = <optimized out> #14 que_run_threads_low (thr=<optimized out>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/que/que0que.cc:644 next_thr = <optimized out> trx = <optimized out> #15 que_run_threads (thr=0x2aa06c2aa58) at /home/buildbot/s390x-rhel-9/build/storage/innobase/que/que0que.cc:664 loop = <optimized out> trx = 0x3ff80d17200 #16 0x000002aa03df2dd0 in trx_purge (n_tasks=n_tasks@entry=4, history_size=<optimized out>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/trx/trx0purge.cc:1484 no_throttle = <optimized out> thd = 0x2aa06fc52a8 head = {trx_no = <optimized out>, undo_no = <optimized out>} n_pages = <optimized out> thr = <optimized out> #17 0x000002aa0451c67c in purge_coordinator_state::do_purge (this=this@entry=0x2aa05980320 <purge_state>) at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1425 n_pages_handled = <optimized out> n_threads = <optimized out> #18 0x000002aa0451c29a in purge_coordinator_callback () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1503 ctx = 0x3ff1c000b60 thd = 0x2aa06fc52a8 #19 0x000002aa045ff348 in tpool::task_group::execute (this=0x2aa05980628 <purge_coordinator_task_group>, t=0x2aa05980338 <purge_coordinator_task>) at /home/buildbot/s390x-rhel-9/build/tpool/task_group.cc:73 lk = {_M_device = <optimized out>, _M_owns = false} #20 0x000002aa045fd904 in tpool::thread_pool_generic::worker_main (this=0x2aa06a5ece0, thread_var=0x2aa06a5fe20) at /home/buildbot/s390x-rhel-9/build/tpool/tpool_generic.cc:583 task = 0x2aa05980338 <purge_coordinator_task> #21 0x000003ff876f298e in execute_native_thread_routine () from /lib64/libstdc++.so.6 No symbol table info available. #22 0x000003ff8731e2e6 in start_thread () from /lib64/libc.so.6 No symbol table info available. #23 0x000003ff872afdbe in thread_start () from /lib64/libc.so.6 No symbol table info available. … Thread 1 (Thread 0x3ff881ebca0 (LWP 31001)): #0 0x000003ff87320126 in __pthread_kill_implementation () from /lib64/libc.so.6 No symbol table info available. #1 0x000002aa0417fae6 in handle_fatal_signal (sig=<optimized out>) at /home/buildbot/s390x-rhel-9/build/sql/signal_handler.cc:357 curr_time = 1710378264 tm = {tm_sec = 24, tm_min = 4, tm_hour = 1, tm_mday = 14, tm_mon = 2, tm_year = 124, tm_wday = 4, tm_yday = 73, tm_isdst = 0, tm_gmtoff = 0, tm_zone = 0x2aa066b4ef0 "UTC"} thd = 0x0 print_invalid_query_pointer = 89 #2 <signal handler called> No symbol table info available. #3 0x000003ff8731ab26 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6 No symbol table info available. #4 0x000003ff8731d774 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6 No symbol table info available. #5 0x000003ff876eb8f8 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6 No symbol table info available. #6 0x000002aa045ff682 in tpool::waitable_task::wait (this=this@entry=0x2aa05980338 <purge_coordinator_task>, lk=@0x3ffd1776330: {_M_device = 0x2aa05980358 <purge_coordinator_task+32>, _M_owns = true}) at /home/buildbot/s390x-rhel-9/build/tpool/task.cc:63 No locals. #7 0x000002aa045ff794 in tpool::waitable_task::disable (this=this@entry=0x2aa05980338 <purge_coordinator_task>) at /home/buildbot/s390x-rhel-9/build/tpool/task.cc:80 lk = {_M_device = 0x2aa05980358 <purge_coordinator_task+32>, _M_owns = true} #8 0x000002aa0451bfb2 in srv_shutdown_purge_tasks () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1515 lk = {_M_device = 0xdeca4be0719c7ee8, _M_owns = false} lk = {_M_device = <optimized out>, _M_owns = <optimized out>} #9 srv_purge_shutdown () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0srv.cc:1580 history_size = 6 #10 0x000002aa0451d13c in innodb_preshutdown () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0start.cc:1993 first_time = false #11 innodb_preshutdown () at /home/buildbot/s390x-rhel-9/build/storage/innobase/srv/srv0start.cc:1979 first_time = false #12 0x000002aa0417fe42 in plugin_pre_shutdown (plugin=<optimized out>) at /home/buildbot/s390x-rhel-9/build/sql/handler.cc:1113 hton = <optimized out> #13 0x000002aa03f1f624 in plugin_foreach_with_mask (thd=thd@entry=0x0, func=func@entry=0x2aa0417fe20 <plugin_pre_shutdown(THD*, plugin_ref, void*)>, type=type@entry=1, state_mask=state_mask@entry=10, arg=arg@entry=0x0) at /home/buildbot/s390x-rhel-9/build/sql/sql_plugin.cc:2559 idx = 8 total = 1146 plugin = <optimized out> plugins = 0x3ffd17765e0 res = 0 '\000' #14 0x000002aa04183cac in ha_pre_shutdown () at /home/buildbot/s390x-rhel-9/build/sql/handler.cc:1120 No locals. #15 0x000002aa03e0add8 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /home/buildbot/s390x-rhel-9/build/sql/mysqld.cc:6001 please_close_stdin = <optimized out> ho_error = <optimized out> new_thread_stack_size = <optimized out> user = <optimized out> #16 0x000003ff872b08b2 in __libc_start_call_main () from /lib64/libc.so.6 No symbol table info available. #17 0x000003ff872b0990 in __libc_start_main_impl () from /lib64/libc.so.6 No symbol table info available. #18 0x000002aa03dfd1c0 in _start () at /home/buildbot/s390x-rhel-9/build/sql/my_decimal.h:147 No symbol table info available. I only included InnoDB threads above. We can see that both buf_LRU_get_free_block() and buf_flush_page_cleaner() are in an indefinite (untimed) wait. The caller of buf_LRU_get_free_block() is a purge worker task, which Thread 1 is waiting for. At one point during the development of MDEV-29445 I was able to reproduce this scenario deterministically on my local system.

            debarun made a valid point that we’d better use a timed wait in buf_LRU_get_free_block(), so that we can test and set buf_pool.LRU_warned and possibly display a warning message if the buf_flush_page_cleaner() thread appears to be hung. I do not currently think that the other review comments mandate any code changes.

            marko Marko Mäkelä added a comment - debarun made a valid point that we’d better use a timed wait in buf_LRU_get_free_block() , so that we can test and set buf_pool.LRU_warned and possibly display a warning message if the buf_flush_page_cleaner() thread appears to be hung. I do not currently think that the other review comments mandate any code changes.

            By design, InnoDB has always hung when permanently running out of buffer pool, for example when several threads are waiting to allocate a block, and all of the buffer pool is buffer-fixed by the active threads. This hang here occurs when the buffer pool is only temporarily running out and the situation can be rescued by writing out some dirty pages or evicting some clean pages.

            marko Marko Mäkelä added a comment - By design, InnoDB has always hung when permanently running out of buffer pool, for example when several threads are waiting to allocate a block, and all of the buffer pool is buffer-fixed by the active threads. This hang here occurs when the buffer pool is only temporarily running out and the situation can be rescued by writing out some dirty pages or evicting some clean pages.

            The parameter innodb_lru_flush_size became useless as part of this fix. debarun, maybe we should file a follow-up, make that parameter a data member of buf_pool and consider a patch like this:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index 46129a037b0..4f1b40e309c 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2054,7 +2054,7 @@ static void buf_flush_sync_for_checkpoint(lsn_t lsn)
                   /* We intend to only evict pages keeping maximum flush bandwidth for
                   flush list pages advancing checkpoint. However, if the LRU tail is full
                   of dirty pages, we might need some flushing. */
            -      std::ignore= buf_flush_LRU(srv_io_capacity);
            +      std::ignore= buf_flush_LRU(innodb_lru_flush_size);
                 mysql_mutex_unlock(&buf_pool.mutex);
             
                 mysql_mutex_lock(&buf_pool.flush_list_mutex);
            

            marko Marko Mäkelä added a comment - The parameter innodb_lru_flush_size became useless as part of this fix. debarun , maybe we should file a follow-up, make that parameter a data member of buf_pool and consider a patch like this: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index 46129a037b0..4f1b40e309c 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2054,7 +2054,7 @@ static void buf_flush_sync_for_checkpoint(lsn_t lsn) /* We intend to only evict pages keeping maximum flush bandwidth for flush list pages advancing checkpoint. However, if the LRU tail is full of dirty pages, we might need some flushing. */ - std::ignore= buf_flush_LRU(srv_io_capacity); + std::ignore= buf_flush_LRU(innodb_lru_flush_size); mysql_mutex_unlock(&buf_pool.mutex); mysql_mutex_lock(&buf_pool.flush_list_mutex);

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.