MariaDB Server / MDEV-34265

Possible hang during IO burst with innodb_flush_sync enabled

Details

    Description

      The issue was found while analyzing intermittent CI failures; the tests would fail with a timeout.
      CI failure: innodb.insert_into_empty
      CI failure: encryption.innochecksum
      [ fail ] timeout after 900 seconds

      Below are the relevant stack traces from two threads showing the issue. A user session thread and the purge thread get blocked, waiting for the page cleaner to free pages in the buffer pool (BP).

      The page cleaner thread has entered the "furious flush" stage and is busy flushing pages to pull the checkpoint LSN forward. It fails to progress because some dirty pages are already latched by mini-transactions (mtr) in session thread(s). This creates a deadlock.

      Thread 9 (Thread 0xffff85552fc0 (LWP 29458)):
      #0  0x0000ffff9d2b28cc in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      No symbol table info available.
      #1  0x0000ffff9d2b5c60 in pthread_cond_timedwait@@GLIBC_2.17 () from /lib64/libc.so.6
      No symbol table info available.
      #2  0x0000aaaab58536c8 in buf_LRU_get_free_block (have_mutex=have_mutex@entry=false) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0lru.cc:452
      #3  0x0000aaaab5853bf8 in buf_page_init_for_read (mode=mode@entry=132, page_id=page_id@entry={m_id = 800}, zip_size=zip_size@entry=0, unzip=<optimized out>) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0rea.cc:121
      #4  0x0000aaaab58543bc in buf_read_page_low (space=0xaaaadee50008, sync=sync@entry=true, mode=mode@entry=132, page_id={m_id = 800}, zip_size=zip_size@entry=0, unzip=unzip@entry=false) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0rea.cc:295
      #5  0x0000aaaab5855124 in buf_read_page (page_id=page_id@entry={m_id = 800}, zip_size=zip_size@entry=0) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0rea.cc:464
      #6  0x0000aaaab5843320 in buf_page_get_low (page_id={m_id = 800}, zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=1, guess=0x0, mode=16, mtr=0xffff855520d8, err=0x0, allow_ibuf_merge=false) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0buf.cc:2831
      #7  0x0000aaaab5843f2c in buf_page_get_gen (page_id=<optimized out>, zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=1, guess=guess@entry=0x0, mode=mode@entry=16, mtr=mtr@entry=0xffff855520d8, err=err@entry=0x0, allow_ibuf_merge=allow_ibuf_merge@entry=false) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0buf.cc:3246
      #8  0x0000aaaab57fd7c8 in purge_sys_t::get_page (this=this@entry=0xaaaab6bfdf80 <purge_sys>, id=id@entry={m_id = 800}) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/trx/trx0purge.cc:787
      #9  0x0000aaaab57fdba0 in purge_sys_t::choose_next_log (this=this@entry=0xaaaab6bfdf80 <purge_sys>) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/trx/trx0purge.cc:909
      #10 0x0000aaaab57fe6ec in purge_sys_t::fetch_next_rec (this=0xaaaab6bfdf80 <purge_sys>) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/trx/trx0purge.cc:1022
      #11 trx_purge_attach_undo_recs (n_purge_threads=4, thd=0xaaaadf1c51d8) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/trx/trx0purge.cc:1281
      #12 trx_purge (n_tasks=n_tasks@entry=4, history_size=<optimized out>) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/trx/trx0purge.cc:1406
       
      Thread 2 (Thread 0xffff95516fc0 (LWP 29374)):
      #3  0x0000aaaab5697a50 in pfs_end_mutex_wait_v1 (locker=<optimized out>, rc=0) at /home/buildbot/aarch64-fedora-40/build/storage/perfschema/pfs.cc:4006
      #4  0x0000aaaab5194cb8 in psi_mutex_lock (that=<optimized out>, file=<optimized out>, line=<optimized out>) at /home/buildbot/aarch64-fedora-40/build/mysys/my_thr_init.c:494
      #5  0x0000aaaab58476c4 in inline_mysql_mutex_lock (that=0xaaaab6c06760 <buf_dblwr+16>, src_file=0xaaaab5a9cba0 "/home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0dblwr.cc", src_line=727) at /home/buildbot/aarch64-fedora-40/build/include/mysql/psi/mysql_thread.h:746
      #6  buf_dblwr_t::flush_buffered_writes (this=0xaaaab6c06750 <buf_dblwr>) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0dblwr.cc:727
      #7  0x0000aaaab5190d24 in buf_flush_list (max_n=<optimized out>, lsn=lsn@entry=9419484) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0flu.cc:1576
      #8  0x0000aaaab584f574 in buf_flush_sync_for_checkpoint (lsn=9419484) at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0flu.cc:2187
      #9  buf_flush_page_cleaner () at /home/buildbot/aarch64-fedora-40/build/storage/innobase/buf/buf0flu.cc:2458
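
      When diagnosing a suspected occurrence of this hang on a live server, the buffer pool counters can show the telltale state: no free pages left while many pages remain dirty. A minimal diagnostic sketch, using the standard Innodb_buffer_pool_pages_* status variables:

          -- Free pages exhausted while dirty pages pile up is the state in
          -- which threads block inside buf_LRU_get_free_block().
          SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';
          SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
          -- The LOG section of the full engine status shows how far the
          -- checkpoint is lagging behind the current LSN.
          SHOW ENGINE INNODB STATUS\G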
      

          Activity

            Debarun Banerjee added a comment (edited)

            The issue can be reproduced on 10.11 with the following debug patch and the attached test.

            commit 3ee6f69d49a58422f994f51a0bd7a0cb1242a3dd (HEAD -> 10.11, origin/10.11)
            Date: Thu May 2 22:14:19 2024 +0200

            1. The source needs to be rebuilt after applying the patch for the debug checks/sleeps.
            2. The test needs to be run with a 4k page size.

                ./mtr --mem --mysqld=--innodb_page_size=4k  innodb.insert_into_empty
            

            • The test reproduces the hang in roughly every other run.
            • Don't use --repeat, as it reuses the same server and the LSN adjustments come out differently. If it doesn't reproduce on the first run, just retry.

            Debarun Banerjee added a comment

            When the checkpoint age goes beyond the sync flush threshold and buf_flush_sync_lsn is set, the page cleaner enters the "furious flush" stage to aggressively flush dirty pages from the flush list and pull the checkpoint LSN above the safe margin. In this stage, the page cleaner skips LRU flushing and eviction.

            In 10.6, all other threads rely entirely on the page cleaner to generate free pages. If free pages run out while the page cleaner is busy in the "furious flush" stage, a session thread can end up waiting for a free page in the middle of a mini-transaction (mtr) while holding latches on other pages.

            This, in turn, can prevent the page cleaner from flushing those very pages, stopping the checkpoint LSN from moving forward and creating a deadlock. Even short of a deadlock, it can create a stall and hang-like situation for a large BP with plenty of dirty pages to flush before the stage can finish.
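
            Whether a server is in, or approaching, this sync-flush condition can be estimated from the checkpoint age. A minimal check, assuming the Innodb_checkpoint_age and Innodb_checkpoint_max_age status variables reported by MariaDB's InnoDB:

                -- Checkpoint age = amount of log written since the latest checkpoint.
                -- Sync flushing kicks in once the age crosses the sync-flush
                -- threshold derived from the redo log size.
                SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%';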

            Fix: During furious flush, check and evict LRU pages after each flush iteration.

            I could reproduce the issue on 10.6 as well. The repeatability is about 1/2 in 10 runs, a little less frequent than on 10.11. With the fix, the test runs fine in 100 consecutive runs.
            https://github.com/MariaDB/server/pull/3298


            Marko Mäkelä added a comment

            Thank you. I think that the general idea is correct. I posted some suggestions.


            Debarun Banerjee added a comment

            While analyzing this problem, I found an independent issue that makes the "furious flush" stage unnecessarily long. This is more of a resource (I/O) usage issue, which could in turn affect performance. Currently the flush target is set to the current system LSN, which is far too aggressive compared to the 5.7 design.

            I will file a separate issue for it, which should ideally be fixed from 10.5 onwards.


            Michael Widenius added a comment

            Can this bug cause a performance degradation for the user?
            Can this be seen as a performance degradation from an earlier 10.6 version? If yes, in which version was the problem introduced?


            Debarun Banerjee added a comment

            monty The independent issue of the long "furious flush" stage could, in theory, impact performance for an I/O-bound workload when system I/O is under stress. I would say it is not a general case. It comes from earlier versions and is present in 10.5. I will investigate to find out at what point it was introduced and add all the details to the new bug I need to file. Thanks for the reminder.


            Michael Widenius added a comment

            A workaround for this bug is to do:

            SET GLOBAL innodb_flush_sync=OFF;

            At the same time, you should ensure that innodb_max_dirty_pages_pct_lwm is not 0, so that the InnoDB checkpoint age (the amount of log written since the latest checkpoint) remains low. If we get close to the limit, then all SQL write operations will be blocked until the checkpoint has been advanced and it is crash-safe to write more log. With innodb_flush_sync=OFF, page writing activity during this time will be throttled by innodb_io_capacity, which means that the server could be extremely slow during such a period.
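
            Put together, the workaround could look like the sketch below. The innodb_max_dirty_pages_pct_lwm and innodb_io_capacity values are illustrative assumptions, not recommendations; tune them to the workload and storage.

                -- Disable synchronous ("furious") flushing on checkpoint pressure.
                SET GLOBAL innodb_flush_sync=OFF;
                -- Keep low-water-mark background flushing active so the checkpoint
                -- age stays low; 0 would disable it. 10 (percent) is illustrative.
                SET GLOBAL innodb_max_dirty_pages_pct_lwm=10;
                -- With innodb_flush_sync=OFF, flushing is capped by this rate;
                -- raise it only if the storage can sustain the extra IOPS.
                SET GLOBAL innodb_io_capacity=1000;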


            People

              Assignee: Debarun Banerjee
              Reporter: Debarun Banerjee
              Votes: 0
              Watchers: 5

