  1. MariaDB Server
  2. MDEV-31343

Another server hang with innodb_undo_log_truncate=ON

Details

    Description

      axel reproduced one more hang related to innodb_undo_log_truncate=ON, similar to MDEV-30180. Here is a description of a hang that was reproduced with innodb_use_native_aio=0:

      1. trx_purge_truncate_history() writes the message "InnoDB: Truncating" and is about to truncate an undo log tablespace.
      2. trx_purge_truncate_history() is busy-looping in a scan of buf_pool.flush_list because one of the pages belonging to the undo tablespace is write-fixed.
      3. In the short window in which trx_purge_truncate_history() releases and re-acquires buf_pool.flush_list_mutex, other threads that are waiting for the mutex fail to acquire it with this version of GNU libc. This is similar to MDEV-30180, which could only be reproduced in the same particular environment.
      4. buf_dblwr_t::flush_buffered_writes_completed() was waiting for log_sys.mutex in log_write_up_to(), while trying to write the block that trx_purge_truncate_history() is trying to lock.
      5. log_sys.mutex was held by buf_flush_page_cleaner(), which was waiting for buf_pool.flush_list_mutex.
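
      The steps above amount to a circular wait between three threads. A minimal sketch of that wait-for chain follows. The map-based model and the names WaitsFor, has_wait_cycle and reported_hang are hypothetical and not part of InnoDB, and the edge from trx_purge_truncate_history() to the doublewrite completion assumes that the latter is the thread holding the write-fix on the page. Strictly speaking, step 3 describes starvation rather than a permanently held mutex, so the last edge is an approximation.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model: waits_for[a] == b means "thread a is blocked on
// something that thread b currently holds".
using WaitsFor = std::map<std::string, std::string>;

// Follow the wait-for edges from `start`; report true if a thread is
// visited twice, i.e. the chain loops back on itself.
bool has_wait_cycle(const WaitsFor& waits_for, const std::string& start) {
  std::set<std::string> seen;
  std::string cur = start;
  while (seen.insert(cur).second) {
    auto it = waits_for.find(cur);
    if (it == waits_for.end()) return false;  // chain ends: no cycle
    cur = it->second;
  }
  return true;
}

// The hang as described in steps 2 to 5 (edges are an interpretation).
WaitsFor reported_hang() {
  return {
    // spins on the write-fixed page (step 2)
    {"trx_purge_truncate_history", "buf_dblwr_t::flush_buffered_writes_completed"},
    // waits for log_sys.mutex in log_write_up_to() (steps 4 and 5)
    {"buf_dblwr_t::flush_buffered_writes_completed", "buf_flush_page_cleaner"},
    // waits for buf_pool.flush_list_mutex (steps 3 and 5)
    {"buf_flush_page_cleaner", "trx_purge_truncate_history"},
  };
}
```

      In this model, removing any one edge (for example, letting buf_flush_page_cleaner() obtain buf_pool.flush_list_mutex) breaks the cycle.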

      A possible fix would be that trx_purge_truncate_history() buffer-fixes the block, releases buf_pool.flush_list_mutex, waits for an exclusive latch on the block and finally reacquires buf_pool.flush_list_mutex. In that way, the blocking of other threads is minimized. The buffer-fix will prevent the eviction or relocation of the block in the buffer pool while no mutex is held by trx_purge_truncate_history().
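
      The locking order of this proposal can be sketched with standard C++ primitives. This is only an illustration under stated assumptions, not the actual patch: Block, BufPool, latch_block_blocking and demo_latch_without_spinning are hypothetical names, the page latch is reduced to a plain std::mutex (exclusive mode only), and dropping the buffer-fix once the mutex is held again is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-ins; these are not InnoDB's real types.
struct Block {
  std::atomic<int> buf_fix_count{0};  // while > 0: no eviction or relocation
  std::mutex latch;                   // models only the exclusive page latch
};

struct BufPool {
  std::mutex flush_list_mutex;        // stand-in for buf_pool.flush_list_mutex
};

// Precondition: lk owns flush_list_mutex.
// Postcondition: lk owns it again and the caller holds block.latch.
void latch_block_blocking(std::unique_lock<std::mutex>& lk, Block& block) {
  assert(lk.owns_lock());
  block.buf_fix_count.fetch_add(1);  // 1. buffer-fix the block
  lk.unlock();                       // 2. release the flush list mutex
  block.latch.lock();                // 3. block (not spin) for the latch
  lk.lock();                         // 4. reacquire the flush list mutex
  block.buf_fix_count.fetch_sub(1);  // assumption: fix no longer needed
}

// A writer holds the latch and needs flush_list_mutex to finish, which
// would never happen if the caller kept spinning with the mutex.
bool demo_latch_without_spinning() {
  BufPool pool;
  Block block;
  std::promise<void> write_fixed;
  std::future<void> is_write_fixed = write_fixed.get_future();

  std::unique_lock<std::mutex> lk(pool.flush_list_mutex);
  std::thread writer([&] {
    std::lock_guard<std::mutex> page(block.latch);  // block is write-fixed
    write_fixed.set_value();
    std::lock_guard<std::mutex> fl(pool.flush_list_mutex);  // needs the mutex
  });
  is_write_fixed.wait();            // the writer now holds the latch
  latch_block_blocking(lk, block);  // lets the writer make progress
  bool ok = lk.owns_lock() && block.buf_fix_count.load() == 0;
  block.latch.unlock();
  lk.unlock();
  writer.join();
  return ok;
}
```

      Because the mutex is released before the blocking wait, the writer can always obtain it, and the buffer-fix covers the interval in which no mutex protects the block.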

          Activity

            axel Axel Schwenke added a comment -

            Commit f410444a76b from the bb-10.6-MDEV-31343 branch survived a 1-hour run of sysbench-tpcc. So the fix is most probably complete (without it the server hung within a few minutes).

            It still has a severe impact on performance. Enabling innodb_undo_log_truncate=ON led to a 50% performance loss (~3000 tps vs. ~6000 tps) in my benchmark.

            marko Marko Mäkelä added a comment -

            Also in MDEV-29401 and MDEV-30628 we concluded that something needs to be done about the history list length. One possibility would be that any thread that is acquiring an exclusive latch on an index page for other reasons will attempt to remove any purgeable history, so that the actual purge threads will have less work to do.

            marko Marko Mäkelä added a comment -

            The history list length issue was greatly improved by MDEV-32050.

            We experienced a similar problem again in this same environment, in MDEV-33009.

            People

              mleich Matthias Leich
              marko Marko Mäkelä