Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24302

RESET MASTER hangs as Innodb does not report on binlog checkpoint

Details

    Description

      The failed one in http://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/22164
      commit 657fcdf430f39a3103dff51a6a2b2bd3 with the following stack trace

      server!inline_mysql_cond_wait(struct st_mysql_cond * that = 0x00007ffb`ec653b68, struct st_mysql_mutex * mutex = 0x00007ffb`ec653b38, char * src_file = 0x00007ffb`ea98b140 "D:\winx64-debug\build\src\sql\log.cc", unsigned int src_line = 0x10f8) [D:\winx64-debug\build\src\include\mysql\psi\mysql_thread.h @ 1222]
      server!MYSQL_BIN_LOG::reset_logs(class THD * thd = 0x00000165`d43d15c8, bool create_new_log = true, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [D:\winx64-debug\build\src\sql\log.cc @ 4345]
      server!reset_master(class THD * thd = 0x00000165`d43d15c8, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [D:\winx64-debug\build\src\sql\sql_repl.cc @ 3966]
      server!reload_acl_and_cache(class THD * thd = 0x00000165`d43d15c8, unsigned int64 options = 0x80, struct TABLE_LIST * tables = 0x00000000`00000000, int * write_to_binlog = 0x0000002a`c42fdc44) [D:\winx64-debug\build\src\sql\sql_reload.cc @ 362]
      server!mysql_execute_command(class THD * thd = 0x00000165`d43d15c8) [D:\winx64-debug\build\src\sql\sql_parse.cc @ 5480]
      server!mysql_parse(class THD * thd = 0x00000165`d43d15c8, char * rawbuf = 0x00000165`d44da280 "--- memory read error at address 0x00000165`d44da280 ---", unsigned int length = 0xc, class Parser_state * parser_state = 
      

      As it was never observed before the extra to 10.5 commits
      need examination to clear out
      possibility of missed out commit_checkpoint_notify_ha() invocation from innobase_mysql_log_notify() which is one of possibilities.
      657fcdf430f and 7b1252c03d7 are rated as potentially relevant to the failure showdown.

      A similar failure

      main.mysqldump-max 'innodb'              w2 [ fail ]  timeout after 900 seconds
      

      exists in http://buildbot.askmonty.org/buildbot/builders/win32-debug/builds/18570/steps/test/logs/stdio

      Upon more analysis the most probable suspect

      --- a/storage/innobase/handler/ha_innodb.cc
      +++ b/storage/innobase/handler/ha_innodb.cc
      @@ -4444,12 +4444,6 @@ innobase_mysql_log_notify(
       	struct pending_checkpoint *	entry;
       	struct pending_checkpoint *	last_ready;
       
      -	/* It is safe to do a quick check for NULL first without lock.
      -	Even if we should race, we will at most skip one checkpoint and
      -	take the next one, which is harmless. */
      -	if (!pending_checkpoint_list)
      -		return;
      -
      

      is identified. , to eliminate (the pasted block is from being tested fixes). Removal won't hurt performance in normal cases when binlog rotation is not frequent event (say not few in a second).

      Attachments

        Issue Links

          Activity

            The probability of hang could be improved by running with innodb_force_recovery=2 (to prevent the purge of InnoDB transaction history from running). Any ‘extra’ writes to the redo log (outside transaction commit) could ‘rescue’ the hang.

            marko Marko Mäkelä added a comment - The probability of hang could be improved by running with innodb_force_recovery=2 (to prevent the purge of InnoDB transaction history from running). Any ‘extra’ writes to the redo log (outside transaction commit) could ‘rescue’ the hang.

            I pushed some code cleanup to the branch bb-10.5-MDEV-24302 as well as Elkin’s test case for demonstrating the innodb_force_recovery=2 hang if a workaround is not present. I determined with rr replay that the missing log write of the last transaction commit is related to trx_t::flush_log_later and trx_t::active_commit_ordered, which to my understanding only exist to satisfy some requirement of the binlog.

            I am starting to think that it is not safe to read log_sys.lsn in innobase_hton->commit_checkpoint_request and expect that something is guaranteed to write everything up to that LSN to the log. If there is no pressure to perform a redo log checkpoint, that write might never occur. Only ‘important’ state transitions (such as transaction commit) are normally persisted to the redo log, and also that can be prevented by overriding the default setting of innodb_flush_log_at_trx_commit=1 or by the said trx_t flags. If some ‘unimportant’ write to the log happened after a transaction commit, we may wait forever for that ‘unimportant’ log_sys.lsn to be persisted.

            Also, in the cmake -DWITH_PMEM=ON case when the redo log is on persistent memory (MDEV-9905), we are missing a call to innobase_mysql_log_notify() from log_write().

            marko Marko Mäkelä added a comment - I pushed some code cleanup to the branch bb-10.5-MDEV-24302 as well as Elkin ’s test case for demonstrating the innodb_force_recovery=2 hang if a workaround is not present. I determined with rr replay that the missing log write of the last transaction commit is related to trx_t::flush_log_later and trx_t::active_commit_ordered , which to my understanding only exist to satisfy some requirement of the binlog. I am starting to think that it is not safe to read log_sys.lsn in innobase_hton->commit_checkpoint_request and expect that something is guaranteed to write everything up to that LSN to the log. If there is no pressure to perform a redo log checkpoint, that write might never occur. Only ‘important’ state transitions (such as transaction commit) are normally persisted to the redo log, and also that can be prevented by overriding the default setting of innodb_flush_log_at_trx_commit=1 or by the said trx_t flags. If some ‘unimportant’ write to the log happened after a transaction commit, we may wait forever for that ‘unimportant’ log_sys.lsn to be persisted. Also, in the cmake -DWITH_PMEM=ON case when the redo log is on persistent memory ( MDEV-9905 ), we are missing a call to innobase_mysql_log_notify() from log_write() .

            I figured out that with innodb_force_recovery=2, the srv_master_callback() or its predecessor srv_master_thread will never be registered. For that special case, we can initiate a log write directly without any buffering, just like our implementation of the interface for MyRocks does.

            I pushed a revised version to the branch. It includes some cleanup for readability and performance. The commit message identifies a change that potentially prevents the hang. An equivalent change was also present in Elkin’s version. wlad, please review it.

            I think that we must keep this ticket open for some time after pushing the tentative fix, to see if it really fixes the hangs, because the hangs are occurring rather rarely.

            marko Marko Mäkelä added a comment - I figured out that with innodb_force_recovery=2 , the srv_master_callback() or its predecessor srv_master_thread will never be registered. For that special case, we can initiate a log write directly without any buffering, just like our implementation of the interface for MyRocks does. I pushed a revised version to the branch. It includes some cleanup for readability and performance. The commit message identifies a change that potentially prevents the hang. An equivalent change was also present in Elkin ’s version. wlad , please review it. I think that we must keep this ticket open for some time after pushing the tentative fix, to see if it really fixes the hangs, because the hangs are occurring rather rarely.

            Elkin pointed out another race condition that I addressed in a follow-up fix. That fix will also change log_sys.lsn and log_sys.flushed_to_disk_lsn to use std::memory_order_acquire loads and std::memory_order_release stores. It should make no difference on instruction set architectures that use Total Store Ordering (IA-32, AMD64, SPARC), but there is some difference on ARM, POWER and RISC-V RVWMO, for example. Before 10.5, those variables were protected by log_sys.mutex. This follow-up fix may correct some race conditions related to page flushing that could have caused recovery failures on such ‘exotic’ platforms.

            marko Marko Mäkelä added a comment - Elkin pointed out another race condition that I addressed in a follow-up fix . That fix will also change log_sys.lsn and log_sys.flushed_to_disk_lsn to use std::memory_order_acquire loads and std::memory_order_release stores. It should make no difference on instruction set architectures that use Total Store Ordering (IA-32, AMD64, SPARC), but there is some difference on ARM, POWER and RISC-V RVWMO, for example. Before 10.5, those variables were protected by log_sys.mutex . This follow-up fix may correct some race conditions related to page flushing that could have caused recovery failures on such ‘exotic’ platforms.

            Unfortunately, it looks like the InnoDB side fixes were insufficient. I filed MDEV-25611 for another hang in RESET MASTER that was observed in 10.6 in the test versioning.rpl. I have reasons to believe that the failure affects 10.5 as well.

            marko Marko Mäkelä added a comment - Unfortunately, it looks like the InnoDB side fixes were insufficient. I filed MDEV-25611 for another hang in RESET MASTER  that was observed in 10.6 in the test versioning.rpl . I have reasons to believe that the failure affects 10.5 as well.

            People

              marko Marko Mäkelä
              Elkin Andrei Elkin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.