Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24302

RESET MASTER hangs as Innodb does not report on binlog checkpoint

Details

    Description

      The failed one in http://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/22164
      commit 657fcdf430f39a3103dff51a6a2b2bd3 with the following stack trace

      server!inline_mysql_cond_wait(struct st_mysql_cond * that = 0x00007ffb`ec653b68, struct st_mysql_mutex * mutex = 0x00007ffb`ec653b38, char * src_file = 0x00007ffb`ea98b140 "D:\winx64-debug\build\src\sql\log.cc", unsigned int src_line = 0x10f8) [D:\winx64-debug\build\src\include\mysql\psi\mysql_thread.h @ 1222]
      server!MYSQL_BIN_LOG::reset_logs(class THD * thd = 0x00000165`d43d15c8, bool create_new_log = true, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [D:\winx64-debug\build\src\sql\log.cc @ 4345]
      server!reset_master(class THD * thd = 0x00000165`d43d15c8, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [D:\winx64-debug\build\src\sql\sql_repl.cc @ 3966]
      server!reload_acl_and_cache(class THD * thd = 0x00000165`d43d15c8, unsigned int64 options = 0x80, struct TABLE_LIST * tables = 0x00000000`00000000, int * write_to_binlog = 0x0000002a`c42fdc44) [D:\winx64-debug\build\src\sql\sql_reload.cc @ 362]
      server!mysql_execute_command(class THD * thd = 0x00000165`d43d15c8) [D:\winx64-debug\build\src\sql\sql_parse.cc @ 5480]
      server!mysql_parse(class THD * thd = 0x00000165`d43d15c8, char * rawbuf = 0x00000165`d44da280 "--- memory read error at address 0x00000165`d44da280 ---", unsigned int length = 0xc, class Parser_state * parser_state = 
      

      As it was never observed before the extra to 10.5 commits
      need examination to clear out
      possibility of missed out commit_checkpoint_notify_ha() invocation from innobase_mysql_log_notify() which is one of possibilities.
      657fcdf430f and 7b1252c03d7 are rated as potentially relevant to the failure showdown.

      A similar failure

      main.mysqldump-max 'innodb'              w2 [ fail ]  timeout after 900 seconds
      

      exists in http://buildbot.askmonty.org/buildbot/builders/win32-debug/builds/18570/steps/test/logs/stdio

      Upon more analysis the most probable suspect

      --- a/storage/innobase/handler/ha_innodb.cc
      +++ b/storage/innobase/handler/ha_innodb.cc
      @@ -4444,12 +4444,6 @@ innobase_mysql_log_notify(
       	struct pending_checkpoint *	entry;
       	struct pending_checkpoint *	last_ready;
       
      -	/* It is safe to do a quick check for NULL first without lock.
      -	Even if we should race, we will at most skip one checkpoint and
      -	take the next one, which is harmless. */
      -	if (!pending_checkpoint_list)
      -		return;
      -
      

      is identified. , to eliminate (the pasted block is from being tested fixes). Removal won't hurt performance in normal cases when binlog rotation is not frequent event (say not few in a second).

      Attachments

        Issue Links

          Activity

            Also seen in http://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/22195/steps/test/logs/stdio ,
            innodb.group_commit_crash 'innodb' timeout with the same stack trace.

            wlad Vladislav Vaintroub added a comment - Also seen in http://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/22195/steps/test/logs/stdio , innodb.group_commit_crash 'innodb' timeout with the same stack trace.
            marko Marko Mäkelä added a comment - http://buildbot.askmonty.org/buildbot/builders/win32-debug/builds/18570/steps/test/logs/stdio shows a hang of main.mysqldump-max .

            I think that the InnoDB changes are related, not causing this bug (causation is not correlation).

            I suspect that the logic is broken from the start of MDEV-532 (MariaDB 10.0.11). Basically, the supposedly ‘safe’ optimization in innobase_mysql_log_notify() is losing an InnoDB log write event. If there are no further writes to the redo log (because not even any InnoDB background tasks are generating any redo log), then the threads may hang.

            I suspect that because InnoDB is causing much fewer unnecessary wake-ups in 10.5 since recently, other threads have more chances to run into race conditions. The exact mechanism of this race is not known yet, and it has not been verified whether removing the suspicious code would fix this race. After all, the tests are not failing deterministically.

            marko Marko Mäkelä added a comment - I think that the InnoDB changes are related, not causing this bug (causation is not correlation). I suspect that the logic is broken from the start of MDEV-532 (MariaDB 10.0.11). Basically, the supposedly ‘safe’ optimization in innobase_mysql_log_notify() is losing an InnoDB log write event. If there are no further writes to the redo log (because not even any InnoDB background tasks are generating any redo log), then the threads may hang. I suspect that because InnoDB is causing much fewer unnecessary wake-ups in 10.5 since recently, other threads have more chances to run into race conditions. The exact mechanism of this race is not known yet, and it has not been verified whether removing the suspicious code would fix this race. After all, the tests are not failing deterministically.
            Elkin Andrei Elkin added a comment -

            marko: Your guess turns out right. A race has been identified. The original idea worked by virtue of former over-abundant attempts to flush by Innodb. A patch is at testing.

            Elkin Andrei Elkin added a comment - marko : Your guess turns out right. A race has been identified. The original idea worked by virtue of former over-abundant attempts to flush by Innodb. A patch is at testing.
            Elkin Andrei Elkin added a comment -

            HNY, dear comrade!

            As you were involved, and the matter of the patch is actually not foreign to you,
            I'd greatly appreciate your check of
            `819b9fc86a1`, or bb-10.2-andrei that holds the latest patch to review.

            Andrei

            Elkin Andrei Elkin added a comment - HNY, dear comrade! As you were involved, and the matter of the patch is actually not foreign to you, I'd greatly appreciate your check of `819b9fc86a1`, or bb-10.2-andrei that holds the latest patch to review. Andrei

            Elkin, I think that we will need separate 10.2 and 10.5 versions of this fix. Before 10.5, log_sys.lsn and log_sys.flushed_to_disk_lsn were protected by log_sys.mutex. Starting with 10.5, they use std::atomic. In the 10.2 version, I think that you should protect pending_checkpoint_list with log_sys.mutex because it has to be acquired anyway. For 10.5, an alternative mechanism may be chosen.

            What is your testing plan for this? I think that this will need some stress testing. It should be noted that 10.3 and 10.6 are likely changing the timings compared to 10.2 and 10.5. So, I would like to see stress tests on 10.3 and 10.6 as well.

            marko Marko Mäkelä added a comment - Elkin , I think that we will need separate 10.2 and 10.5 versions of this fix. Before 10.5, log_sys.lsn and log_sys.flushed_to_disk_lsn were protected by  log_sys.mutex . Starting with 10.5, they use std::atomic . In the 10.2 version, I think that you should protect pending_checkpoint_list with log_sys.mutex because it has to be acquired anyway. For 10.5, an alternative mechanism may be chosen. What is your testing plan for this? I think that this will need some stress testing. It should be noted that 10.3 and 10.6 are likely changing the timings compared to 10.2 and 10.5. So, I would like to see stress tests on 10.3 and 10.6 as well.
            Elkin Andrei Elkin added a comment -

            marko: Thanks for a quick look!

            Your point of 10.2-10.5 was already found as I had started with the 10.5 to get a patch which was "downgraded" to 10.2 style of atomic interfaces.

            To the mutex protection suggestion I explained on Github why 'no need' (specifically to this backup of the proposal
            > because it has to be acquired anyway
            not really).

            To the testing plan, indeed a stress testing is a must, I did light ones, and alice has been entasked (through MDEV-24526) already to think of RQG-based (a sceleton is provided in my mtr-based test attached to the latter ticket).
            Sure I've run some manual tests too and more aggressive ones when MDEV-24526 was unfixed (sigh - we've lost that
            "fortune").

            Elkin Andrei Elkin added a comment - marko : Thanks for a quick look! Your point of 10.2-10.5 was already found as I had started with the 10.5 to get a patch which was "downgraded" to 10.2 style of atomic interfaces. To the mutex protection suggestion I explained on Github why 'no need' (specifically to this backup of the proposal > because it has to be acquired anyway not really). To the testing plan, indeed a stress testing is a must, I did light ones, and alice has been entasked (through MDEV-24526 ) already to think of RQG-based (a sceleton is provided in my mtr-based test attached to the latter ticket). Sure I've run some manual tests too and more aggressive ones when MDEV-24526 was unfixed (sigh - we've lost that "fortune").

            My understanding is that the code that was added in MDEV-232 with good intentions does not work in an ideal way. The hangs caused by an incorrect implementation were apparently often ‘rescued’ by something invoking a log write inside InnoDB. One such source of log writes is page flushing, which was improved in MDEV-23855.

            Why is innobase_checkpoint_request() consulting the latest log_sys_t::lsn at all? The latest log records might have been written by some internal activity, such as change buffer merge or the purge of transaction history, which would have nothing to do with the durability of a transaction COMMIT or XA PREPARE in the SQL layer. What we really need to keep track of is the mtr_t::commit_lsn() of the trx_write_serialisation_history() that is supposed to be durable. Possibly also the LSN of trx_prepare() is interesting; trx_flush_log_if_needed_low() is able to skip the log write in that code path as well.

            I think that we must properly propagate the LSN from the PREPARE or COMMIT to this code. The current implementation may involve unnecessary waits, in addition to suffering from the race condition in innobase_mysql_log_notify(), as highlighted in the Description.

            marko Marko Mäkelä added a comment - My understanding is that the code that was added in MDEV-232 with good intentions does not work in an ideal way. The hangs caused by an incorrect implementation were apparently often ‘rescued’ by something invoking a log write inside InnoDB. One such source of log writes is page flushing, which was improved in MDEV-23855 . Why is innobase_checkpoint_request() consulting the latest log_sys_t::lsn at all? The latest log records might have been written by some internal activity, such as change buffer merge or the purge of transaction history, which would have nothing to do with the durability of a transaction COMMIT or XA PREPARE in the SQL layer. What we really need to keep track of is the mtr_t::commit_lsn() of the trx_write_serialisation_history() that is supposed to be durable. Possibly also the LSN of trx_prepare() is interesting; trx_flush_log_if_needed_low() is able to skip the log write in that code path as well. I think that we must properly propagate the LSN from the PREPARE or COMMIT to this code. The current implementation may involve unnecessary waits, in addition to suffering from the race condition in innobase_mysql_log_notify() , as highlighted in the Description.

            I observed another failure in a 10.6-based branch, which I thought might be related to this one:

            main.mysqlbinlog-innodb 'innodb'         w3 [ fail ]  timeout after 900 seconds
                    Test ended at 2021-02-10 09:38:02
             
            Test case timeout after 900 seconds
             
            == D:/winx64-packages/build/mysql-test/var/3/log/mysqlbinlog-innodb.log == 
            SET TIMESTAMP=1000000000;
            CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb;
            CREATE DATABASE test2;
            RESET MASTER;
             
             == D:/winx64-packages/build/mysql-test/var/3/tmp/analyze-timeout-mysqld.1.err ==Output from cdb follows. Faulting thread is printed twice,with and without function parameters
            Search for STACK_TEXT to see the stack trace of 
            the faulting thread. Callstacks of other threads are printed after it.
            

            As far as I can tell, no thread is executing inside InnoDB, except buf_flush_page_cleaner(), which looks idle.

            marko Marko Mäkelä added a comment - I observed another failure in a 10.6-based branch , which I thought might be related to this one: main.mysqlbinlog-innodb 'innodb' w3 [ fail ] timeout after 900 seconds Test ended at 2021-02-10 09:38:02   Test case timeout after 900 seconds   == D:/winx64-packages/build/mysql-test/var/3/log/mysqlbinlog-innodb.log == SET TIMESTAMP=1000000000; CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb; CREATE DATABASE test2; RESET MASTER;   == D:/winx64-packages/build/mysql-test/var/3/tmp/analyze-timeout-mysqld.1.err ==Output from cdb follows. Faulting thread is printed twice,with and without function parameters Search for STACK_TEXT to see the stack trace of the faulting thread. Callstacks of other threads are printed after it. As far as I can tell, no thread is executing inside InnoDB, except buf_flush_page_cleaner() , which looks idle.

            I suggest to patch 10.5 directly, since it was found broken here. While older versions might have the bug as well, but it is compensated for. Also programming with atomics was rather different in the past.

            wlad Vladislav Vaintroub added a comment - I suggest to patch 10.5 directly, since it was found broken here. While older versions might have the bug as well, but it is compensated for. Also programming with atomics was rather different in the past.

            One more test joins the club:

            10.5 f6cb9e6e2dd0fb2e29a09030e74de7946139e1f4

            main.sp_trans_log 'innodb'               w2 [ fail ]  timeout after 900 seconds
                    Test ended at 2021-03-05 12:11:45
             
            Test case timeout after 900 seconds
             
            == D:/winx64-debug/build/mysql-test/var/2/log/sp_trans_log.log == 
            drop function if exists bug23333|
            drop table if exists t1,t2|
            CREATE TABLE t1 (a int  NOT NULL auto_increment primary key) ENGINE=MyISAM|
            CREATE TABLE t2 (a int  NOT NULL auto_increment, b int, PRIMARY KEY (a)) ENGINE=InnoDB|
            insert into t2 values (1,1)|
            create function bug23333() 
            RETURNS int(11)
            DETERMINISTIC
            begin
            insert into t1 values (null);
            select count(*) from t1 into @a;
            return @a;
            end|
            Warnings:
            Warning	1287	'<select expression> INTO <destination>;' is deprecated and will be removed in a future release. Please use 'SELECT <select list> INTO <destination> FROM...' instead
            reset master;
            

            marko Marko Mäkelä added a comment - One more test joins the club : 10.5 f6cb9e6e2dd0fb2e29a09030e74de7946139e1f4 main.sp_trans_log 'innodb' w2 [ fail ] timeout after 900 seconds Test ended at 2021-03-05 12:11:45   Test case timeout after 900 seconds   == D:/winx64-debug/build/mysql-test/var/2/log/sp_trans_log.log == drop function if exists bug23333| drop table if exists t1,t2| CREATE TABLE t1 (a int NOT NULL auto_increment primary key) ENGINE=MyISAM| CREATE TABLE t2 (a int NOT NULL auto_increment, b int, PRIMARY KEY (a)) ENGINE=InnoDB| insert into t2 values (1,1)| create function bug23333() RETURNS int(11) DETERMINISTIC begin insert into t1 values (null); select count(*) from t1 into @a; return @a; end| Warnings: Warning 1287 '<select expression> INTO <destination>;' is deprecated and will be removed in a future release. Please use 'SELECT <select list> INTO <destination> FROM...' instead reset master;

            One more test joins the club:

            10.6 a0558b8c96339e7d3d65d53249f8a36bf3c3008d

            versioning.rpl 'innodb,stmt,trx_id'      w1 [ fail ]  timeout after 900 seconds
                    Test ended at 2021-03-15 11:41:14
             
            Test case timeout after 900 seconds
             
            == D:/winx64-debug/build/mysql-test/var/1/log/rpl.log == 
            include/master-slave.inc
             
             == D:/winx64-debug/build/mysql-test/var/1/tmp/analyze-timeout-mysqld.2.err ==
            SHOW PROCESSLIST;
            Id	User	Host	db	Command	Time	State	Info	Progress
            14	root	localhost:31928	test	Query	960	starting	RESET MASTER	0.000
            16	root	localhost:32865	NULL	Query	0	starting	SHOW PROCESSLIST	0.000
             
             == D:/winx64-debug/build/mysql-test/var/1/tmp/analyze-timeout-mysqld.1.err ==
            SHOW PROCESSLIST;
            Id	User	Host	db	Command	Time	State	Info	Progress
            15	root	localhost:32864	NULL	Query	0	starting	SHOW PROCESSLIST	0.000
            

            marko Marko Mäkelä added a comment - One more test joins the club : 10.6 a0558b8c96339e7d3d65d53249f8a36bf3c3008d versioning.rpl 'innodb,stmt,trx_id' w1 [ fail ] timeout after 900 seconds Test ended at 2021-03-15 11:41:14   Test case timeout after 900 seconds   == D:/winx64-debug/build/mysql-test/var/1/log/rpl.log == include/master-slave.inc   == D:/winx64-debug/build/mysql-test/var/1/tmp/analyze-timeout-mysqld.2.err == SHOW PROCESSLIST; Id User Host db Command Time State Info Progress 14 root localhost:31928 test Query 960 starting RESET MASTER 0.000 16 root localhost:32865 NULL Query 0 starting SHOW PROCESSLIST 0.000   == D:/winx64-debug/build/mysql-test/var/1/tmp/analyze-timeout-mysqld.1.err == SHOW PROCESSLIST; Id User Host db Command Time State Info Progress 15 root localhost:32864 NULL Query 0 starting SHOW PROCESSLIST 0.000

            The probability of hang could be improved by running with innodb_force_recovery=2 (to prevent the purge of InnoDB transaction history from running). Any ‘extra’ writes to the redo log (outside transaction commit) could ‘rescue’ the hang.

            marko Marko Mäkelä added a comment - The probability of hang could be improved by running with innodb_force_recovery=2 (to prevent the purge of InnoDB transaction history from running). Any ‘extra’ writes to the redo log (outside transaction commit) could ‘rescue’ the hang.

            I pushed some code cleanup to the branch bb-10.5-MDEV-24302 as well as Elkin’s test case for demonstrating the innodb_force_recovery=2 hang if a workaround is not present. I determined with rr replay that the missing log write of the last transaction commit is related to trx_t::flush_log_later and trx_t::active_commit_ordered, which to my understanding only exist to satisfy some requirement of the binlog.

            I am starting to think that it is not safe to read log_sys.lsn in innobase_hton->commit_checkpoint_request and expect that something is guaranteed to write everything up to that LSN to the log. If there is no pressure to perform a redo log checkpoint, that write might never occur. Only ‘important’ state transitions (such as transaction commit) are normally persisted to the redo log, and also that can be prevented by overriding the default setting of innodb_flush_log_at_trx_commit=1 or by the said trx_t flags. If some ‘unimportant’ write to the log happened after a transaction commit, we may wait forever for that ‘unimportant’ log_sys.lsn to be persisted.

            Also, in the cmake -DWITH_PMEM=ON case when the redo log is on persistent memory (MDEV-9905), we are missing a call to innobase_mysql_log_notify() from log_write().

            marko Marko Mäkelä added a comment - I pushed some code cleanup to the branch bb-10.5-MDEV-24302 as well as Elkin ’s test case for demonstrating the innodb_force_recovery=2 hang if a workaround is not present. I determined with rr replay that the missing log write of the last transaction commit is related to trx_t::flush_log_later and trx_t::active_commit_ordered , which to my understanding only exist to satisfy some requirement of the binlog. I am starting to think that it is not safe to read log_sys.lsn in innobase_hton->commit_checkpoint_request and expect that something is guaranteed to write everything up to that LSN to the log. If there is no pressure to perform a redo log checkpoint, that write might never occur. Only ‘important’ state transitions (such as transaction commit) are normally persisted to the redo log, and also that can be prevented by overriding the default setting of innodb_flush_log_at_trx_commit=1 or by the said trx_t flags. If some ‘unimportant’ write to the log happened after a transaction commit, we may wait forever for that ‘unimportant’ log_sys.lsn to be persisted. Also, in the cmake -DWITH_PMEM=ON case when the redo log is on persistent memory ( MDEV-9905 ), we are missing a call to innobase_mysql_log_notify() from log_write() .

            I figured out that with innodb_force_recovery=2, the srv_master_callback() or its predecessor srv_master_thread will never be registered. For that special case, we can initiate a log write directly without any buffering, just like our implementation of the interface for MyRocks does.

            I pushed a revised version to the branch. It includes some cleanup for readability and performance. The commit message identifies a change that potentially prevents the hang. An equivalent change was also present in Elkin’s version. wlad, please review it.

            I think that we must keep this ticket open for some time after pushing the tentative fix, to see if it really fixes the hangs, because the hangs are occurring rather rarely.

            marko Marko Mäkelä added a comment - I figured out that with innodb_force_recovery=2 , the srv_master_callback() or its predecessor srv_master_thread will never be registered. For that special case, we can initiate a log write directly without any buffering, just like our implementation of the interface for MyRocks does. I pushed a revised version to the branch. It includes some cleanup for readability and performance. The commit message identifies a change that potentially prevents the hang. An equivalent change was also present in Elkin ’s version. wlad , please review it. I think that we must keep this ticket open for some time after pushing the tentative fix, to see if it really fixes the hangs, because the hangs are occurring rather rarely.

            Elkin pointed out another race condition that I addressed in a follow-up fix. That fix will also change log_sys.lsn and log_sys.flushed_to_disk_lsn to use std::memory_order_acquire loads and std::memory_order_release stores. It should make no difference on instruction set architectures that use Total Store Ordering (IA-32, AMD64, SPARC), but there is some difference on ARM, POWER and RISC-V RVWMO, for example. Before 10.5, those variables were protected by log_sys.mutex. This follow-up fix may correct some race conditions related to page flushing that could have caused recovery failures on such ‘exotic’ platforms.

            marko Marko Mäkelä added a comment - Elkin pointed out another race condition that I addressed in a follow-up fix . That fix will also change log_sys.lsn and log_sys.flushed_to_disk_lsn to use std::memory_order_acquire loads and std::memory_order_release stores. It should make no difference on instruction set architectures that use Total Store Ordering (IA-32, AMD64, SPARC), but there is some difference on ARM, POWER and RISC-V RVWMO, for example. Before 10.5, those variables were protected by log_sys.mutex . This follow-up fix may correct some race conditions related to page flushing that could have caused recovery failures on such ‘exotic’ platforms.

            Unfortunately, it looks like the InnoDB side fixes were insufficient. I filed MDEV-25611 for another hang in RESET MASTER that was observed in 10.6 in the test versioning.rpl. I have reasons to believe that the failure affects 10.5 as well.

            marko Marko Mäkelä added a comment - Unfortunately, it looks like the InnoDB side fixes were insufficient. I filed MDEV-25611 for another hang in RESET MASTER  that was observed in 10.6 in the test versioning.rpl . I have reasons to believe that the failure affects 10.5 as well.

            People

              marko Marko Mäkelä
              Elkin Andrei Elkin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.