|
I would suggest that this code be refactored. The handlerton::commit_checkpoint_request and trx_commit_complete_for_mysql() should be removed altogether, and InnoDB should notify the binlog via some simpler interface once the transaction state change has become persistent.
Such simplification would make it easy to streamline the transaction commits further in MDEV-14462.
|
|
For what it is worth, after MDEV-25948 the calls to log_write_up_to() should be more frequent again, and the probability of these hangs might become lower. (Every single update of log_sys.flushed_to_disk_lsn has the potential to ‘rescue’ a hung RESET MASTER.)
|
|
I have seen mysqladmin shutdown hang while RESET MASTER was running; it is quite easily reproducible/repeatable. Only a single input SQL thread is necessary. This used the testcase from MDEV-24660, with RESET MASTER instead of SHUTDOWN and with RESET MASTER repeated many more times in the input SQL, which was replayed over and over in random SQL order. Elkin has analyzed those rr traces. Reproducibility is 3 to 4 out of every 10 pquery 15-second trials with this setup. The only mysqld option required is --log-bin. It is better to use the file (linked first in this comment, and here), but for reference the SQL is listed in this comment.
|
|
angelique.sklavounos, what do you think about enhancing versioning.rpl to expose the timeout reported in the Description more reliably?
Currently there is only one BB trace available, with very few details (perhaps it was again RESET MASTER run by rpl_end.inc).
I suggest creating a new test that runs the body of versioning.rpl in a loop (say 10 times, but keeping the execution time under the big-test limit), with each iteration ended by an explicit RESET MASTER.
That should give us a 10x speed-up in chasing the cause of the hang.
|
|
Hi Elkin, the test body looped 10 times was committed here: https://github.com/MariaDB/server/commit/1e82f34a33387432f1e91c18cfb1dba6ee646009
The test ran against the winx64-debug platform here and passed: https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/34673
In cross-ref I see no failures for this test. However, there have been no failures for versioning.rpl with "timeout after 900 seconds" since 2021-06-08.
I pushed the looped test against a recent version of 10.6, but I can revert the code back to 2ceadb39 if that would help.
|
|
To better facilitate testing, I reverted MDEV-25948 and added the test. I had no luck in reproducing the hang, but I did not try really hard.
|
|
Branch bb-10.6-MDEV-25611 has the 10x-looped versioning.rpl_25611 test, added on top of 762bcb81b5b~1 (the parent of the first commit of MDEV-25948).
Dummy commits were made to the branch, and the winx64-debug builder passed each time:
979406 https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/34935
379d28 https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/34986
87dd2d https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/35005
bdfb4c https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/35044
0a76df https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/35047
To better facilitate testing, I also reverted MDEV-25948 and added the test on a second branch; I had no luck in reproducing the hang, but I did not try really hard.
With this branch (bb-10.6-MDEV-25948-MDEV-25611), the winx64-debug builder passed each time:
fdb873 https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/35018
30fcb6 https://buildbot.askmonty.org/buildbot/builders/winx64-debug/builds/35046
Locally I ran ./mtr --repeat=100 --parallel=10 versioning.rpl_25611{,,,} but could not reproduce the hang.
|
|
Based on this comment, I am unconfirming this issue, but I will continue to try to reproduce the hang.
|
|
It seems to me that the debug injection --debug-dbug=d,ib_log_checkpoint_avoid, which should not affect correctness in any way, reproduces some interesting failures.
On an updated test branch, each test in the following invocation fails for me:
./mtr --parallel=auto --mysqld=--debug-dbug=d,ib_log_checkpoint_avoid --max-test-fail=0 --force \
|
binlog{,_encryption}.binlog_incident rpl.rpl_skip_incident binlog.binlog_killed_simulate rpl.rpl_loaddata_fatal
|
The test rpl.rpl_loaddata_fatal will be terminated by some timeout; all other tests fail due to result differences that suggest that fewer binlog records were written.
rpl.rpl_loaddata_fatal 'stmt' w5 [ fail ]
|
Test ended at 2022-11-15 09:04:07
|
|
CURRENT_TEST: rpl.rpl_loaddata_fatal
|
mysqltest: In included file "./include/wait_for_slave_param.inc":
|
included from ./include/wait_for_slave_sql_error.inc at line 41:
|
included from ./include/wait_for_slave_sql_error_and_skip.inc at line 57:
|
included from /mariadb/10.6m/mysql-test/suite/rpl/t/rpl_loaddata_fatal.test at line 21:
|
At line 119: Timeout in include/wait_for_slave_param.inc
|
That test has also failed recently on buildbot, possibly also due to a timeout that caused the test harness to terminate the server with SIGABRT:
|
bb-10.5-andrei-MDEV-29322 e3954fdadfb52a1b1113ef891281cb7dfc42660a
|
rpl.rpl_loaddata_fatal 'stmt' w19 [ fail ]
|
Test ended at 2022-08-31 19:03:35
|
|
CURRENT_TEST: rpl.rpl_loaddata_fatal
|
mysqltest: At line 25: failed in 'select master_pos_wait('master-bin.000001', 1221, 300, '')': 2013: Lost connection to MySQL server during query
|
|
|
I built 3dd4bad4b3c7681 on a Windows 10 machine and tried running versioning.rpl_25611 with --debug-dbug=d,ib_log_checkpoint_avoid, but so far the tests have passed.
|
|
I tried again, and I believe I was able to recreate the hang with RESET MASTER, using --debug-dbug=d,ib_log_checkpoint_avoid. The var dir has been placed here:
https://drive.google.com/file/d/16I1nttAiQDzN1oPbBVRp4R_2C3tfdDkH/view?usp=sharing
|
10.6 3dd4bad4b3c7681
|
versioning.rpl_25611 'innodb,mix,timestamp' w2 [ 45 fail ] timeout after 900 seconds
|
Test ended at 2022-11-29 09:15:12
|
|
Test case timeout after 900 seconds
|
|
== C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/log/rpl_25611.log ==
|
Table Create Table
|
t1 CREATE TABLE `t1` (
|
`a` int(11) DEFAULT NULL,
|
`b` int(11) DEFAULT NULL
|
) ENGINE=ENGINE DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci WITH SYSTEM VERSIONING
|
connection master;
|
drop table t1, t2;
|
create table t1 (i int) with system versioning partition by system_time limit 8 ( partition p1 history, partition p2 history, partition pn current );
|
insert into t1 values (1);
|
update t1 set i = 1;
|
update t1 set i = 0;
|
connection slave;
|
connection master;
|
drop table t1;
|
connection slave;
|
connection slave;
|
include/stop_slave.inc
|
reset slave;
|
include/start_slave.inc
|
reset master;
|
|
== C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/tmp/analyze-timeout-mysqld.1.err ==
|
SHOW PROCESSLIST;
|
Id User Host db Command Time State Info Progress
|
855 root localhost:54625 NULL Binlog Dump 921 Master has sent all binlog to slave; waiting for more updates NULL 0.000
|
856 root localhost:58844 NULL Query 0 starting SHOW PROCESSLIST 0.000
|
|
== C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/tmp/analyze-timeout-mysqld.2.err ==
|
SHOW PROCESSLIST;
|
Id User Host db Command Time State Info Progress
|
2536 root localhost:54442 test Query 921 starting reset master 0.000
|
2573 system user NULL Slave_IO 921 Waiting for master to send event NULL 0.000
|
2574 system user test Slave_SQL 921 closing tables DROP TABLE `t1` /* generated by server */ 0.000
|
2576 root localhost:58845 NULL Query 0 starting SHOW PROCESSLIST 0.000
|
|
mysqltest failed but provided no output
|
The result from queries just before the failure was:
|
< snip >
|
Table Create Table
|
t1 CREATE TABLE `t1` (
|
`a` int(11) DEFAULT NULL,
|
`b` int(11) DEFAULT NULL
|
) ENGINE=ENGINE DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci WITH SYSTEM VERSIONING
|
connection master;
|
drop table t1, t2;
|
create table t1 (i int) with system versioning partition by system_time limit 8 ( partition p1 history, partition p2 history, partition pn current );
|
insert into t1 values (1);
|
update t1 set i = 1;
|
update t1 set i = 0;
|
connection slave;
|
connection master;
|
drop table t1;
|
connection slave;
|
connection slave;
|
include/stop_slave.inc
|
reset slave;
|
include/start_slave.inc
|
reset master;
|
|
|
- found 'mar3C75.tmp.dmp', moving it to 'C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/log/versioning.rpl_25611-innodb,mix,timestamp'
|
- found 'mariadb-test.dmp', moving it to 'C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/log/versioning.rpl_25611-innodb,mix,timestamp'
|
- found 'mariadbd.dmp', moving it to 'C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/log/versioning.rpl_25611-innodb,mix,timestamp'
|
- found 'mar3C75.tmp.dmp' (0/5)
|
|
Trying 'cdb' to get a backtrace
|
Cannot find cdb. Please Install Debugging tools for Windows
|
from http://www.microsoft.com/whdc/devtools/debugging/install64bit.mspx (native x64 version)
|
- found 'mariadb-test.dmp' (1/5)
|
|
Trying 'cdb' to get a backtrace
|
Cannot find cdb. Please Install Debugging tools for Windows
|
from http://www.microsoft.com/whdc/devtools/debugging/install64bit.mspx (native x64 version)
|
- found 'mariadbd.dmp' (2/5)
|
|
Trying 'cdb' to get a backtrace
|
Cannot find cdb. Please Install Debugging tools for Windows
|
from http://www.microsoft.com/whdc/devtools/debugging/install64bit.mspx (native x64 version)
|
- saving 'C:/Users/sepol/mariadb/server/bld/mysql-test/var/2/log/versioning.rpl_25611-innodb,mix,timestamp/' to 'C:/Users/sepol/mariadb/server/bld/mysql-test/var/log/versioning.rpl_25611-innodb,mix,timestamp/'
|
|
Only 1 of 15 completed.
|
--------------------------------------------------------------------------
|
The servers were restarted 0 times
|
Spent 25301.472 of 3516 seconds executing testcases
|
|
Failure: Failed 1/45 tests, 97.78% were successful.
|
|
Failing test(s): versioning.rpl_25611
|
|
The log files in var/log may give you some hint of what went wrong.
|
|
If you want to report this error, please read first the documentation
|
at http://dev.mysql.com/doc/mysql/en/mysql-test-suite.html
|
|
mysql-test-run: *** ERROR: there were failing test cases
|
|
|
The RESET MASTER hang was reproduced without --debug-dbug=d,ib_log_checkpoint_avoid, on a machine shared with Elkin.
|
On an updated test branch, each test in the following invocation fails for me:
./mtr --parallel=auto --mysqld=--debug-dbug=d,ib_log_checkpoint_avoid --max-test-fail=0 --force \
|
binlog{,_encryption}.binlog_incident rpl.rpl_skip_incident binlog.binlog_killed_simulate rpl.rpl_loaddata_fatal
|
I had an idea that the following patch might fix this, but that was not the case.
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index dba0352b5ce..e5527d04601 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -4776,7 +4776,7 @@ void log_flush_notify(lsn_t flush_lsn)
   {
     mysql_mutex_lock(&log_requests.mutex);
     pending= log_requests.start.load(std::memory_order_relaxed);
-    log_flush_notify_and_unlock(pending, flush_lsn);
+    log_flush_notify_and_unlock(pending, log_sys.get_flushed_lsn());
   }
 }

@@ -4849,7 +4849,7 @@ static void innodb_log_flush_request(void *cookie)
      persistent, we will notify old requests of completion. Note:
      log_flush_notify() may skip some notifications because it is
      basically assuming that the list is in ascending order of LSN. */
-  log_flush_notify(flush_lsn);
+  log_flush_notify(log_sys.get_flushed_lsn());
   commit_checkpoint_notify_ha(cookie);
 }
|
|
If I omit the option --debug-dbug=d,ib_log_checkpoint_avoid, then all tests will pass.
|
|
Seems to be a debug_dbug syntax problem.
Using --debug-dbug=d,... resets any other debug_dbug settings, causing some failures.
Using this command I get no failures:
./mtr --parallel=auto --mysqld=--debug-dbug=+d,ib_log_checkpoint_avoid --max-test-fail=0 --force binlog{,_encryption}.binlog_incident rpl.rpl_skip_incident binlog.binlog_killed_simulate rpl.rpl_loaddata_fatal
|
Maybe some of the other RQG tests should be re-run to see whether they can reproduce the hang?
|
|
I created a test case that reliably reproduces a hang in RESET MASTER caused by waiting for a binlog checkpoint response that never arrives.
https://github.com/MariaDB/server/commit/008df8e4d689eaa205203923432d7696eae7e0f0
The test case disables the InnoDB code that syncs the redo log once per second; otherwise the hang does not occur.
But this is still an interesting data point. It points to a design problem in RESET MASTER: it can hang if the InnoDB redo log does not get synced to the end for some reason.
And if a binlog checkpoint notification gets lost for some reason (unknown at present), that could similarly cause a RESET MASTER hang.
|
|
elenst provided a list of hangs on Microsoft Windows similar to the one in the Description. The last such failure occurred on 2021-10-13 on a mariadb-10.6.4 based branch. As far as I can tell, the extra changes on top of the mariadb-10.6.4 release tag should not contribute to this. The failure was like this:
innodb.group_commit_crash 'innodb' w2 [ fail ] timeout after 900 seconds
|
…
|
SHOW BINLOG EVENTS LIMIT 4,1;
|
Log_name Pos Event_type Server_id End_log_pos Info
|
master-bin.000001 # Query 1 # use `test`; insert into t1 select * from t2
|
delete from t1;
|
SET binlog_format= mixed;
|
RESET MASTER;
|
I do not see any hung InnoDB threads. The thread executing RESET MASTER is waiting here:
server!pthread_cond_timedwait(struct _RTL_CONDITION_VARIABLE * cond = 0x00007fff`0adcbb28, struct _RTL_CRITICAL_SECTION * mutex = 0x00007fff`0adcbaf8, struct timespec * abstime = 0x00000000`00000000) [C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildmysysmy_wincond.c @ 82]
|
server!pthread_cond_wait(struct _RTL_CONDITION_VARIABLE * cond = 0x00007fff`0adcbb28, struct _RTL_CRITICAL_SECTION * mutex = 0x00007fff`0adcbaf8) [C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildmysysmy_wincond.c @ 91]
|
server!psi_cond_wait(struct st_mysql_cond * that = 0x00007fff`0adcbb28, struct st_mysql_mutex * mutex = 0x00007fff`0adcbaf8, char * file = 0x00007fff`08c6c2c0 "C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildsqllog.cc", unsigned int line = 0x1105) [C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildmysysmy_thr_init.c @ 596]
|
server!inline_mysql_cond_wait(struct st_mysql_cond * that = 0x00007fff`0adcbb28, struct st_mysql_mutex * mutex = 0x00007fff`0adcbaf8, char * src_file = 0x00007fff`08c6c2c0 "C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildsqllog.cc", unsigned int src_line = 0x1105) [C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildincludemysqlpsimysql_thread.h @ 1070]
|
server!MYSQL_BIN_LOG::reset_logs(class THD * thd = 0x000002f3`5efac5b8, bool create_new_log = true, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [C:UsersjenkinsworkspaceESDebug-WIN-MTRwin_buildsqllog.cc @ 4358]
|
It should be noted that MDEV-25948 was already implemented in MariaDB Server 10.6.3, so it cannot have "fixed" this: the failure occurred on a branch that is ahead of mariadb-10.6.4 by a number of commits.
The last failure on a less modified MariaDB Server branch (1 additional commit on a main branch) was right before MDEV-25948 had been fixed in the main branch.
It is unclear to me what fixed the problem, and I am not convinced that it has really been fixed. The probability of the hang may have been lowered by some change on the replication side, or by some change to InnoDB page flushing (which drives log checkpoints and can trigger log writes) that makes the hang harder to hit.
knielsen, thank you for looking at this and creating a more usable test case.
|
|
Per the analysis done so far, the hang most probably was (and maybe still is) caused by InnoDB not flushing the redo log to disk even though RESET MASTER "thinks" it should, because there is no real data modification to log.
That suggests a simple workaround for the user: to unhang the RESET MASTER, execute any dummy modification involving InnoDB.
It also suggests a straightforward fix: have the server do the same when a timeout elapses.
|
|
Automated message:
----------------------------
Since this issue has not been updated for 6 weeks, it's time to move it back to Stalled.
|