Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 10.11.11
Description
Noticed this strange thing during intense test debugging. Not sure
what to do about it at this point, but wanted to at least make a
record of the observation for future reference.
I observe that very occasionally, less than one in 100000 times, there is a
2 minute (120 seconds) delay between a server crashing (deliberately)
in a test and the server restarting:
2025-02-23 15:31:55 4 [ERROR] mariadbd: Error writing file 'master-bin' (errno: 28 "No space left on device")
dispatch_command: crash_dispatch_command_before: now
SIGKILL myself
$ env _RR_TRACE_DIR=/kvm/src/my/dev/mariadb2/mysql-test/var/75/log/mysqld.1.rr rr record /kvm/src/my/dev/mariadb2/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/kvm/src/my/dev/mariadb2/mysql-test/var/75/my.cnf --log-output=file --binlog-format=statement --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --loose-skip-stack-trace --server-id=1 --loose-debug-sync-timeout=300 --loose-debug-gdb --loose-skip-stack-trace --loose-skip-innodb-use-native-aio --loose-innodb-flush-method=fsync
2025-02-23 15:33:57 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32183)
2025-02-23 15:33:57 0 [Warning] Changed limits: max_open_files: 1024 max_connections: 151 (was 151) table_cache: 421 (was 2000)
2025-02-23 15:33:57 0 [Note] Starting MariaDB 10.11.12-MariaDB-debug-log source revision 7e001b2a3cc1efc5ac182a179f061faacdf09085 server_uid fl7I2SDCO8XlZM/jjMzemQIRIGo= as process 147583
This was running with --rr:
(cd mysql-test/ && script -c './mtr rpl.rpl_gtid_crash{,,}{,,}{,,} --parallel=81 --repeat=1000 --rr')
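(As a side detail of the command above: the `{,,}{,,}{,,}` is shell brace expansion, not mtr syntax. Each `{,,}` triples the preceding word, so the test name is passed 27 times, which keeps the 81 parallel workers busy. A quick illustration, assuming bash or zsh, since plain POSIX sh does not do brace expansion:)

```shell
# Each {,,} multiplies the word by three, so three of them give 3*3*3 = 27
# copies of the test name on the mtr command line.
bash -c 'echo rpl.rpl_gtid_crash{,,}{,,}{,,} | wc -w'   # prints 27
```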
Note that the "No space left" error is from the test, not an actual
disk full error on the host.
In one 81 x 1000 --rr run, I got this 6 times in rpl.rpl_gtid_crash,
which crashes the master 8 times and the slave 6 times per run. All 6
occurrences were delays in starting the master; I am not sure whether that
is significant. It was not the same master crash in each instance, but the
120 seconds delay was very consistent.
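As a rough consistency check of the "less than one in 100000" estimate (my arithmetic, based on the numbers in this report):

```shell
# 81 parallel x 1000 repeats = 81000 test runs, each crashing the master
# 8 times; 6 of those master restarts were delayed.
echo $(( 81 * 1000 * 8 / 6 ))   # prints 108000, i.e. roughly 1 in 100000
```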
When debugging with rr replay, it looks like this is not a delay
inside the server code: the rr `elapsed-time` command reports that
the server exits after only a few seconds, both for the instance running
until the crash/delay and for the one started after it.
I don't really have any clear ideas at this point about what the cause is;
maybe some bug in mtr's detection of server crash and/or restart?
Side note: I would avoid using --repeat with --rr unless there really is a need to preserve the rr replay traces of all the non-failing runs. I would rather write a loop like the following to rerun the test until it fails:
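One possible shape of such a loop (a sketch; `run_until_failure` is an illustrative name, and the mtr invocation in the comment reuses the test name from the command above):

```shell
#!/bin/sh
# Sketch of a rerun-until-failure loop, as an alternative to --repeat:
# only the failing run's rr traces end up needing to be kept.
run_until_failure() {
    # Run "$@" repeatedly; stop at the first non-zero exit status.
    n=0
    while "$@"; do
        n=$((n + 1))
    done
    echo "failed after $n successful run(s)"
}

# Real-world usage would be something like:
#   cd mysql-test && run_until_failure ./mtr rpl.rpl_gtid_crash --rr
```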