Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 10.11.11
Description
Noticed this strange thing during intense test debugging. Not sure
what to do about it at this point, but wanted to at least make a
record of the observation for future reference.
I observe that very occasionally, less than one in 100000 times, there is a
2 minute (120 seconds) delay between a server crashing (deliberately)
in a test and the server restarting:
2025-02-23 15:31:55 4 [ERROR] mariadbd: Error writing file 'master-bin' (errno: 28 "No space left on device")
dispatch_command: crash_dispatch_command_before: now
SIGKILL myself
$ env _RR_TRACE_DIR=/kvm/src/my/dev/mariadb2/mysql-test/var/75/log/mysqld.1.rr rr record /kvm/src/my/dev/mariadb2/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/kvm/src/my/dev/mariadb2/mysql-test/var/75/my.cnf --log-output=file --binlog-format=statement --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --loose-skip-stack-trace --server-id=1 --loose-debug-sync-timeout=300 --loose-debug-gdb --loose-skip-stack-trace --loose-skip-innodb-use-native-aio --loose-innodb-flush-method=fsync
2025-02-23 15:33:57 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32183)
2025-02-23 15:33:57 0 [Warning] Changed limits: max_open_files: 1024 max_connections: 151 (was 151) table_cache: 421 (was 2000)
2025-02-23 15:33:57 0 [Note] Starting MariaDB 10.11.12-MariaDB-debug-log source revision 7e001b2a3cc1efc5ac182a179f061faacdf09085 server_uid fl7I2SDCO8XlZM/jjMzemQIRIGo= as process 147583
This was running with --rr:
(cd mysql-test/ && script -c './mtr rpl.rpl_gtid_crash{,,}{,,}{,,} --parallel=81 --repeat=1000 --rr')
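(As a side detail of the command above: the `{,,}{,,}{,,}` is shell brace expansion, not mtr syntax. Each `{,,}` triples the preceding word, so the test name is passed 27 times, which keeps the 81 parallel workers busy. A quick illustration, assuming bash or zsh, since plain POSIX sh does not do brace expansion:)

```shell
# Each {,,} multiplies the word by three, so three of them give 3*3*3 = 27
# copies of the test name on the mtr command line.
bash -c 'echo rpl.rpl_gtid_crash{,,}{,,}{,,} | wc -w'   # prints 27
```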
Note that the "No space left" error is from the test, not an actual
disk full error on the host.
In one 81 x 1000 --rr run, I got this 6 times in rpl.rpl_gtid_crash,
which crashes the master 8 times and the slave 6 times per run. All 6
occurrences were delays in starting the master; I am not sure whether that
is significant. It was not the same master crash in each instance, but the
120 seconds delay was very consistent.
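As a rough consistency check of the "less than one in 100000" estimate (my arithmetic, based on the numbers in this report):

```shell
# 81 parallel x 1000 repeats = 81000 test runs, each crashing the master
# 8 times; 6 of those master restarts were delayed.
echo $(( 81 * 1000 * 8 / 6 ))   # prints 108000, i.e. roughly 1 in 100000
```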
When debugging with rr replay, it looks like this is not a delay
inside the server code: the rr `elapsed-time` command reports that
the server exits after only a few seconds, both for the instance running
until the crash/delay and for the one started after it.
I don't really have any clear ideas at this point about what the cause is;
maybe some bug in mtr's detection of server crash and/or restart?
Side note: I would avoid using --repeat with --rr unless there really is a need to preserve the rr replay traces of all the non-failing runs. I would rather write a loop like the following to rerun the test until it fails:
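One possible shape of such a loop (a sketch; `run_until_failure` is an illustrative name, and the mtr invocation in the comment reuses the test name from the command above):

```shell
#!/bin/sh
# Sketch of a rerun-until-failure loop, as an alternative to --repeat:
# only the failing run's rr traces end up needing to be kept.
run_until_failure() {
    # Run "$@" repeatedly; stop at the first non-zero exit status.
    n=0
    while "$@"; do
        n=$((n + 1))
    done
    echo "failed after $n successful run(s)"
}

# Real-world usage would be something like:
#   cd mysql-test && run_until_failure ./mtr rpl.rpl_gtid_crash --rr
```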