Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-36148

Rare 2 minute delay in mtr restarting server after crash

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.11.11
    • None
    • Tests
    • None

    Description

      Noticed this strance thing during intense test debugging. Not sure
      what to do about it at this point, but wanted to at least make a
      record of the observation for future reference.

      I observe that very occationally, less than one in 100000, there is a
      2 minute (120 seconds) delay between a server crashing (deliberately)
      in a test and the server restarting:

      2025-02-23 15:31:55 4 [ERROR] mariadbd: Error writing file 'master-bin' (errno: 28 "No space left on device")
      dispatch_command: crash_dispatch_command_before: now
      SIGKILL myself
      $ env _RR_TRACE_DIR=/kvm/src/my/dev/mariadb2/mysql-test/var/75/log/mysqld.1.rr rr record /kvm/src/my/dev/mariadb2/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/kvm/src/my/dev/mariadb2/mysql-test/var/75/my.cnf --log-output=file --binlog-format=statement --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --loose-skip-stack-trace --server-id=1 --loose-debug-sync-timeout=300 --loose-debug-gdb --loose-skip-stack-trace --loose-skip-innodb-use-native-aio --loose-innodb-flush-method=fsync
      2025-02-23 15:33:57 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32183)
      2025-02-23 15:33:57 0 [Warning] Changed limits: max_open_files: 1024  max_connections: 151 (was 151)  table_cache: 421 (was 2000)
      2025-02-23 15:33:57 0 [Note] Starting MariaDB 10.11.12-MariaDB-debug-log source revision 7e001b2a3cc1efc5ac182a179f061faacdf09085 server_uid fl7I2SDCO8XlZM/jjMzemQIRIGo= as process 147583
      

      This was running with --rr:

      (cd mysql-test/ && script -c './mtr rpl.rpl_gtid_crash{,,}{,,}{,,} --parallel=81 --repeat=1000 --rr')
      

      Note that the "No space left" error is from the test, not an actual
      disk full error on the host.

      In one 81 x 1000 --rr run, I got this 6 times in rpl.rpl_gtid_crash,
      which crashes the master 8 times and the slave 6 times. All 6 were
      delays in starting the master, not sure if that is significant. It was
      not the same master crash in each instance, but the 120 seconds delay
      was very consistent.

      When debugging with rr replay, it looks like this is not a delay
      inside the server code, because rr `elapsed-time` command reports that
      the server exists after a few seconds, both the one running until the
      crash/delay, and the one after.

      Don't really have any clear ideas at this point what the cause is,
      maybe some bug in mtr detection of server crash and/or restart?

      Attachments

        Activity

          Side note: I would avoid using repeat with rr, unless there really is a need to preserve all rr replay traces for the non-failing runs. I’d rather write a loop like the following in order to rerun the test until it fails:

          while ./mtr rpl.rpl_gtid_crash{,,}{,,}{,,} --parallel=81 --rr
          do :
          done
          

          marko Marko Mäkelä added a comment - Side note: I would avoid using repeat with rr , unless there really is a need to preserve all rr replay traces for the non-failing runs. I’d rather write a loop like the following in order to rerun the test until it fails: while . /mtr rpl.rpl_gtid_crash{,,}{,,}{,,} --parallel=81 --rr do : done

          People

            Unassigned Unassigned
            knielsen Kristian Nielsen
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.