[MDEV-25802] mtr: race condition if the test quickly restarts twice - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.2.40, 10.3.31, 10.4.21, 10.5.12, 10.6.4
Component/s: Tests
Labels:
None

Description

include/wait_until_connected_again.inc has the following loop:

while ($mysql_errno)

  --error 0,ER_ACCESS_DENIED_ERROR,ER_SERVER_SHUTDOWN,ER_CONNECTION_KILLED,ER_LOCK_WAIT_TIMEOUT,2002,2006,2013

  show status;

  dec $counter;

  if (!$counter)

    --die Server failed to restart

  -- <cut out for brevity...> --

  --sleep 0.1

Note that it sleeps even if $mysql_errno is 0. monty tried to fix it in commit 9ec2129f715 in 10.6. This broke some tests, for example main.events_restart.

What happens is, the test restarts the server with --server-id=100. mtr goes into sleep_until_file_created() function that does

  for ( my $loop= 1; $loop <= $loops; $loop++ )

    if ( -r $pidfile ) { return 1; }

    my $seconds= ($loop * $sleeptime) / 1000;

    if ( defined $proc and ! $proc->wait_one(0) )

      mtr_warning("Process $proc died after mysql-test-run waited $seconds " .

		  "seconds for $pidfile to be created.");

      return 0;

    mtr_debug("Sleep $sleeptime milliseconds waiting for $pidfile");

    mtr_milli_sleep($sleeptime);

(also reformatted for brevity). This function verifies that there's no pid file yet, and that the process exists wait_one(0). Then it waits. While it waits, include/wait_until_connected_again.inc notices that the server is up, passes the control back to the test, the test verifies that server_id is 100 and event scheduler is up, and wants to restart the server with the default --server-id=1. The test stops the server, mtr comes back from the sleep, notices that there's no pid file and no process — and reports that "Process ... died".

This is how it could be fixed. Before restarting the server the test creates an .expect file. mtr reads it to know it's an expected server shutdown, and not an error. If mtr will delete the .expect file immediately after reading it, then sleep_until_file_created() could check, if this file exists, it is an expected next shutdown, not a failure of the previous one.

Note, that while the 10.6 commit above removes sleep 0.1 from include/wait_until_connected_again.inc, it adds it to include/restart_mysqld.inc precisely to workaround this bug. After fixing this bug, please, remove that sleep. Also, that commit reduces the sleep duration in mtr, this could be reverted too if it takes too much CPU.

Attachments

Activity

Ascending order - Click to sort in descending order

Anel Husakovic added a comment - 2021-05-28 15:23 - edited

serg, elenst based on slack comment (usage of repeat ) I couldn't reproduce the failure of events_restart

$ ./mysql-test/mtr events_restart --repeat=100 --mem

main.events_restart                      [ 98 pass ]   7821

main.events_restart                      [ 99 pass ]   7809

main.events_restart                      [ 100 pass ]   7837

--------------------------------------------------------------------------

The servers were restarted 0 times

Spent 783.427 of 802 seconds executing testcases

Completed: All 100 tests were successful.

Also couldn't find in buildbot 10.6 from commit 9ec2129f715c onwards as well failure in events_restart.

However serg based on your description (thank you very much for detailed explanation ) I tried to create the patch: a09469c2137
I tested locally only main suite:

The servers were restarted 212 times

Spent 1370.866 of 239 seconds executing testcases

Completed: All 1039 tests were successful.

86 tests were skipped, 66 by the test itself.

Anel Husakovic added a comment - 2021-05-28 15:23 - edited serg , elenst based on slack comment (usage of repeat ) I couldn't reproduce the failure of events_restart $ ./mysql-test/mtr events_restart --repeat=100 --mem main.events_restart [ 98 pass ] 7821 main.events_restart [ 99 pass ] 7809 main.events_restart [ 100 pass ] 7837 -------------------------------------------------------------------------- The servers were restarted 0 times Spent 783.427 of 802 seconds executing testcases Completed: All 100 tests were successful. Also couldn't find in buildbot 10.6 from commit 9ec2129f715c onwards as well failure in events_restart . However serg based on your description (thank you very much for detailed explanation ) I tried to create the patch: a09469c2137 I tested locally only main suite: The servers were restarted 212 times Spent 1370.866 of 239 seconds executing testcases Completed: All 1039 tests were successful. 86 tests were skipped, 66 by the test itself.

Sergei Golubchik added a comment - 2021-06-05 19:17

Let's not do a blind fix, if we can avoid it

To repeat on 10.6, see mysql-test/include/restart_mysqld.inc — at the end it has a sleep, precisely to work around this bug. If you remove the sleep, you should be able to repeat the bug. And then you should be able to fix it properly, without introducing sleeps that make restarts slower.

Sergei Golubchik added a comment - 2021-06-05 19:17 Let's not do a blind fix, if we can avoid it To repeat on 10.6, see mysql-test/include/restart_mysqld.inc — at the end it has a sleep, precisely to work around this bug. If you remove the sleep, you should be able to repeat the bug. And then you should be able to fix it properly, without introducing sleeps that make restarts slower.

Sergei Golubchik added a comment - 2021-07-06 14:09

There were tests regularly failing in buildbot because of that, for example, innodb.log_corruption. Your fix in ~~MDEV-25857~~ helped to show the test name, but the test was still failing. So we need to have this fixed.

Sergei Golubchik added a comment - 2021-07-06 14:09 There were tests regularly failing in buildbot because of that, for example, innodb.log_corruption. Your fix in MDEV-25857 helped to show the test name, but the test was still failing. So we need to have this fixed.

People

Assignee:: Sergei Golubchik

Reporter:: Sergei Golubchik

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2021-05-27 15:07

Updated:: 2021-07-06 15:49

Resolved:: 2021-07-06 15:49

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server