Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21120

Server hangs on shutdown in MYSQL_BIN_LOG::stop_background_thread

Details

    • Bug
    • Status: Stalled (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.2(EOL), 10.3(EOL), 10.4(EOL)
    • 10.4(EOL)
    • Replication
    • None

    Description

      The test runs FLUSH LOGS in one thread as frequently as it can, and restarts the server every 30 seconds. Soon enough, the restart fails because the server gets stuck upon shutdown seemingly forever:

      10.4 6cedb671

      Thread 1 (Thread 0x7f5ff9dff740 (LWP 28877)):
      #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
      #1  0x0000562f02585018 in safe_cond_wait (cond=0x562f0333bef0 <mysql_bin_log+4304>, mp=0x562f0333be08 <mysql_bin_log+4072>, file=0x562f02838d90 "/data/src/10.4/include/mysql/psi/mysql_thread.h", line=1178) at /data/src/10.4/mysys/thr_mutex.c:492
      #2  0x0000562f01e08de2 in inline_mysql_cond_wait (that=0x562f0333bef0 <mysql_bin_log+4304>, mutex=0x562f0333be08 <mysql_bin_log+4072>, src_file=0x562f0283991c "/data/src/10.4/sql/log.cc", src_line=3284) at /data/src/10.4/include/mysql/psi/mysql_thread.h:1178
      #3  0x0000562f01e12ea4 in MYSQL_BIN_LOG::stop_background_thread (this=0x562f0333ae20 <mysql_bin_log>) at /data/src/10.4/sql/log.cc:3284
      #4  0x0000562f01846bd3 in close_connections () at /data/src/10.4/sql/mysqld.cc:1757
      #5  0x0000562f0184fa71 in mysqld_main (argc=15, argv=0x562f048f9a48) at /data/src/10.4/sql/mysqld.cc:5915
      #6  0x0000562f01843760 in main (argc=15, argv=0x7ffeb5bcf8e8) at /data/src/10.4/sql/main.cc:25
      

      All threads' full stack trace is attached.

      2019-11-21 21:47:34 0 [Note] /data/bld/10.4-debug-nightly/bin/mysqld: ready for connections.
      Version: '10.4.11-MariaDB-debug-log'  socket: '/dev/shm/vardir_mdev21120/mysql.sock'  port: 19300  Source distribution
      2019-11-21 21:48:04 0 [Note] /data/bld/10.4-debug-nightly/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
      2019-11-21 21:48:04 0 [Note] InnoDB: FTS optimize thread exiting.
      2019-11-21 21:48:04 0 [Note] Event Scheduler: Purging the queue. 0 events
      

      To reproduce:

      git clone https://github.com/MariaDB/randgen --branch mdev21120 rqg-mdev21120
      cd rqg-mdev21120
      perl ./runall-new.pl --basedir=<basedir> --grammar=mdev21120.yy --duration=350 --threads=1 --skip-gendata --scenario=Restart --scenario-shutdown-timeout=600 --mysqld=--log-bin --vardir=/dev/shm/vardir_mdev21120
      

      Make sure to use the RQG repo and branch as suggested above, other forks might not have proper instrumentation. Remember to set the basedir in the command above

      The indication of the reported failure is that it says

      # 2019-11-21T21:48:06 [30854] GenTest: child 30854 is being stopped with status STATUS_OK
      # 2019-11-21T21:48:06 [30850] For pid 30854 reporter status STATUS_OK
      # 2019-11-21T21:48:06 [30850] Kill GenTest::ErrorFilter(30853)
      # 2019-11-21T21:48:06 [30850] Test completed successfully.
      # 2019-11-21T21:48:06 [30850] GenTest will exit with exit status STATUS_OK (0)
      

      and stays this way for long enough time, without either proceeding or returning the prompt. Eventually, after scenario-shutdown-timeout seconds (configurable, set to 10 min in the command line above), it will SIGABRT the server and exit with STATUS_TEST_FAILURE.

      If you interrupt the test while it is still waiting, remember also to kill the server explicitly, killing the perl processes either by Ctrl-C or by any signal in this case won't have effect on the underlying hanging server.

      Normally, when it succeeds (finishes without hanging), it will say at the end

      # 2019-11-21T21:18:12 [21604] #-------------------------#
      # 2019-11-21T21:18:12 [21604] #-- Stopping the server --#
      # 2019-11-21T21:18:12 [21604] #-------------------------#
      # 2019-11-21T21:18:12 [21604] Stopping server on port 19300
      # 2019-11-21T21:18:14 [21604] Server has been stopped
      # 2019-11-21T21:18:14 [21604] Success
      # 2019-11-21T21:18:14 [21604] [21604] ./runall-new.pl will exit with exit status STATUS_OK (0)
      

      and will return the prompt.

      The test may be saying something like this in the output while it's running:

      # 2019-11-21T21:17:58 [21604] #-- Stopping the server --#
      # 2019-11-21T21:17:58 [21604] #-------------------------#
      # 2019-11-21T21:17:58 [21604] Stopping server on port 10610
      # 2019-11-21T21:17:58 [23564][ERROR] Executor::MySQL::execute: Failed to reconnect after getting STATUS_SERVER_KILLED
      # 2019-11-21T21:17:58 [23564] Executor::MySQL::execute: Query: FLUSH LOGS /* QNO 1410 CON_ID 7 */  failed: 1053 Server shutdown in progress (STATUS_SERVER_KILLED)
      # 2019-11-21T21:17:58 [23564] Mixer: Server crash or critical failure 1053 (STATUS_SERVER_KILLED) reported at dsn dbi:mysql:host=127.0.0.1:port=10610:user=root:database=test:mysql_local_infile=1
      # 2019-11-21T21:17:58 [23564] GenTest: Server crash or critical failure (STATUS_SERVER_KILLED) reported, the child will be stopped
      # 2019-11-21T21:17:58 [23564] GenTest: child 23564 is being stopped with status STATUS_SERVER_KILLED
      # 2019-11-21T21:17:58 [23557] Process with pid 23564 ended with status STATUS_SERVER_KILLED
      # 2019-11-21T21:17:59 [23557] Killing periodic reporting process with pid 23563...
      # 2019-11-21T21:17:59 [23563] GenTest: child 23563 is being stopped with status STATUS_OK
      # 2019-11-21T21:17:59 [23557] For pid 23563 reporter status STATUS_OK
      # 2019-11-21T21:17:59 [23557] Kill GenTest::ErrorFilter(23562)
      # 2019-11-21T21:17:59 [23557] Test completed successfully.
      # 2019-11-21T21:17:59 [23557] GenTest will exit with exit status STATUS_OK (0)
      # 2019-11-21T21:18:00 [21604] Server has been stopped
      # 2019-11-21T21:18:00 [21604] #-------------------------------------------------------------#
      # 2019-11-21T21:18:00 [21604] #-- Checking the server log for fatal errors after stopping --#
      # 2019-11-21T21:18:00 [21604] #-------------------------------------------------------------#
      # 2019-11-21T21:18:00 [21604] Checking server log for important errors starting from the beginning
      # 2019-11-21T21:18:00 [21604] #---------------------------#
      # 2019-11-21T21:18:00 [21604] #-- Restarting the server --#
      # 2019-11-21T21:18:00 [21604] #---------------------------#
      

      This is not an indication of a failure, it is just how it deals with a periodic restart.

      Currently it is reproducible reliably for me on 10.2-10.5. Couldn't reproduce on 10.1.

      Attachments

        Activity

          Elkin Andrei Elkin added a comment -

          The description rather boldly hints at the fixed MDEV-24526 that corrects FLUSH LOGS vs binlog checkpoint notifications, which may have missed in the description case.

          Elkin Andrei Elkin added a comment - The description rather boldly hints at the fixed MDEV-24526 that corrects FLUSH LOGS vs binlog checkpoint notifications, which may have missed in the description case.

          I was able to reproduce as described in old 10.4 versions with and without MDEV-24526:

          commit 0d267f7caa1421f9e2b22412694546e499845488
          Author: Marko Mäkelä <marko.makela@mariadb.com>
          Date:   Thu Apr 22 07:36:04 2021 +0300
           
              MDEV-25362 after-merge fix: Remove unnecessary code
          

          commit ee455e6f2e6c9cda921c0801210786123d9f8b95
          Merge: 0d267f7caa1 6f271302b64
          Author: Marko Mäkelä <marko.makela@mariadb.com>
          Date:   Thu Apr 22 07:51:33 2021 +0300
           
              Merge 10.3 into 10.4
          

          But the problem seems not reproducible in latest 10.5. I'm not sure exactly what fixed the problem, I do vaguely recall there were fixes in the binlog background thread, but not sure.

          elenst What do you say, should we close it given that I could reproduce in old 10.4 but not in latest 10.5?

          knielsen Kristian Nielsen added a comment - I was able to reproduce as described in old 10.4 versions with and without MDEV-24526 : commit 0d267f7caa1421f9e2b22412694546e499845488 Author: Marko Mäkelä <marko.makela@mariadb.com> Date: Thu Apr 22 07:36:04 2021 +0300   MDEV-25362 after-merge fix: Remove unnecessary code commit ee455e6f2e6c9cda921c0801210786123d9f8b95 Merge: 0d267f7caa1 6f271302b64 Author: Marko Mäkelä <marko.makela@mariadb.com> Date: Thu Apr 22 07:51:33 2021 +0300   Merge 10.3 into 10.4 But the problem seems not reproducible in latest 10.5. I'm not sure exactly what fixed the problem, I do vaguely recall there were fixes in the binlog background thread, but not sure. elenst What do you say, should we close it given that I could reproduce in old 10.4 but not in latest 10.5?

          Thanks for checking, I'll take this one.
          It is not easy to bisect, given the absence of a decent test case, but I have a fair amount of unrecognized "hangs or very slow upon shutdown" problems in my tests still, so I want to double-check it's not one of them before closing.

          elenst Elena Stepanova added a comment - Thanks for checking, I'll take this one. It is not easy to bisect, given the absence of a decent test case, but I have a fair amount of unrecognized "hangs or very slow upon shutdown" problems in my tests still, so I want to double-check it's not one of them before closing.

          Thanks, Elena! I agree, just because it doesn't reproduce for me in 10.5 doesn't mean the bug is fixed.
          Do let me know if you find any candidate commit that seems to fix (or hide) the bug and I will take a look. I'm also very interested in tracking down any hangs.

          The latest commit I saw the failure in is ee455e6f2e6c9cda921c0801210786123d9f8b95 and the commit I saw passing is dba906f26e2af533749261eaa8b4ab0ea404aecf (which is bb-10.5-knielsen, but you can probably use fdb6db6b47f1825eabffde76c29d9b94545f1ef4 instead).

          Note I had to hack a few things to make different old versions build / new versions work with that rqg branch.

          knielsen Kristian Nielsen added a comment - Thanks, Elena! I agree, just because it doesn't reproduce for me in 10.5 doesn't mean the bug is fixed. Do let me know if you find any candidate commit that seems to fix (or hide) the bug and I will take a look. I'm also very interested in tracking down any hangs. The latest commit I saw the failure in is ee455e6f2e6c9cda921c0801210786123d9f8b95 and the commit I saw passing is dba906f26e2af533749261eaa8b4ab0ea404aecf (which is bb-10.5-knielsen, but you can probably use fdb6db6b47f1825eabffde76c29d9b94545f1ef4 instead). Note I had to hack a few things to make different old versions build / new versions work with that rqg branch.

          People

            elenst Elena Stepanova
            elenst Elena Stepanova
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.