Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21120

Server hangs on shutdown in MYSQL_BIN_LOG::stop_background_thread

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.2, 10.3, 10.4
    • 10.4
    • Replication
    • None

    Description

      The test runs FLUSH LOGS in one thread as frequently as it can, and restarts the server every 30 seconds. Soon enough, the restart fails because the server gets stuck upon shutdown seemingly forever:

      10.4 6cedb671

      Thread 1 (Thread 0x7f5ff9dff740 (LWP 28877)):
      #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
      #1  0x0000562f02585018 in safe_cond_wait (cond=0x562f0333bef0 <mysql_bin_log+4304>, mp=0x562f0333be08 <mysql_bin_log+4072>, file=0x562f02838d90 "/data/src/10.4/include/mysql/psi/mysql_thread.h", line=1178) at /data/src/10.4/mysys/thr_mutex.c:492
      #2  0x0000562f01e08de2 in inline_mysql_cond_wait (that=0x562f0333bef0 <mysql_bin_log+4304>, mutex=0x562f0333be08 <mysql_bin_log+4072>, src_file=0x562f0283991c "/data/src/10.4/sql/log.cc", src_line=3284) at /data/src/10.4/include/mysql/psi/mysql_thread.h:1178
      #3  0x0000562f01e12ea4 in MYSQL_BIN_LOG::stop_background_thread (this=0x562f0333ae20 <mysql_bin_log>) at /data/src/10.4/sql/log.cc:3284
      #4  0x0000562f01846bd3 in close_connections () at /data/src/10.4/sql/mysqld.cc:1757
      #5  0x0000562f0184fa71 in mysqld_main (argc=15, argv=0x562f048f9a48) at /data/src/10.4/sql/mysqld.cc:5915
      #6  0x0000562f01843760 in main (argc=15, argv=0x7ffeb5bcf8e8) at /data/src/10.4/sql/main.cc:25
      

      All threads' full stack trace is attached.

      2019-11-21 21:47:34 0 [Note] /data/bld/10.4-debug-nightly/bin/mysqld: ready for connections.
      Version: '10.4.11-MariaDB-debug-log'  socket: '/dev/shm/vardir_mdev21120/mysql.sock'  port: 19300  Source distribution
      2019-11-21 21:48:04 0 [Note] /data/bld/10.4-debug-nightly/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
      2019-11-21 21:48:04 0 [Note] InnoDB: FTS optimize thread exiting.
      2019-11-21 21:48:04 0 [Note] Event Scheduler: Purging the queue. 0 events
      

      To reproduce:

      git clone https://github.com/MariaDB/randgen --branch mdev21120 rqg-mdev21120
      cd rqg-mdev21120
      perl ./runall-new.pl --basedir=<basedir> --grammar=mdev21120.yy --duration=350 --threads=1 --skip-gendata --scenario=Restart --scenario-shutdown-timeout=600 --mysqld=--log-bin --vardir=/dev/shm/vardir_mdev21120
      

      Make sure to use the RQG repo and branch as suggested above, other forks might not have proper instrumentation. Remember to set the basedir in the command above

      The indication of the reported failure is that it says

      # 2019-11-21T21:48:06 [30854] GenTest: child 30854 is being stopped with status STATUS_OK
      # 2019-11-21T21:48:06 [30850] For pid 30854 reporter status STATUS_OK
      # 2019-11-21T21:48:06 [30850] Kill GenTest::ErrorFilter(30853)
      # 2019-11-21T21:48:06 [30850] Test completed successfully.
      # 2019-11-21T21:48:06 [30850] GenTest will exit with exit status STATUS_OK (0)
      

      and stays this way for long enough time, without either proceeding or returning the prompt. Eventually, after scenario-shutdown-timeout seconds (configurable, set to 10 min in the command line above), it will SIGABRT the server and exit with STATUS_TEST_FAILURE.

      If you interrupt the test while it is still waiting, remember also to kill the server explicitly, killing the perl processes either by Ctrl-C or by any signal in this case won't have effect on the underlying hanging server.

      Normally, when it succeeds (finishes without hanging), it will say at the end

      # 2019-11-21T21:18:12 [21604] #-------------------------#
      # 2019-11-21T21:18:12 [21604] #-- Stopping the server --#
      # 2019-11-21T21:18:12 [21604] #-------------------------#
      # 2019-11-21T21:18:12 [21604] Stopping server on port 19300
      # 2019-11-21T21:18:14 [21604] Server has been stopped
      # 2019-11-21T21:18:14 [21604] Success
      # 2019-11-21T21:18:14 [21604] [21604] ./runall-new.pl will exit with exit status STATUS_OK (0)
      

      and will return the prompt.

      The test may be saying something like this in the output while it's running:

      # 2019-11-21T21:17:58 [21604] #-- Stopping the server --#
      # 2019-11-21T21:17:58 [21604] #-------------------------#
      # 2019-11-21T21:17:58 [21604] Stopping server on port 10610
      # 2019-11-21T21:17:58 [23564][ERROR] Executor::MySQL::execute: Failed to reconnect after getting STATUS_SERVER_KILLED
      # 2019-11-21T21:17:58 [23564] Executor::MySQL::execute: Query: FLUSH LOGS /* QNO 1410 CON_ID 7 */  failed: 1053 Server shutdown in progress (STATUS_SERVER_KILLED)
      # 2019-11-21T21:17:58 [23564] Mixer: Server crash or critical failure 1053 (STATUS_SERVER_KILLED) reported at dsn dbi:mysql:host=127.0.0.1:port=10610:user=root:database=test:mysql_local_infile=1
      # 2019-11-21T21:17:58 [23564] GenTest: Server crash or critical failure (STATUS_SERVER_KILLED) reported, the child will be stopped
      # 2019-11-21T21:17:58 [23564] GenTest: child 23564 is being stopped with status STATUS_SERVER_KILLED
      # 2019-11-21T21:17:58 [23557] Process with pid 23564 ended with status STATUS_SERVER_KILLED
      # 2019-11-21T21:17:59 [23557] Killing periodic reporting process with pid 23563...
      # 2019-11-21T21:17:59 [23563] GenTest: child 23563 is being stopped with status STATUS_OK
      # 2019-11-21T21:17:59 [23557] For pid 23563 reporter status STATUS_OK
      # 2019-11-21T21:17:59 [23557] Kill GenTest::ErrorFilter(23562)
      # 2019-11-21T21:17:59 [23557] Test completed successfully.
      # 2019-11-21T21:17:59 [23557] GenTest will exit with exit status STATUS_OK (0)
      # 2019-11-21T21:18:00 [21604] Server has been stopped
      # 2019-11-21T21:18:00 [21604] #-------------------------------------------------------------#
      # 2019-11-21T21:18:00 [21604] #-- Checking the server log for fatal errors after stopping --#
      # 2019-11-21T21:18:00 [21604] #-------------------------------------------------------------#
      # 2019-11-21T21:18:00 [21604] Checking server log for important errors starting from the beginning
      # 2019-11-21T21:18:00 [21604] #---------------------------#
      # 2019-11-21T21:18:00 [21604] #-- Restarting the server --#
      # 2019-11-21T21:18:00 [21604] #---------------------------#
      

      This is not an indication of a failure, it is just how it deals with a periodic restart.

      Currently it is reproducible reliably for me on 10.2-10.5. Couldn't reproduce on 10.1.

      Attachments

        Activity

          People

            Elkin Andrei Elkin
            elenst Elena Stepanova
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.