[MDEV-21120] Server hangs on shutdown in MYSQL_BIN_LOG::stop_background_thread Created: 2019-11-21  Updated: 2023-04-27

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2, 10.3, 10.4
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File binlog_hang_threads_full_10.4    

 Description   

The test runs FLUSH LOGS in one thread as frequently as it can, and restarts the server every 30 seconds. Soon enough, the restart fails because the server gets stuck upon shutdown seemingly forever:

10.4 6cedb671

Thread 1 (Thread 0x7f5ff9dff740 (LWP 28877)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000562f02585018 in safe_cond_wait (cond=0x562f0333bef0 <mysql_bin_log+4304>, mp=0x562f0333be08 <mysql_bin_log+4072>, file=0x562f02838d90 "/data/src/10.4/include/mysql/psi/mysql_thread.h", line=1178) at /data/src/10.4/mysys/thr_mutex.c:492
#2  0x0000562f01e08de2 in inline_mysql_cond_wait (that=0x562f0333bef0 <mysql_bin_log+4304>, mutex=0x562f0333be08 <mysql_bin_log+4072>, src_file=0x562f0283991c "/data/src/10.4/sql/log.cc", src_line=3284) at /data/src/10.4/include/mysql/psi/mysql_thread.h:1178
#3  0x0000562f01e12ea4 in MYSQL_BIN_LOG::stop_background_thread (this=0x562f0333ae20 <mysql_bin_log>) at /data/src/10.4/sql/log.cc:3284
#4  0x0000562f01846bd3 in close_connections () at /data/src/10.4/sql/mysqld.cc:1757
#5  0x0000562f0184fa71 in mysqld_main (argc=15, argv=0x562f048f9a48) at /data/src/10.4/sql/mysqld.cc:5915
#6  0x0000562f01843760 in main (argc=15, argv=0x7ffeb5bcf8e8) at /data/src/10.4/sql/main.cc:25

All threads' full stack trace is attached.

2019-11-21 21:47:34 0 [Note] /data/bld/10.4-debug-nightly/bin/mysqld: ready for connections.
Version: '10.4.11-MariaDB-debug-log'  socket: '/dev/shm/vardir_mdev21120/mysql.sock'  port: 19300  Source distribution
2019-11-21 21:48:04 0 [Note] /data/bld/10.4-debug-nightly/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2019-11-21 21:48:04 0 [Note] InnoDB: FTS optimize thread exiting.
2019-11-21 21:48:04 0 [Note] Event Scheduler: Purging the queue. 0 events

To reproduce:

git clone https://github.com/MariaDB/randgen --branch mdev21120 rqg-mdev21120
cd rqg-mdev21120
perl ./runall-new.pl --basedir=<basedir> --grammar=mdev21120.yy --duration=350 --threads=1 --skip-gendata --scenario=Restart --scenario-shutdown-timeout=600 --mysqld=--log-bin --vardir=/dev/shm/vardir_mdev21120

Make sure to use the RQG repo and branch as suggested above, other forks might not have proper instrumentation. Remember to set the basedir in the command above

The indication of the reported failure is that it says

# 2019-11-21T21:48:06 [30854] GenTest: child 30854 is being stopped with status STATUS_OK
# 2019-11-21T21:48:06 [30850] For pid 30854 reporter status STATUS_OK
# 2019-11-21T21:48:06 [30850] Kill GenTest::ErrorFilter(30853)
# 2019-11-21T21:48:06 [30850] Test completed successfully.
# 2019-11-21T21:48:06 [30850] GenTest will exit with exit status STATUS_OK (0)

and stays this way for long enough time, without either proceeding or returning the prompt. Eventually, after scenario-shutdown-timeout seconds (configurable, set to 10 min in the command line above), it will SIGABRT the server and exit with STATUS_TEST_FAILURE.

If you interrupt the test while it is still waiting, remember also to kill the server explicitly, killing the perl processes either by Ctrl-C or by any signal in this case won't have effect on the underlying hanging server.

Normally, when it succeeds (finishes without hanging), it will say at the end

# 2019-11-21T21:18:12 [21604] #-------------------------#
# 2019-11-21T21:18:12 [21604] #-- Stopping the server --#
# 2019-11-21T21:18:12 [21604] #-------------------------#
# 2019-11-21T21:18:12 [21604] Stopping server on port 19300
# 2019-11-21T21:18:14 [21604] Server has been stopped
# 2019-11-21T21:18:14 [21604] Success
# 2019-11-21T21:18:14 [21604] [21604] ./runall-new.pl will exit with exit status STATUS_OK (0)

and will return the prompt.

The test may be saying something like this in the output while it's running:

# 2019-11-21T21:17:58 [21604] #-- Stopping the server --#
# 2019-11-21T21:17:58 [21604] #-------------------------#
# 2019-11-21T21:17:58 [21604] Stopping server on port 10610
# 2019-11-21T21:17:58 [23564][ERROR] Executor::MySQL::execute: Failed to reconnect after getting STATUS_SERVER_KILLED
# 2019-11-21T21:17:58 [23564] Executor::MySQL::execute: Query: FLUSH LOGS /* QNO 1410 CON_ID 7 */  failed: 1053 Server shutdown in progress (STATUS_SERVER_KILLED)
# 2019-11-21T21:17:58 [23564] Mixer: Server crash or critical failure 1053 (STATUS_SERVER_KILLED) reported at dsn dbi:mysql:host=127.0.0.1:port=10610:user=root:database=test:mysql_local_infile=1
# 2019-11-21T21:17:58 [23564] GenTest: Server crash or critical failure (STATUS_SERVER_KILLED) reported, the child will be stopped
# 2019-11-21T21:17:58 [23564] GenTest: child 23564 is being stopped with status STATUS_SERVER_KILLED
# 2019-11-21T21:17:58 [23557] Process with pid 23564 ended with status STATUS_SERVER_KILLED
# 2019-11-21T21:17:59 [23557] Killing periodic reporting process with pid 23563...
# 2019-11-21T21:17:59 [23563] GenTest: child 23563 is being stopped with status STATUS_OK
# 2019-11-21T21:17:59 [23557] For pid 23563 reporter status STATUS_OK
# 2019-11-21T21:17:59 [23557] Kill GenTest::ErrorFilter(23562)
# 2019-11-21T21:17:59 [23557] Test completed successfully.
# 2019-11-21T21:17:59 [23557] GenTest will exit with exit status STATUS_OK (0)
# 2019-11-21T21:18:00 [21604] Server has been stopped
# 2019-11-21T21:18:00 [21604] #-------------------------------------------------------------#
# 2019-11-21T21:18:00 [21604] #-- Checking the server log for fatal errors after stopping --#
# 2019-11-21T21:18:00 [21604] #-------------------------------------------------------------#
# 2019-11-21T21:18:00 [21604] Checking server log for important errors starting from the beginning
# 2019-11-21T21:18:00 [21604] #---------------------------#
# 2019-11-21T21:18:00 [21604] #-- Restarting the server --#
# 2019-11-21T21:18:00 [21604] #---------------------------#

This is not an indication of a failure, it is just how it deals with a periodic restart.

Currently it is reproducible reliably for me on 10.2-10.5. Couldn't reproduce on 10.1.



 Comments   
Comment by Andrei Elkin [ 2021-06-28 ]

The description rather boldly hints at the fixed MDEV-24526 that corrects FLUSH LOGS vs binlog checkpoint notifications, which may have missed in the description case.

Generated at Thu Feb 08 09:04:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.