[MDEV-21707] Mariadb doesn't exit correctly Created: 2020-02-11  Updated: 2023-04-20  Resolved: 2023-04-20

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.4.10, 10.4.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Eugene Assignee: Jan Lindström
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Linux 4.19.102-gentoo x86_64 AMD EPYC 7451


Attachments: File my.cnf     File shutdown-mysqld.err    
Issue Links:
Relates
relates to MDEV-22116 Not able to shutdown MariaDB after up... Closed

 Description   

Attempt to stop mysqld with signal 15 fails for members of galera cluster.
Configuration is: 3 identical servers running galera nodes. All have similar config (attached).

When signal 15 is sent to mysqld process, mysql writes down /var/lib/mysql/grastate.dat filling in all the fields, for example:

# GALERA saved state
version: 2.1
uuid:    0b58193e-****-****-****-b2ddbb52b5f6
seqno:   37261
safe_to_bootstrap: 1

then declines any attempts to connect but still remains present in memory and not exiting:

# ps aux|grep 'sbin/mysqld'
mysql    50683  0.5  0.6 68765724 24362700 ?    Sl   13:57   0:24 /usr/sbin/mysqld --defaults-file=/etc/mysql/my.cnf
 
# strace -p 50683|head -n 20
strace: Process 50683 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)

This can last for days. Mysqld refuses signal 15 and if killed finally with signal 9, it fails to recover from binary log, like:

[ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.

This always ends with SST.

In mentioned hung state, mysqld doesn't perform any IO operations, used memory size remains constant, seems that process runs some infinite loop, but CPU is also not used.

This happens on cluster with gigabytes of data, but this also was found on newly installed cluster with no data except default database.

Expected behavior was - mysqld exits on signal 15 in reasonable time with flushing cached data and closing files first.



 Comments   
Comment by Eugene [ 2020-02-11 ]

Another example: node wrote /var/lib/mysql/grastate.dat file:

# GALERA saved state
version: 2.1
uuid:    be6fbe1e-****-****-****-bb328ef5c238
seqno:   138155645
safe_to_bootstrap: 0

and mysqld process doesn't exit:

# strace -p 35454
strace: Process 35454 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdedabf1b0) = 0

and so on...

Comment by Eugene [ 2020-03-17 ]

Please find some additional details.
Same situation: have to shut down the node, thus sending SIGTERM to mysql PID. Status immedeately after that:

# netstat -anp|grep mysql
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::13316                :::*                    LISTEN      16948/mysqld        
tcp6     901      0 :::3306                 :::*                    LISTEN      16948/mysqld    
Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
unix  2      [ ACC ]     STREAM     LISTENING     51400219 16948/mysqld         /var/run/mysqld/mysqld.sock
unix  2      [ ]         STREAM     CONNECTED     51221906 16948/mysqld         
unix  2      [ ]         STREAM     CONNECTING    0        -                    /var/run/mysqld/mysqld.sock

a bit later

# netstat -anp|grep mysql
unix  2      [ ]         STREAM     CONNECTED     51221906 16948/mysqld         

5 mins later

# netstat -anp|grep mysql
unix  2      [ ]         STREAM     CONNECTED     51221906 16948/mysqld         

and this socket connection never get closed.

Comment by Eugene [ 2020-03-17 ]

Next attempt, shutdown of another node having very same configuration (hanging in same state, open socket connection with no socket file name listed):

lsof / | egrep -i 'mysqld'
COMMAND     PID   USER   FD   TYPE DEVICE  SIZE/OFF    NODE NAME
mysqld    11396    mysql  rtd    DIR  9,127         296       128 /
mysqld    11396    mysql  txt    REG  9,127    22728336   3497072 /usr/sbin/mysqld
mysqld    11396    mysql  mem    REG  9,127       34976 269009536 /lib64/librt-2.29.so
mysqld    11396    mysql  mem    REG  9,127     2621776 806632409 /usr/lib64/galera/libgalera_smm.so
mysqld    11396    mysql  mem    REG  9,127    10406312 271192294 /var/lib/sss/mc/initgroups
mysqld    11396    mysql  mem    REG  9,127       42928 269009306 /lib64/libnss_sss.so.2
mysqld    11396    mysql  mem    REG  9,127       50984 268468382 /lib64/libnss_files-2.29.so
mysqld    11396    mysql  mem    REG  9,127     1913840 269009513 /lib64/libc-2.29.so
mysqld    11396    mysql  mem    REG  9,127      100064 541632938 /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/libgcc_s.so.1
mysqld    11396    mysql  mem    REG  9,127     1349416 268504737 /lib64/libm-2.29.so
mysqld    11396    mysql  mem    REG  9,127     2597000 543603860 /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/libstdc++.so.6.0.27
mysqld    11396    mysql  mem    REG  9,127       18216 269009309 /lib64/libdl-2.29.so
mysqld    11396    mysql  mem    REG  9,127     2867848   6571647 /usr/lib64/libcrypto.so.1.1
mysqld    11396    mysql  mem    REG  9,127      593544   4328082 /usr/lib64/libssl.so.1.1
mysqld    11396    mysql  mem    REG  9,127      167384 268972675 /lib64/libpthread-2.29.so
mysqld    11396    mysql  mem    REG  9,127       42792 269009311 /lib64/libcrypt-2.29.so
mysqld    11396    mysql  mem    REG  9,127      464640 268436074 /lib64/libpcre.so.1.2.10
mysqld    11396    mysql  mem    REG  9,127     1259592  54853789 /usr/lib64/libbfd-2.33.1.gentoo-sys-libs-binutils-libs-st-def.so
mysqld    11396    mysql  mem    REG  9,127      100128 269008129 /lib64/libz.so.1.2.11
mysqld    11396    mysql  mem    REG  9,127       51256   3294168 /usr/lib64/libnuma.so.1.0.0
mysqld    11396    mysql  mem    REG  9,127       13968 270111117 /lib64/libaio.so.1.0.1
mysqld    11396    mysql  mem    REG  9,127       38704   3592968 /usr/lib64/libsnappy.so.1.1.8
mysqld    11396    mysql  mem    REG  9,127       82672 268436257 /lib64/libbz2.so.1.0.6
mysqld    11396    mysql  mem    REG  9,127      161568 268436133 /lib64/liblzma.so.5.2.4
mysqld    11396    mysql  mem    REG  9,127      141016 271367672 /lib64/liblzo2.so.2.0.0
mysqld    11396    mysql  mem    REG  9,127      132824   6017616 /usr/lib64/liblz4.so.1.9.2
mysqld    11396    mysql  mem    REG  9,127      169280 269008849 /lib64/ld-2.29.so
mysqld    11396    mysql    1w   REG  9,127       28657 537938587 /var/log/mysql/mysqld.err
mysqld    11396    mysql    2w   REG  9,127       28657 537938587 /var/log/mysql/mysqld.err
mysqld    11396    mysql    3r   REG  9,127    10406312 271192294 /var/lib/sss/mc/initgroups
mysqld    11396    mysql   22u   REG  9,127           0   3340215 /tmp/ibgGj6cl (deleted)
mysqld    11396    mysql   23u   REG  9,127           0   3371336 /tmp/ibpgHyo4 (deleted)
mysqld    11396    mysql   24u   REG  9,127           0   3372109 /tmp/ib48BNsV (deleted)
mysqld    11396    mysql   28u   REG  9,127           0   3412306 /tmp/ibAmxs9N (deleted)
mysqld    11396    mysql  156w   REG  9,127       87767 549418425 /var/log/mysql/slow_queries.log

ls -lha /proc/PID/fd looks as follows:

dr-x------ 2 root  root   0 Mar 14 09:07 .
dr-xr-xr-x 9 mysql mysql  0 Mar 14 09:07 ..
lr-x------ 1 root  root  64 Mar 14 09:07 0 -> /dev/null
l-wx------ 1 root  root  64 Mar 14 09:07 1 -> /var/log/mysql/mysqld.err
l-wx------ 1 root  root  64 Mar 14 09:07 2 -> /var/log/mysql/mysqld.err
lrwx------ 1 root  root  64 Mar 14 09:07 4 -> socket:[280875441]
l-wx------ 1 root  root  64 Mar 14 15:23 156 -> /var/log/mysql/slow_queries.log
lrwx------ 1 root  root  64 Mar 14 15:22 28 -> /tmp/ibAmxs9N (deleted)
lrwx------ 1 root  root  64 Mar 14 15:22 22 -> /tmp/ibgGj6cl (deleted)
lrwx------ 1 root  root  64 Mar 14 15:22 23 -> /tmp/ibpgHyo4 (deleted)
lrwx------ 1 root  root  64 Mar 14 15:22 24 -> /tmp/ib48BNsV (deleted)
lrwx------ 1 root  root  64 Mar 14 15:23 164 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 165 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 166 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 167 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 168 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 169 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 170 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 171 -> anon_inode:[eventpoll]
...
...
...
lrwx------ 1 root  root  64 Mar 14 15:23 254 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 09:07 255 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 256 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 257 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 258 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 259 -> anon_inode:[eventpoll]
lrwx------ 1 root  root  64 Mar 14 15:23 100 -> /var/lib/mysql/user_201/trigger_stats_o.ibd
lrwx------ 1 root  root  64 Mar 14 15:24 1000 -> /var/lib/mysql/user_19979/stats_t.ibd
lrwx------ 1 root  root  64 Mar 14 15:24 10000 -> /var/lib/mysql/user_161362/session.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 100000 -> /var/lib/mysql/user_136452/trigger_event.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 100001 -> /var/lib/mysql/user_136452/notifications.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 100002 -> /var/lib/mysql/user_136452/customactivity.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 100003 -> /var/lib/mysql/user_136452/survey_response.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 100004 -> /var/lib/mysql/user_9803/contact.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 100005 -> /var/lib/mysql/user_9803/notifications.ibd
...
...
...
lrwx------ 1 root  root  64 Mar 14 15:27 99996 -> /var/lib/mysql/user_17582/splittest.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 99997 -> /var/lib/mysql/user_136966/contact.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 99998 -> /var/lib/mysql/user_136966/survey_response.ibd
lrwx------ 1 root  root  64 Mar 14 15:27 99999 -> /var/lib/mysql/user_136966/survey_response_value.ibd

Comment by Eugene [ 2020-03-17 ]

Please also find shutdown log attached. shutdown-mysqld.err
Note last line of log file

WSREP: Flushing memory map to disk...

After this nothing happens even if you leave it for several hours, so finally you have to kill process with SIGKILL

Comment by Eugene [ 2020-04-01 ]

Seems that issue is caused by thread handling mechanism.
In configuration file we used to have

thread_handling                         = pool-of-threads

Regardless of number of threads in pool (one cluster used 7000, then reduced to 2000 while second used 400), the issue with shutdown was present. However, problem suddenly appeared to be gone with removal of mentioned thread_handling line that in fact reverted to "one thread per connection". Shutdown was clean with 2000 permitted connections (over 1000 used).

Comment by Jan Lindström [ 2023-04-11 ]

euglorg Can you try with more recent version of MariaDB and Galera library. If you can reproduce this issue please provide full error log, output from show processlist, and if you can kill server and provide full stack trace.

Comment by Eugene [ 2023-04-11 ]

Hello Jan,
Thank you for your reply. The issue was stuck for 3 years, now we don't have mariadb-10.4 anywhere, instead, we have 10.6.
Was there any fix for this issue in 10.6 or is it expected that the issue is still present?
It seems from the discussion in MDEV-22116 that the issue was not investigated and fixed in fact.

Comment by Jan Lindström [ 2023-04-12 ]

euglorg It seems that because this issue priority was low it newer got to work queue. There has been so many fixes and improvements its impossible to me to say is this issue fixed. If you have not seen it in 10.6, sounds good.

Comment by Eugene [ 2023-04-12 ]

We haven't see it since the day we disabled `thread_handling = pool-of-threads` in configuration of (that time) 10.4. So in fact it's not clear whether it still affects newer versions of mariadb.

Comment by Jan Lindström [ 2023-04-20 ]

I used 10.4.29 and 3-node cluster and in my tests server shutdown normally with killall -TERM mysqld.

Generated at Thu Feb 08 09:09:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.