[MDEV-22116] Not able to shutdown MariaDB after upgrade. Happening randomly. Created: 2020-04-02  Updated: 2020-08-04  Resolved: 2020-07-28

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.4.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: prasad Assignee: Vladislav Vaintroub
Resolution: Incomplete Votes: 1
Labels: shutdown
Environment:

Centos 6.10


Attachments: Text File bashtrace.txt     Text File killmysqld.txt     Text File mysql.server_when_hang.log     File mysqld.err     Text File mysqld_bt_all_threads.txt     Text File mysqld_full_bt_all_threads_when_hang.txt     File mysqld_when_hang.err    
Issue Links:
Relates
relates to MDEV-21707 Mariadb doesn't exit correctly Closed

 Description   

It was observed that mysql.server stop not working all the times. It seems get stuck in loop. We have upgraded Maria DB 10.3.15 to 10.4.12 and observing this issue. Here are some more details:
Environment : Centos 6.10
Attaching "bash -x /opt/proofpoint/current/opt/mysql/support-files/mysql.server stop" output (bashtrace.txt)

Also did strace on mysqld pid

[testing@rajxxx abcdef]$ ps aux | grep mysql
testing      16488  0.0  0.0 103328   896 pts/1    S+   20:23   0:00 grep mysql
testing      21007  0.0  0.1 125820 10652 ?        S    10:23   0:00 /bin/sh /opt/testing/testing-7.2.y/opt/mysql/bin/mysqld_safe --defaults-file=/opt/testing/testing-7.2.y/etc/db/my.cnf --ledir=/opt/testing/testing-7.2.y/opt/mysql/bin --datadir=/opt/testing/testing-7.2.y/opt/mysql/data --pid-file=/opt/testing/testing-7.2.y/var/run/mysqld.pid
testing      21297  0.2  3.6 2608196 302196 ?      Sl   10:23   1:37 /opt/testing/testing-7.2.y/opt/mysql/bin/mysqld --defaults-file=/opt/testing/testing-7.2.y/etc/db/my.cnf --basedir=/opt/testing/testing-7.2.y/opt/mysql --datadir=/opt/testing/testing-7.2.y/opt/mysql/data --plugin-dir=/opt/testing/testing-7.2.y/opt/mysql/lib/plugin --log-error=/opt/testing/testing-7.2.y/var/log/db/mysqld.err --pid-file=/opt/testing/testing-7.2.y/var/run/mysqld.pid --socket=/opt/testing/testing-7.2.y/opt/mysql/data/mysql.sock --port=3306
[testing@rajxxx abcdef]$
============
strace -p 21297   getting stuck at below
-------
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
-----------



 Comments   
Comment by prasad [ 2020-04-02 ]

It seems relevant to MDEV-21707

Also note that I checked all tables and corruption to make sure it is clean using mysqlcheck before shutdown.

Comment by Elena Stepanova [ 2020-04-05 ]

Do you also have thread_handling = pool-of-threads (as in MDEV-21707), and does the problem disappear if you remove the setting?

Comment by prasad [ 2020-04-06 ]

We have thread_handling=one-thread-per-connection (default) and never changed. It also seems happening only for master and not slaves. When do master upgrade, we also make sure we enable skip-slave-start, skip-networking to completely disconnect communication from slaves before upgrade. Is there anything you want me to try? Is there any other configs that is preventing clean master shutdown?

Comment by prasad [ 2020-04-12 ]

I have also tried other options "mysqladmin shutdown" also 'shutdown wait for all slaves'. I still see same issue. For some reason, master was unable to do clean shutdown.

Comment by Elena Stepanova [ 2020-04-13 ]

Can you get all threads' stack trace from the hanging server (mysqld process)?
Please also attach the server error log.

Comment by prasad [ 2020-04-15 ]

Please find attached logs
mysqld_bt_all_threads.txt mysqld.err

Comment by Elena Stepanova [ 2020-04-16 ]

Thanks.
The only [non-]activity that I see in the stack trace seems to relate to binary logging. I'm far from sure it's the culprit, but I guess Elkin gets to look at it first.

Comment by Andrei Elkin [ 2020-04-17 ]

There's nothing replication thread specific in the stacktrace. We can see that the main thread is still around to poll fd:s. I can't explain that, but can offer to consider what if on this CENT OS 6.0 the server is built with DONT_USE_THR_ALARM so that this otherwise "default" part of SIGTERM handling
#ifndef DONT_USE_THR_ALARM
if (pthread_kill(select_thread, thr_client_alarm))
break; // allready dead
#endif
did not work at all over there.

Comment by Elena Stepanova [ 2020-04-17 ]

I guess it belongs to runtime then. I'll assign it to sanja as the runtime team lead, but maybe wlad could pick it up.

Comment by Vladislav Vaintroub [ 2020-04-17 ]

vrpprasad, can you next time try switch on strace, and "kill pid-of-mysqld" . To me it does not look like server was informed about shutdown at all

Comment by Sergey Vojtovich [ 2020-04-17 ]

vrpprasad, how was this binary compiled?

Comment by prasad [ 2020-04-20 ]

@sergey : No. we didn't compile and we downloaded binaries from https://downloads.mariadb.org/MariaDB/.

Comment by prasad [ 2020-04-21 ]

killmysqld.txt

@vladislav : I put "strace -p <mysqldpid>" and killed mysql. Please find the strace log.

Comment by Vladislav Vaintroub [ 2020-04-21 ]

vrpprasad, did the server hang during this attempt? do you have a corresponding gdb if it did not shutdown?

Comment by Andrei Elkin [ 2020-04-21 ]

vrpprasad Thanks for the new upload! Sorry we did not specify how much of the error the snippet should contain. It's too little in the 2 lines.. Sadly.
Could you please send us a piece of error log that contains the server startup and all that followed through the hanging shutdown?

Many thanks!

Andrei

Comment by prasad [ 2020-04-22 ]

@andrei and @Vladislav.
I am attaching all logs when it hang. mysql.server_when_hang.log mysqld_full_bt_all_threads_when_hang.txt mysqld_when_hang.err

Comment by Andrei Elkin [ 2020-04-27 ]

vrpprasad The stacktrace details get much harder to perceive 'cos of unresolved (??)
symbols all around. Just compare for yourself with the original report's one.
We managed to agree on some picture of actual threads at the snapshot time.
But still it largely remains unclear.

You should find a way to get the symbols back
(gdb) symbol-file
seems to be your option.

Secondly, I believe that the error log is taken from the server run
in log_warnings=1 even though 3 was requested.

Could you please take that under control and provide us with

SELECT @@global.log_warnings

report which shall clear out doubts about the reason of lack of error log messages
for us critical to know.

So could you please repeat it over again with the resolved symbols and real @@global.log_warnings = 3?

Thank you.

Andrei

Comment by Andrei Elkin [ 2020-04-28 ]

vrpprasad, Hi.

Thanks for bring more data for analysis! Yet the latest errorlog and stacktrace
https://jira.mariadb.org/secure/attachment/51369/mysqld_when_hang.err
https://jira.mariadb.org/secure/attachment/51368/mysqld_full_bt_all_threads_when_hang.txt
does not look to match.
In the error log there are actually traces of @@global.log_warnings=3, as it was requested, but
there are no traces of the Dump thread ever started on the server!
That is no line like

2020-04-28 17:58:39 9 [Note] Start binlog_dump to slave_server(3), pos(master-bin.000026, 343)

which must be there already with @@global.log_warnings=2, but the stacktrace has the binlog dump thread's stack.
I am afraid we still need these two pieces in proper order.

Also could try out
innodb_fast_shutdown=1
to clear out innodb involvement?

Regards,

Andrei

Comment by Sergey Vojtovich [ 2020-04-30 ]

I believe this issue worth investigating as soon as there's interest on both sides.
It is unclear if InnoDB purge threads were active (can we make them faster?) or stuck .
Lack of diagnostics is unfortunate.

Comment by prasad [ 2020-05-05 ]

HI @Sergey and @Andrei,
innodb_fast_shutdown=1 is fixing the issue. All the logs asked by support team provided in https://support.mariadb.com/view.php?id=37849. Please let me know if you need anything else.

Comment by Vladislav Vaintroub [ 2020-07-28 ]

not sure if there is anything more to do.
Shutdown in 10.4 takes a longer time, but it does not hang

Generated at Thu Feb 08 09:12:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.