[MDEV-26871] Shutdown hangs after upgrade to 10.5.12 Created: 2021-10-20  Updated: 2022-02-21  Resolved: 2021-11-08

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.5.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Yakov Kushnirsky Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: Text File mysqld_full_bt_all_threads-node1.txt     Text File mysqld_full_bt_all_threads-node2-1.txt     Text File mysqld_full_bt_all_threads-node2-2.txt     Text File mysqld_full_bt_all_threads-node2.txt     Text File ssq-3-2021-10-15-09-56.saas.appiancloud.com-20211019-mysqld_bt_all_threads.txt    
Issue Links:
Relates
relates to MDEV-22340 Server hangs on ibuf_merge_or_delete_... Closed
relates to MDEV-23839 innodb_fast_shutdown=0 hang on change... Closed

 Description   

Shutdown hangs after upgrade to 10.5.12

Shutdown hangs was reported on some sites after upgrade to 10.5.12 (from 10.5.10).
This is semisync replication. Master was started and running while 2 slaves hang on shutdown (done with innodb_fast_shutdown=0)
Configuration parameters are same between those site that work and does not.
Backtraces are attached here.

I am trying to get more details, but according to initial report some nodes work without problems, while other hangs (for days).

Some of parameters are:

MemTotal: 32471016 kB(30.97G)

innodb_buffer_pool_size 4294967296 (4GB)
tmp_table_size 16777216 (16MB) <--I asked bump up to 134217728 (128MB). But it was rejected since "worked before upgrade"
Max_used_connections 32

innodb_autoinc_lock_mode 1
innodb_flush_log_at_timeout 1



 Comments   
Comment by Daniel Black [ 2021-10-20 ]

Nice stack trace. Did you kill this off with a kill -6 ? Its just odd that storage/innobase/include/data0type.ic +301 doesn't include an assertion.

Recommend if you see this again, take two backtraces of the running process with some time interval between them. That can expose if progress is being made on whatever taks its performing. https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-backtraces-from-a-running-mysqld-process-with-gdb-on-linux

What innodb log file size is on this?

Probably not related, what are all the java threads here from?

Comment by Marko Mäkelä [ 2021-10-21 ]

In mysqld_full_bt_all_threads-node1.txt the only InnoDB thread is the idle page cleaner thread, in an indefinite sleep:

Thread 34 (Thread 0x7f4a4e196700 (LWP 17181)):
#0  0x00007f4b71528a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000055ee7a6eae96 in buf_flush_page_cleaner () at /usr/src/debug/MariaDB-/src_0/storage/innobase/buf/buf0flu.cc:2238

This means that no shutdown has been initiated to InnoDB. The cause of the hang must be something else. I did not look at the non-InnoDB threads. I leave that analysis to Elkin’s team.

In mysqld_full_bt_all_threads-node2.txt I see that apart from the page cleaner being idle, Thread 1 is executing InnoDB shutdown. That change buffer merge code should only be executed if you have set innodb_fast_shutdown=0. It looks like possible self-inflicted damage to me. See MDEV-23755 for discussion why that setting should not be needed.

Comment by Sergei Golubchik [ 2021-11-02 ]

YK, are those two stack traces indeed, as danblack asked, from two different points in time in the same process, during shutdown ? It seems to be the answer is "no" and "no"?

Comment by Yakov Kushnirsky [ 2021-11-02 ]

serg, you are correct: "no" and "no". These 2 most recent traces (that were taken on 10/21 without sending a kill) are from 2 different nodes: mysqld_full_bt_all_threads-node1 and -node2.
Let me check if the 10.5.12 environment where problem was observed is still available.

Comment by Yakov Kushnirsky [ 2021-11-03 ]

2 new traces, taken during shutdown, on the same node (running v. 10.5.12) a few minutes apart for the same process just uploaded:

  • mysqld_full_bt_all_threads-node2-1
  • mysqld_full_bt_all_threads-node2-2
Comment by Sergei Golubchik [ 2021-11-04 ]

marko, see above. Shutdown is apparently spending a lot of time in srv_shutdown ->... -> ibuf_merge_pages.

So, indeed, looks like your innodb_fast_shutdown=0 case. And the fix is not to use =0 here.

Comment by Sergei Golubchik [ 2021-11-08 ]

It seems to be not a bug. If you want InnoDB shutdown to be fast, you'd better set innodb_fast_shutdown=1. Otherwise you allow it to be slow and sometimes it can be.

Comment by Marko Mäkelä [ 2021-11-08 ]

There used to be a belief that innodb_fast_shutdown=0 would be needed before upgrades. That is unnecessary, as I have noted in the updated Description of MDEV-23755.

Generated at Thu Feb 08 09:48:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.