Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Duplicate
-
10.6.11, 10.6.12
-
None
-
OS : CentOS Linux release 7.9.2009 (Core)
Hardware : Supermicro SYS-5039MC-H12TRF
CPU : 12x Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz
RAM : 31 GB
Description
We have a lot of servers that are using this same set up and configuration, and on one of them recently we started having a really odd issue that I can't explain.
The short version is that one of the databases in the server stops responding. When we review the process list, it shows that queries are stuck in the "Opening tables" state. Other databases on this same server are still responding normally at this time.
Killing the SQL processes from the mysql command line doesn't work. No error, it just doesn't kill it.
Then, if we try to issue the standard systemctl restart mariadb, it looks like it tried to start shutting down, but never can. At that point, the process list looks like this (username and database name have been replaced):
| 36649 | USERNAME | localhost | DATABASE | Killed | 4044 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 | 0.000 |
|
| 36650 | USERNAME | localhost | DATABASE | Killed | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 | 0.000 |
|
| 36653 | USERNAME | localhost | DATABASE | Killed | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 | 0.000 |
|
...
|
...
|
...
|
Each time this has happened, we ended up having to issue a "kill -9" on the Mariadb master process to get it back. Once we did that, it started right back up and is running normally again.
I have found no errors at the system level and Mariadb is not even recording any errors. Logging is still working because Mariadb does log some things during the event, like when a user doesn't use the right password. But there are zero errors logged.
Just reviewing the documentation about this, it says that it could be caused by table_open_cache settings. On this server, there are about 1000 tables, including the tables in sys and mysql, etc. And we're using the default value of 2000 for table_open_cache right now. So I can't see how it could have anything to do with that setting.
It's a difficult issue to debug since there is basically no data recorded about the problem and it's basically just locked up when I get to it.
Any ideas?
Attachments
Issue Links
- duplicates
-
MDEV-26855 Enable spinning for log_sys_mutex and log_flush_order_mutex
- Closed
-
MDEV-29835 Partial server freeze
- Closed
- relates to
-
MDEV-30665 MariaDB 10.6.12 hang 2 days after upgrade
- Closed