[MDEV-30481] Hard lock up with queries in "Opening tables" state

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Duplicate
Affects Version/s: 10.6.11, 10.6.12
Fix Version/s: 10.6.13, 10.8.8, 10.9.6, 11.0.2, 11.1.1, 10.10.5, 10.11.4
Component/s: Storage Engine - InnoDB
Labels: None
Environment:
OS : CentOS Linux release 7.9.2009 (Core)
Hardware : Supermicro SYS-5039MC-H12TRF
CPU : 12x Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz
RAM : 31 GB

Description

We have many servers that use this same setup and configuration, and on one of them we recently started having a really odd issue that I can't explain.

The short version is that one of the databases in the server stops responding. When we review the process list, it shows that queries are stuck in the "Opening tables" state. Other databases on this same server are still responding normally at this time.

Killing the SQL processes from the mysql command line doesn't work. There is no error; the processes just aren't killed.

Then, if we try to issue the standard systemctl restart mariadb, it looks like the server starts shutting down but never finishes. At that point, the process list looks like this (username and database name have been replaced):

| 36649 | USERNAME | localhost | DATABASE | Killed  | 4044 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
| 36650 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
| 36653 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
...
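When a hang like this is caught in progress, it can help to summarize the process list instead of reading it row by row. The following is a minimal sketch, assuming the rows are pipe-delimited text in the MariaDB SHOW PROCESSLIST column order (Id, User, Host, db, Command, Time, State, Info, Progress) and that the Info text contains no pipe characters. The function names are hypothetical, and the sample rows are the three shown above.

```python
from collections import Counter

# Column order of MariaDB's SHOW PROCESSLIST output.
COLUMNS = ["id", "user", "host", "db", "command", "time", "state", "info", "progress"]


def parse_processlist_row(line):
    """Split one pipe-delimited process list row into a dict keyed by column name.

    Assumes the Info text itself contains no '|' characters.
    """
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return dict(zip(COLUMNS, cells))


def summarize(lines):
    """Count threads per (command, state) pair and report the longest wait in seconds."""
    rows = [parse_processlist_row(l) for l in lines if l.strip().startswith("|")]
    by_state = Counter((row["command"], row["state"]) for row in rows)
    longest = max((int(row["time"]) for row in rows), default=0)
    return by_state, longest


QUERY = ("SELECT post_id, meta_key, meta_value FROM wp_postmeta "
         "WHERE post_id IN (3210213,3209688,1894564,1578")

SAMPLE = [
    f"| 36649 | USERNAME | localhost | DATABASE | Killed  | 4044 | Opening tables | {QUERY} |    0.000 |",
    f"| 36650 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | {QUERY} |    0.000 |",
    f"| 36653 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | {QUERY} |    0.000 |",
]

by_state, longest = summarize(SAMPLE)
for (command, state), count in by_state.items():
    print(f"{count} thread(s): command={command!r}, state={state!r}")
print(f"longest wait: {longest} s")
```

For the three sample rows, every thread is in the Killed command with the "Opening tables" state, which is the pattern described in this report: the kill was registered but the threads never left the table-opening step.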

Each time this has happened, we ended up having to issue a "kill -9" on the MariaDB master process to get it back. Once we did that, it started right back up and ran normally again.

I have found no errors at the system level, and MariaDB is not recording any errors either. Logging is still working, because MariaDB does log some things during the event, such as when a user doesn't use the right password. But there are zero errors logged.

Just reviewing the documentation about this, it says that it could be caused by table_open_cache settings. On this server, there are about 1000 tables, including the tables in sys and mysql, etc. And we're using the default value of 2000 for table_open_cache right now. So I can't see how it could have anything to do with that setting.
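One way to test the table cache theory with numbers rather than table counts is to compare the Open_tables and Opened_tables status counters against table_open_cache. The sketch below is only a heuristic under stated assumptions: the function names, the rate threshold of 1 open per second, and the sample counter values are all hypothetical and are not taken from this server. It takes values that would be read with SHOW GLOBAL STATUS and does not connect to a database.

```python
def opened_tables_rate(first_sample, second_sample, interval_seconds):
    """Average table opens per second between two samples of Opened_tables."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (second_sample - first_sample) / interval_seconds


def cache_looks_undersized(open_tables, table_open_cache, opens_per_second,
                           rate_threshold=1.0):
    """Heuristic: the cache is full AND tables keep being reopened.

    rate_threshold is a hypothetical cutoff, not a documented MariaDB value.
    """
    cache_full = open_tables >= table_open_cache
    return cache_full and opens_per_second > rate_threshold


# Hypothetical samples: Opened_tables read twice, 60 seconds apart.
rate = opened_tables_rate(first_sample=1500, second_sample=1500, interval_seconds=60)

# table_open_cache = 2000 is the default mentioned above; open_tables is made up.
print(cache_looks_undersized(open_tables=1000, table_open_cache=2000,
                             opens_per_second=rate))
```

If the cache is not full and Opened_tables is not climbing, the cache setting is an unlikely cause, which matches the reporter's reasoning.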

It's a difficult issue to debug since there is basically no data recorded about the problem and the server is already locked up by the time I get to it.

Any ideas?

Attachments

disassemble-ha_innobase--info_low.txt (49 kB, 2023-03-07 15:13)
ha_innobase__info_low-dis-rs.txt (93 kB, 2023-03-08 06:11)
mariadbd_full_bt_all_threads_2023-03-06.txt (462 kB, 2023-03-06 17:37)

Issue Links

duplicates
MDEV-26855 Enable spinning for log_sys_mutex and log_flush_order_mutex (Closed)
MDEV-29835 Partial server freeze (Closed)

relates to
MDEV-30665 MariaDB 10.6.12 hang 2 days after upgrade (Closed)

Activity

13 older comments are not shown.

Brad added a comment - 2023-03-30 13:40

Thanks, Marko! The RPMs you listed are exactly what I updated to yesterday, so we should be all set. The problem is very infrequent for us, though. I'll set a reminder for myself to reply back in 8 weeks. If we haven't had the problem by then, I think we can call it fixed.

Richard Stanway added a comment - 2023-05-01 14:59 - edited

I believe I am also running into this (or a very similar issue) on a 10.11.2 server. The primary DB activity is a bursty series of inserts and updates to a single table from 8 connections every minute. Almost all server settings are at default beyond these:

innodb_buffer_pool_size = 64G
innodb_doublewrite = 0
innodb_file_per_table = ON
innodb_log_write_ahead_size = 16384
innodb_use_native_aio = 0
innodb_use_atomic_writes = 0
innodb_flush_neighbors = 0
innodb_io_capacity = 1000
innodb_io_capacity_max = 2500
innodb_flush_log_at_trx_commit = 2

Thread stacks:
https://gist.githubusercontent.com/notr1ch/c37d05f3c537c5c3f6a3c1c4d53c43ea/raw/0f271da2deb43c166b9558a3250f2953c1e4007a/gistfile1.txt

I also have a core file if it's helpful to get more information from.

Richard Stanway added a comment - 2023-05-10 14:32

I've been using the custom build from https://ci.mariadb.org/33215/ since my last comment and the problem did not reproduce (previously it happened again within 4 hours of the server restart). I've just upgraded to 10.11.3 which seems to include this fix, and hopefully this problem is now solved.

Brad added a comment - 2023-05-10 14:36

It hasn't quite been 6 weeks since we started using the patched version, but I think we are close enough to call this fixed. Thanks for all of your hard work! I'll report back if we do end up having issues again.

Marko Mäkelä added a comment - 2023-05-10 19:14

Thank you, wk_bradp. Coincidentally, MariaDB Server 10.6.13 was just released today.
