[MDEV-30481] Hard lock up with queries in "Opening tables" state Created: 2023-01-26  Updated: 2023-05-12  Resolved: 2023-05-10

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.11, 10.6.12
Fix Version/s: 11.1.1, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.5, 10.11.4

Type: Bug Priority: Critical
Reporter: Brad Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 3
Labels: None
Environment:

OS : CentOS Linux release 7.9.2009 (Core)
Hardware : Supermicro SYS-5039MC-H12TRF
CPU : 12x Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz
RAM : 31 GB


Attachments: Text File disassemble-ha_innobase--info_low.txt     Text File ha_innobase__info_low-dis-rs.txt     Text File mariadbd_full_bt_all_threads_2023-03-06.txt    
Issue Links:
Duplicate
duplicates MDEV-26855 Enable spinning for log_sys_mutex and... Closed
duplicates MDEV-29835 Partial server freeze Closed
Relates
relates to MDEV-30665 MariaDB 10.6.12 hang 2 days after upg... Closed

 Description   

We have a lot of servers using this same setup and configuration, and recently one of them started having a really odd issue that I can't explain.

The short version is that one of the databases in the server stops responding. When we review the process list, it shows that queries are stuck in the "Opening tables" state. Other databases on this same server are still responding normally at this time.

Killing the SQL processes from the mysql command line doesn't work. There is no error; the processes just aren't killed.

Then, if we try to issue the standard systemctl restart mariadb, it looks like it starts shutting down but never finishes. At that point, the process list looks like this (username and database name have been replaced):

| 36649 | USERNAME | localhost | DATABASE | Killed  | 4044 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
| 36650 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
| 36653 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
...
...
...

Each time this has happened, we ended up having to issue a "kill -9" on the main MariaDB process to get it back. Once we did that, it started right back up and ran normally again.

I have found no errors at the system level, and MariaDB is not recording any errors either. Logging is still working, because MariaDB does log some things during the event, like when a user uses the wrong password. But there are zero errors logged.

Reviewing the documentation about this state, it says it could be caused by the table_open_cache setting. This server has about 1000 tables, including the tables in sys, mysql, etc., and we're using the default table_open_cache value of 2000. So I can't see how it could have anything to do with that setting.

It's a difficult issue to debug since basically no data is recorded about the problem, and the server is already locked up by the time I get to it.

Any ideas?



 Comments   
Comment by Daniel Black [ 2023-01-27 ]

Would it be possible to obtain https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-full-backtraces-for-all-threads-from-a-running-mariadbd-process with debug symbols installed on the next occurrence? It's normally pretty easy to see the deadlock with that information. Also consider saving a core with gcore for post-analysis.

Comment by Brad [ 2023-01-27 ]

Thanks so much for the reply! Yes, this is possible. I've installed all the necessary debug symbol packages, so we should be all set. This seems to happen about once every 3 weeks, so it may be a couple of weeks before I can get the info. I'll let you know what I find.

Comment by Brad [ 2023-03-06 ]

Hey @danblack,

This issue just happened this morning and we were able to get the backtrace data from the running instance. Would you mind taking a look to see if there is anything unusual in there? I didn't spot anything obvious, but maybe a better-trained eye will.

Thanks!

Comment by Daniel Black [ 2023-03-06 ]

of note:

Threads 33-70: select queries, calling ha_innobase::info_low with flags = HA_STATUS_NO_LOCK | HA_STATUS_VARIABLE - holding a lock
(most/all are SELECT ... wp_postmeta)

Thread 33 - large I_S looking query going over all tables lock in ha_innobase::info_low flag 282 - HA_STATUS_OPEN | HA_STATUS_VARIABLE | HA_STATUS_CONST | HA_STATUS_NO_LOCK (holding same lock as 33-70 - assuming the table lock on wp_postmeta)

Thread 31 table lock wait on "DELETE FROM wp_postmeta"

Thread 30 - in progress, unblocked, UPDATE wp_wc_product_meta_lookup , wp_postmeta

Thread 28 - show fields from `wp_postmeta` -> lock in ha_innobase::info_low flag 282 - HA_STATUS_OPEN | HA_STATUS_VARIABLE | HA_STATUS_CONST | HA_STATUS_NO_LOCK (holding same lock as 33-70 - assuming the table lock?)

How long would the Thread 30 query take (try a SELECT form of the same query for a comparison time)? Are you confident it's not just everything being blocked on that query? Would it cover the entire wp_postmeta table (what's its query plan)?

Comment by Marko Mäkelä [ 2023-03-07 ]

danblack, thank you for your analysis so far. InnoDB seems to currently ignore the flag HA_STATUS_NO_LOCK.

Can either you or wk_bradp please try to find the statement in ha_innobase::info_low() that is trying to acquire a lock? In the stack trace, we only see the partly inlined call to acquire some dict_index_t::lock or buf_page_t::lock. The output of the GDB command disassemble ha_innobase::info_low would be helpful.

Comment by Brad [ 2023-03-07 ]

Thanks for looking into this guys!

> How long would the Thread 30 query take (try a SELECT form of the same query for a comparison time)?
It looks like the data cuts off the full SQL query, so I'm not sure how I can reproduce that one and run it to find out. I'm not sure it's even getting that far, though, because the queries all show state "Opening tables".

> Are you confident its just not everything blocked on that query?
I'm pretty sure. The entire database gets locked when the issue happens: you can't execute any queries against any table in the database, and even just running "use DATABASENAME" causes the mysql client to lock up. Other databases on this same server are running fine at the time, though. Also, you can't kill the SQL commands; there are no errors, they just aren't killed when you issue the KILL command.

> Would it be covering the entire wp_postmeta table (what's its query plan)?
I'm not sure I understand this question.

I've also attached the "disassemble ha_innobase::info_low" info here as well.

Comment by Daniel Black [ 2023-03-08 ]

stack trace line numbers seem to match 10.6.12

Comment by Daniel Black [ 2023-03-08 ]

The offset from the stack trace addresses to the disassembly listing is −0x56138AD0B005.

0x000056138b9d8101 in ha_innobase::info_low corresponds to 0xccd0fc in ha_innobase__info_low-dis-rs.txt; it seems to follow after innobase_get_mysql_key_number_for_index, acquiring at /* Loop through each index of the table and lock them */

Comment by Marko Mäkelä [ 2023-03-08 ]

wk_bradp and danblack, thank you. In ha_innobase__info_low-dis-rs.txt I can see the following:

   0x0000000000ccd0fc <+3260>:	e8 df 81 0f 00	call   0xdc52e0 <_ZN14ssux_lock_implILb0EE7rd_waitEv>
   0x0000000000ccd101 <+3265>:	e9 f8 fb ff ff	jmp    0xccccfe <_ZN11ha_innobase8info_lowEjb+2238>

The code that it jumps to after acquiring a shared latch helpfully includes some source code:

/usr/src/debug/MariaDB-/src_0/storage/innobase/handler/ha_innodb.cc:
14661		if (size_in_header < FSP_EXTENT_SIZE) {
   0x0000000000ccccfe <+2238>:	48 8d 05 5b f5 66 01	lea    0x166f55b(%rip),%rax        # 0x233c260 <srv_page_size_shift>

Also this code is inlined. In ha_innobase::info_low() we have the following:

			space->s_lock();
			stats.delete_length = 1024
				* fsp_get_available_space_in_free_extents(
					*space);
			space->s_unlock();

It would seem to make sense to obey HA_STATUS_NO_LOCK and set stats.delete_length to a dummy value when the flag is set.

But, what is the thread that is holding the tablespace latch doing? Is it hung somewhere? What is it waiting for? Latches should not be held for more than a few milliseconds at a time.

Comment by Marko Mäkelä [ 2023-03-08 ]

I am happy to see that wk_bradp is using MariaDB 10.6.12 where part of MDEV-26855 was fixed by MDEV-30400. In mariadbd_full_bt_all_threads_2023-03-06.txt we have a nice stack trace of Thread 63 (Thread 0x7fda16ffd700 (LWP 16590)) that matches a known remaining hang of MDEV-29835. Thanks to the refactoring of mtr_t::m_memo that was revised in MDEV-30289, we can easily see that this thread is holding the exclusive tablespace latch that numerous ha_innobase::info_low() calls are eagerly waiting for:

{object = 0x56138e94d0c8, type = MTR_MEMO_SPACE_X_LOCK}

In the blocked threads, the address is slightly different, because fil_space_t::latch is not the first data member:

#4  ssux_lock_impl<false>::rd_wait (this=0x56138e94d170) at /usr/src/debug/MariaDB-/src_0/storage/innobase/sync/srw_lock.cc:392
#5  0x000056138b9d8101 in ha_innobase::info_low (this=0x7fd9e020acc0, flag=18, is_analyze=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/srw_lock.h:247

I don’t think that we should do anything about ha_innobase::info_low(). We must fix the remaining MDEV-29835 deadlocks instead.

Edit: Fixed typo for MDEV-29835.

Comment by Marko Mäkelä [ 2023-03-29 ]

wk_bradp, can you test this custom build that corresponds to this version of the final fix of MDEV-29835? These commits are identical with each other, only the base revision changed in between. Any recent development snapshot build of 10.6 is fine as well.

For another community user, the hangs went away. He said that his workload used to hang every couple of days, with at best 5 days between hangs.

Comment by Brad [ 2023-03-29 ]

Absolutely! I just updated it on that server. However, this only happens on this server about every 6 weeks, and we just had an event last week, so it will be a while before we know whether it fixed it.

I do have it updated and will let you know if we run into any more issues. Many thanks!

Comment by Marko Mäkelä [ 2023-03-30 ]

wk_bradp, thank you. I will leave this waiting for your feedback, and I will tentatively mark this as a duplicate of MDEV-29835.

The hangs are very sporadic, and the logic was actually broken already in MySQL 5.7 and MariaDB 10.2. Part of the trouble was fixed in MDEV-14637 years ago. In our internal testing, we started observing the hangs more often starting with 10.6.9, which included MDEV-21136.

For the record, you can find the latest development snapshot of a major branch if you go to https://buildbot.mariadb.org/#/grid?branch=10.6 and then click on the leftmost cell at the end, right next to the tarball-docker row header. Currently it would be https://buildbot.mariadb.org/#/builders/1/builds/33837 for the latest 10.6. The last part is the build number. You can find the packages for this build at https://ci.mariadb.org/33837 for various platforms. For your platform, the right file might be https://ci.mariadb.org/33837/amd64-centos-7-rpm-autobake/MariaDB.repo

A heavy user reported just today that on a patched server he got 11 days of uptime without hangs before he had to move the database to larger storage. Hangs used to occur every 1 or 2 days; the record was 5 days.

Comment by Brad [ 2023-03-30 ]

Thanks, Marko! The RPMs you listed are exactly what I updated to yesterday, so we should be all set. It's very infrequent for us, though. I'll set a reminder for myself to reply back in 8 weeks. If we haven't had the problem by then, I think we can call it fixed.

Comment by Richard Stanway [ 2023-05-01 ]

I believe I am also running into this (or a very similar issue) on a 10.11.2 server. The primary DB activity is a bursty series of inserts and updates to a single table from 8 connections every minute. Almost all server settings are at default beyond these:

innodb_buffer_pool_size = 64G
innodb_doublewrite = 0
innodb_file_per_table = ON
innodb_log_write_ahead_size = 16384
innodb_use_native_aio = 0
innodb_use_atomic_writes = 0
innodb_flush_neighbors = 0
innodb_io_capacity = 1000
innodb_io_capacity_max = 2500
innodb_flush_log_at_trx_commit = 2

Thread stacks:
https://gist.githubusercontent.com/notr1ch/c37d05f3c537c5c3f6a3c1c4d53c43ea/raw/0f271da2deb43c166b9558a3250f2953c1e4007a/gistfile1.txt

I also have a core file if it's helpful to get more information from.

Comment by Richard Stanway [ 2023-05-10 ]

I've been using the custom build from https://ci.mariadb.org/33215/ since my last comment and the problem did not reproduce (previously it happened again within 4 hours of the server restart). I've just upgraded to 10.11.3 which seems to include this fix, and hopefully this problem is now solved.

Comment by Brad [ 2023-05-10 ]

It hasn't quite been 6 weeks since we started using the patched version, but I think we're close enough to call this fixed. Thanks for all of your hard work! I'll report back if we do end up having issues again.

Comment by Marko Mäkelä [ 2023-05-10 ]

Thank you, wk_bradp. Coincidentally, MariaDB Server 10.6.13 was just released today.

Generated at Thu Feb 08 10:16:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.