[MDEV-30481] Hard lock up with queries in "Opening tables" state Created: 2023-01-26 Updated: 2023-05-12 Resolved: 2023-05-10
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.11, 10.6.12 |
| Fix Version/s: | 11.1.1, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.5, 10.11.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Brad | Assignee: | Marko Mäkelä |
| Resolution: | Duplicate | Votes: | 3 |
| Labels: | None |
| Environment: | OS: CentOS Linux release 7.9.2009 (Core) |
| Attachments: | |
| Issue Links: | |
| Description |
We have a lot of servers using this same setup and configuration, and one of them recently started having a really odd issue that I can't explain. The short version is that one of the databases on the server stops responding. When we review the process list, it shows queries stuck in the "Opening tables" state. Other databases on the same server still respond normally at this time. Killing the SQL processes from the mysql command line doesn't work: no error, it just doesn't kill them. If we then issue the standard systemctl restart mariadb, it looks like it starts shutting down, but it never finishes. At that point, the process list looks like this (username and database name have been replaced):
Each time this has happened, we ended up having to issue a "kill -9" on the MariaDB master process to get it back. Once we did that, it started right back up and ran normally again. I have found no errors at the system level, and MariaDB is not recording any errors either. Logging is still working, because MariaDB does log some things during the event, such as when a user supplies the wrong password, but there are zero errors logged.

Reviewing the documentation about this state, it says it could be caused by the table_open_cache setting. This server has about 1000 tables, including the tables in sys and mysql, and we're using the default table_open_cache of 2000, so I can't see how it could have anything to do with that setting. It's a difficult issue to debug since there is basically no data recorded about the problem, and the server is essentially locked up by the time I get to it. Any ideas?
| Comments |
| Comment by Daniel Black [ 2023-01-27 ] |
On the next occurrence, would it be possible to obtain full backtraces as described in https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-full-backtraces-for-all-threads-from-a-running-mariadbd-process, with debug symbols installed? It's normally pretty easy to see the deadlock with that information. Also consider saving a core with gcore for post-analysis.
| Comment by Brad [ 2023-01-27 ] |
Thanks so much for the reply! Yes, this is possible. I've installed all the necessary debug symbol packages, so we should be all set. This seems to happen about once every 3 weeks, so it may be a couple of weeks before I can get the info. I'll let you know what I find.
| Comment by Brad [ 2023-03-06 ] |
Hey @danblack, this issue happened again this morning and we were able to get the backtrace data from the running instance. Would you mind taking a look to see if there is anything unusual in there? I didn't spot anything obvious, but maybe a better-trained eye will. Thanks!
| Comment by Daniel Black [ 2023-03-06 ] |
Of note:

- Threads 33-70: select queries, calling ha_innobase::info_low with flags = HA_STATUS_NO_LOCK | HA_STATUS_VARIABLE, holding a lock.
- Thread 33: a large I_S-looking query going over all tables; lock in ha_innobase::info_low with flag 282 = HA_STATUS_OPEN | HA_STATUS_VARIABLE | HA_STATUS_CONST | HA_STATUS_NO_LOCK (holding the same lock as threads 33-70; assuming the table lock on wp_postmeta).
- Thread 31: table lock wait on "DELETE FROM wp_postmeta".
- Thread 30: in progress, unblocked; UPDATE on wp_wc_product_meta_lookup, wp_postmeta.
- Thread 28: show fields from `wp_postmeta`; lock in ha_innobase::info_low with flag 282 = HA_STATUS_OPEN | HA_STATUS_VARIABLE | HA_STATUS_CONST | HA_STATUS_NO_LOCK (holding the same lock as threads 33-70; assuming the table lock?).

Some questions:

- How long would the Thread 30 query take (try a SELECT form of the same query for a comparison time)?
- Are you confident it's not just everything blocked on that query?
- Would it cover the entire wp_postmeta table (what's its query plan)?
| Comment by Marko Mäkelä [ 2023-03-07 ] |
danblack, thank you for your analysis so far. InnoDB currently seems to ignore the flag HA_STATUS_NO_LOCK. Could either you or wk_bradp please try to find the statement in ha_innobase::info_low() that is trying to acquire a lock? In the stack trace, we only see a partly inlined call to acquire some dict_index_t::lock or buf_page_t::lock. The output of the GDB command "disassemble ha_innobase::info_low" would be helpful.
| Comment by Brad [ 2023-03-07 ] |
Thanks for looking into this, guys!

> How long would the Thread 30 query take (try a SELECT form of the same query for a comparison time)?
> Are you confident it's not just everything blocked on that query?
> Would it cover the entire wp_postmeta table (what's its query plan)?

I've also attached the "disassemble ha_innobase::info_low" output here as well.
| Comment by Daniel Black [ 2023-03-08 ] |
The stack trace line numbers seem to match 10.6.12.
| Comment by Daniel Black [ 2023-03-08 ] |
The stack trace address to disassemble is 0x000056138b9d8101 in ha_innobase::info_low; minus the load base 0x56138AD0B005, that is offset 0xccd0fc. See ha_innobase__info_low-dis-rs.txt.
| Comment by Marko Mäkelä [ 2023-03-08 ] |
wk_bradp and danblack, thank you. In ha_innobase__info_low-dis-rs.txt:

The code that it jumps to after acquiring a shared latch helpfully includes some source code:

Also, this code is inlined. In ha_innobase::info_low() we have the following:

It would seem to make sense to obey HA_STATUS_NO_LOCK and set stats.delete_length to a dummy value when the flag is set. But what is the thread that is holding the tablespace latch doing? Is it hung somewhere? What is it waiting for? Latches should not be held for more than a few milliseconds at a time.
| Comment by Marko Mäkelä [ 2023-03-08 ] |
I am happy to see that wk_bradp is using MariaDB 10.6.12, where part of

In the blocked threads, the address is slightly different, because fil_space_t::latch is not the first data member:

I don't think that we should do anything about ha_innobase::info_low(). We must fix the remaining

Edit: Fixed typo for
| Comment by Marko Mäkelä [ 2023-03-29 ] |
wk_bradp, can you test this custom build, which corresponds to this version of the final fix of

For another community user, the hangs went away. He said that his workload used to hang every couple of days, with at best 5 days without a hang.
| Comment by Brad [ 2023-03-29 ] |
Absolutely! I just updated it on that server. However, this only happens on this server about every 6 weeks, and we just had an event last week, so it will be a while before we know whether it fixed the issue. I do have it updated and will let you know if we run into any more issues. Many thanks!
| Comment by Marko Mäkelä [ 2023-03-30 ] |
wk_bradp, thank you. I will leave this waiting for your feedback, and I will tentatively mark this as a duplicate of

The hangs are very sporadic, and the logic was actually broken already in MySQL 5.7 and MariaDB 10.2. Part of the trouble was fixed in

For the record, you can find the latest development snapshot of a major branch if you go to https://buildbot.mariadb.org/#/grid?branch=10.6 and then click on the leftmost cell at the end, right next to the tarball-docker row header. Currently that would be https://buildbot.mariadb.org/#/builders/1/builds/33837 for the latest 10.6; the last part is the build number. You can find the packages for this build at https://ci.mariadb.org/33837 for various platforms. For your platform, the right file might be https://ci.mariadb.org/33837/amd64-centos-7-rpm-autobake/MariaDB.repo

A heavy user reported just today that on a patched server he got 11 days of uptime without hangs before he had to move the database to larger storage. Hangs used to occur every 1 or 2 days; the record was 5 days.
| Comment by Brad [ 2023-03-30 ] |
Thanks, Marko! The RPMs you listed are exactly what I updated to yesterday, so we should be all set. It's very infrequent for us, though. I'll set a reminder for myself to reply back in 8 weeks; if we haven't had the problem by then, I think we can call it fixed.
| Comment by Richard Stanway [ 2023-05-01 ] |
I believe I am also running into this (or a very similar issue) on a 10.11.2 server. The primary DB activity is a bursty series of inserts and updates to a single table from 8 connections every minute. Almost all server settings are at their defaults beyond these:

Thread stacks:

I also have a core file if it would help to extract more information.
| Comment by Richard Stanway [ 2023-05-10 ] |
I've been using the custom build from https://ci.mariadb.org/33215/ since my last comment, and the problem has not reproduced (previously it happened again within 4 hours of the server restart). I've just upgraded to 10.11.3, which seems to include this fix, so hopefully this problem is now solved.
| Comment by Brad [ 2023-05-10 ] |
It hasn't quite been 6 weeks since we started using the patched version, but I think we are close enough to call this fixed in my eyes. Thanks for all of your hard work! I'll report back if we do end up having issues again.
| Comment by Marko Mäkelä [ 2023-05-10 ] |
Thank you, wk_bradp. Coincidentally, MariaDB Server 10.6.13 was just released today.