[MDEV-24378] Crashes on Semaphore wait > 600 seconds Created: 2020-12-09 Updated: 2023-02-19 Resolved: 2021-04-18 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.4.17 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jacob Williams | Assignee: | Marko Mäkelä |
| Resolution: | Incomplete | Votes: | 15 |
| Labels: | need_feedback |
| Environment: |
4.14.203-116.332.amzn1.x86_64 |
||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
Since upgrading to 10.4.17 we have been getting long semaphore waits which frequently result in a crash. Sometimes they resolve before a crash, but they still cause lots of lock wait timeouts and other related problems. The most frequently referenced latches are in btr0cur.cc and dict0dict.cc. We have 4 production and 4 staging servers, each with a different workload, and we have seen these errors on at least 3 of the production servers and 2 of the staging servers. It tends to happen during high write load, but it has also happened when load was fairly low. It has occurred at least 8 times in our production cluster since the upgrade to 10.4.17, which we installed on 2020/11/21.
Prior to 10.4.17 we had been on 10.4.14. That version had been having unexplained crashes following a similar pattern, but we were not getting the semaphore wait warnings in our logs. Those crashes had started with the 10.4.14 upgrade but, with no way to reproduce them and no significant log messages, we had not filed a bug report yet. I suspect it was the same issue, but I don't know how to test that theory.
There are warnings with InnoDB monitor output numerous times preceding the crash. The list of stuck open queries gets quite long as it approaches the crash, so I figure the earliest warning is the one most likely to be helpful. I have censored the query strings due to sensitive data.
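For future incidents, the earliest monitor output can be preserved with a periodic snapshot. This is only a sketch under assumptions (mysql client on the PATH, credentials in ~/.my.cnf), not a statement about how the attached output was produced:

```shell
# Sketch: append a timestamped InnoDB status snapshot (including the
# SEMAPHORES section) to a log file, suitable for running from cron.
# Assumption: the mysql client can reach the server via ~/.my.cnf credentials.
out=innodb_status.log
if command -v mysql >/dev/null 2>&1; then
    date >> "$out"
    mysql -e 'SHOW ENGINE INNODB STATUS\G' >> "$out" 2>&1 || true
    echo "snapshot appended to $out"
else
    echo "mysql client not found; skipping"
fi
```

Running this every minute around the time of an incident keeps the earliest (and usually most useful) semaphore warning even after the server has crashed and restarted.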
Potentially related issues? |
| Comments |
| Comment by Marko Mäkelä [ 2020-12-09 ] |
|
jacob.williams, can you please try to produce a stack trace of all threads during the hang? This report could be a duplicate of
Side note: in MariaDB 10.6, I plan to remove the InnoDB semaphore output as part of |
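The stack-trace capture requested above can be sketched as follows. The process names, and the need for gdb plus matching debuginfo packages, are assumptions about the environment; attaching gdb briefly pauses the server:

```shell
# Sketch: capture a backtrace of all server threads during the hang.
# Assumption: the server process is named mariadbd or mysqld (adjust if not)
# and gdb plus the matching debuginfo packages are installed.
pid=$(pgrep -x mariadbd 2>/dev/null | head -n 1)
[ -n "$pid" ] || pid=$(pgrep -x mysqld 2>/dev/null | head -n 1)
if [ -n "$pid" ]; then
    # Attaching pauses the server for the duration of the dump.
    gdb --batch -p "$pid" \
        -ex 'set pagination off' \
        -ex 'thread apply all backtrace' > all_threads.txt 2>&1
    echo "backtrace written to all_threads.txt"
else
    echo "no mariadbd/mysqld process found"
fi
```

The resulting all_threads.txt is the kind of artifact later attached to this ticket (e.g. oc_1_threads.txt, gdb-thread-apply-all-bt.txt.gz).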
| Comment by Octavian Cerna [ 2020-12-14 ] |
|
I can confirm: I'm seeing long semaphore waits on multiple machines running both 10.3.27 and 10.4.17. I have attached a monitor output (oc_1_innodb.txt) and a gdb backtrace (oc_1_threads.txt) of a server.
There was a single query running, waiting for a RW-lock. Looking at the backtrace of the thread holding that RW-lock
This led me to believe that the background stats update is starving the query threads by holding a write lock to
I suspect that this high contention on that internal mutex might be related to
I'm seeing similar behaviour on a server running 10.3.27: long semaphore waits in btr0cur.cc resulting in crashes. |
| Comment by harri korpi [ 2020-12-23 ] |
|
Hello, the MariaDB log shows the crash as:
Would you be able to confirm whether this is the same or a different issue? Thank you. |
| Comment by Mike Ingle [ 2020-12-31 ] |
|
We had this problem on 10.4.17, Ubuntu 18.04 x86_64. On one server I could not even load a large 250GB import file from mysqldump, as it would throw the >600 error and restart at about the 25% mark every time. I downgraded that machine to 10.4.15 and was able to load it.
Our production server had been upgraded to 10.4.17 to avoid another issue with a complex stored procedure causing a restart. It was getting long semaphore timeouts causing rollbacks every day and eventually had a >600 restart during a large batch insert. We got a lot of log entries like this:
The lock was always from dict0stats.cc, which handles table/index statistics. This file was modified heavily between 10.4.15 and 10.4.17. I turned off table/index statistics: set global innodb_stats_persistent = off; The errors, which had been appearing nightly, did not happen last night.
I think what is happening is that when some percentage of the table has been changed, MariaDB decides to update the statistics. Because the insert or update is still going on, the statistics update gets into a deadlock with the insert/update and crashes the server. It looks like dict0stats.cc has a lot of exception bailouts in it. Perhaps one of them is not releasing a lock? Until MariaDB gets this fixed, turning off InnoDB persistent stats may be a workaround to keep your server alive. |
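A sketch of the workaround described above, disabling InnoDB persistent statistics. The SET GLOBAL statement and config key are standard MariaDB variables, but the client invocation and credential setup (~/.my.cnf) are assumptions about your environment:

```shell
# Sketch of the workaround above: disable InnoDB persistent statistics.
# Assumption: a local server is reachable with credentials from ~/.my.cnf;
# adjust the client invocation for your setup.
applied=no
if command -v mysql >/dev/null 2>&1; then
    mysql -e "SET GLOBAL innodb_stats_persistent = OFF;" && applied=yes
fi
echo "applied=$applied"
# To keep the setting across restarts, also add it to the server config:
#   [mysqld]
#   innodb_stats_persistent = 0
```

Note that SET GLOBAL does not survive a restart, so the config-file entry is needed if the workaround should persist.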
| Comment by Leonard Sinquer [ 2021-01-02 ] |
|
We are having exactly the same issues on our multi-master set-up: restarts happening daily on random masters/slaves (all running 10.4.17). We previously used 10.4.2 and this never happened. I will try the suggestion by @mikeininfo and will let you know the results. Thanks!
mysql-error.log snippet:
Checking stack trace:
|
| Comment by Mike Ingle [ 2021-01-03 ] |
|
@shalogo Did that happen before or after you made the change? I have not seen it again since the change. Trying to determine if this is the only place it happens. |
| Comment by Leonard Sinquer [ 2021-01-03 ] |
|
@Mike, not sure what change you mean? The semaphore waits started immediately after we upgraded to 10.4.17, at least once every 24 hours since the upgrade: sometimes after 4 hours, sometimes after a full day. In the end, I was forced to roll back to 10.4.15 (which has its own issues, like innodb_io_capacity being interpreted in a weird way), but at least the semaphore waits are not present. I have only a production set-up, so I cannot afford to experiment much, sorry. |
| Comment by Simon Avery [ 2021-01-15 ] |
|
Another sufferer of this reporting in. We have around a dozen MariaDB servers, all running on CentOS 7, which started exhibiting this behaviour only after we upgraded from 10.2 a few months back. The current version is 10.3.27-MariaDB; we started at 10.3.14 when we moved up.
It happens under heavy load: sometimes under high concurrency, sometimes with a single active connection. Resources are fine: lots of free RAM, adequate CPU load, fast disk and network. Nothing near capacity. Some slow queries are logged (default timeouts) but nothing terrible. There's nothing unusual in the work being asked of the databases; similar queries, data and workload worked fine before this upgrade.
We have 10.3.x running on CentOS 8 servers too. I can't say that it doesn't crash on those, only that the ones I'm having the most problems with are on C7. Have been chasing this for a while, so relieved to see we are not the only ones.
Jan 15 00:21:10 serverA maria 2021-01-15 0:21:05 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
Partial other logs: maria-sem.txt |
| Comment by Jacob Williams [ 2021-01-15 ] |
|
I have not been able to dedicate the time to get the stack trace, but I still plan to. We downgraded our servers to 10.4.13, which was the last version that was stable for us. It has been stable for several weeks now on the older version.
Versions 10.4.14 through 10.4.17 each had slightly different symptoms, but we saw 1-2 crashes per week over 8 servers, with significant locking issues preceding the crashes. Prior to 10.4.17 the crashes did not leave much in the logs: either nothing or a generic crash output with no stack trace. Starting with .17 the crashes were always preceded by the semaphore warnings, although sometimes the semaphore warnings resolved without a crash.
When I get a chance, I will upgrade our staging environment again and set it up to get a proper stack trace. I still don't know how to trigger the problem at will in our environment. It happens often enough to cause problems, but infrequently enough to make it hard to catch. And it happens much less frequently in staging, since we don't maintain nearly as much load as production. |
| Comment by Santiago Gala [ 2021-01-19 ] |
|
We have experienced the same crash twice. We have a server dealing with an archive: a big database where records are inserted continuously. While this insertion process is going on, periodic processes copy those records to new ones on a backup volume and delete the originals, and users query the same table. We suffered the problem twice in 2 weeks, soon after starting the first "export" batch process following the upgrade from 10.4.13.
The table that holds files tracks about 400 million rows; up to 10 rows per second are added during normal operation, and about 20 per second while files are being sent to long-term storage. Previous results with 10.4.13 were quite satisfactory except that we experienced |
| Comment by Anton Rozhkov [ 2021-01-21 ] |
|
We have experienced the same issue after upgrading to 10.5.8 (in the 10.5.x line) and 10.4.17 (in the 10.4.x line). The issue appears periodically under normal workload, and it is difficult to distinguish any difference in workload patterns between periods when the problem is absent and periods when we have a lock. We have the same issue on different servers with different workloads; on some it may appear several times a day, on others once a week.
The difference from the cases described here is that we don't get crashes on "Semaphore wait > 600 seconds"; in our case all DB insert/update operations are locked for some time (from several seconds to 2-3 minutes), and then the locks are released and DB write operations unfreeze.
10.5.8:
10.4.17:
We have tried compiling binaries with cherry-picked patches for |
| Comment by Zimeon [ 2021-01-31 ] |
|
We seem to have the same issue after upgrading to 10.2.36-1.el7.centos from MariaDB-server-10.2.31-1.el7.centos.
==================================================
InnoDB: ###### Diagnostic info printed to the standard error stream
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help.
Server version: 10.2.36-MariaDB
Thread pointer: 0x0
I'll have to roll back to the previously installed version. |
| Comment by Marko Mäkelä [ 2021-01-31 ] |
|
Releases with the fix of
But I see that rantal mentioned that the fix would not help. Would it be possible to produce a stack trace of all threads during the hang, like I asked on December 9? |
| Comment by Zimeon [ 2021-02-01 ] |
|
Unfortunately I won't be able to test; I had to downgrade in the hope it won't crash. It's been stable now for about 16 hours, and I hope it stays that way. 10.2.34 was the oldest in the repository for a quick replacement.
---> Package MariaDB-common.x86_64 0:10.2.34-1.el7.centos will be a downgrade |
| Comment by Sergey Kuzmichev [ 2021-02-02 ] |
|
I've seen the performance-impact flavor of this issue (as described by Anton Rozhkov here) on a couple of different occasions. Both were 10.4.17, and both have persistent stats and recalc enabled. One of the environments experienced semaphore waits (~250 seconds long) under normal write load (which is, however, heavily insert-based). The other environment was a new replica build process, importing a relatively large database.
The latter environment showed a peculiar alternating load. Roughly every 30 minutes it would go from all write load (as expected for the dump restoration) to getting stuck on semaphore waits while reading in pages, not progressing writes at all, then back again. Disabling stats recalc makes the load predictably write-heavy 100% of the time during restore. The buffer pool is 32G and the largest table is ~50G. The semaphore waits faced are a perfect copy of what this and other comments show: https://jira.mariadb.org/browse/MDEV-24378?focusedCommentId=177911&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-177911
What I have in terms of gathered data are some perf outputs and flamegraphs generated on the "dump-restore" environment during the write and read loads. I've been checking this locally, but either the size of the table is still too small, or IO/performance is more favorable: I get some amount of semaphore waits, but none longer than ~30 seconds. The write thread is in the same state as seen in oc_1_threads.txt
Will attach SEIS output (SEIS.dict0stats.out |
| Comment by Marko Mäkelä [ 2021-02-02 ] |
|
sergey.kuzmichev, a quick look at mariadb_10417.37345.3.gdb.bt
However, I see no evidence of a deadlock, livelock, or busy loop in this case. It definitely does not look like |
| Comment by Marko Mäkelä [ 2021-02-15 ] |
|
sergey.kuzmichev, I just realized that mariadb_10417.37345.3.gdb.bt |
| Comment by Marko Mäkelä [ 2021-02-26 ] |
|
Did anyone have a chance to test 10.4.18, which was released this week? That should fix both |
| Comment by Leonard Sinquer [ 2021-02-26 ] |
|
@Marko - I will install it next week if I do not see any issues in this thread. I have only production servers, so it's always high risk, low reward for me. Currently on 10.4.15 and it's been stable. |
| Comment by Mario Hubert [ 2021-02-26 ] |
|
I have been using 10.5.8 ... the problem has been resolved after upgrading to 10.5.9. |
| Comment by Marko Mäkelä [ 2021-03-05 ] |
|
Thank you, mhubert. jacob.williams, shalogo, sergey.kuzmichev, did any of you try the newer releases? Can we close this ticket? |
| Comment by Jacob Williams [ 2021-10-29 ] |
|
Apologies for not getting back on this. We had downgraded to 10.4.13 to avoid crashes and avoided upgrading again for a while. We finally updated to 10.4.21 last weekend and started crashing again yesterday. I am working on a stack trace now from the core dump we took.
The error log message was not very detailed, so I can't say it is the exact same issue, but the symptoms were otherwise identical, and it occurred twice within 24 hours, starting about 5 days after we upgraded from 10.4.13 to 10.4.21. I don't see the SEMAPHORE messages in the logs, only a generic crash report. But the symptoms are that several queries hang for long periods, clients cannot start new sessions, and some existing sessions may continue to run queries for a time. The server process does not respond to a normal shutdown request from systemd; kill -9 is required to stop it, and then it starts back up fine. This behavior is the same as what we usually saw when we first updated to 10.4.17.
We are going to downgrade back to 10.4.13 to avoid more crashes. I will attach the stack trace as soon as I have ensured it doesn't contain any sensitive data. |
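Extracting a per-thread backtrace from a saved core dump can be sketched as follows; the binary and core file paths are hypothetical placeholders, not the actual paths from this environment:

```shell
# Sketch: dump a backtrace of every thread from a saved core file.
# Assumption: paths below are placeholders; point gdb at the actual server
# binary and the core dump taken during the hang.
bin=/usr/sbin/mariadbd
core=core.mariadbd
if [ -f "$core" ] && command -v gdb >/dev/null 2>&1; then
    gdb --batch "$bin" "$core" \
        -ex 'set pagination off' \
        -ex 'thread apply all backtrace' > core_threads.txt 2>&1
    echo "backtrace written to core_threads.txt"
else
    echo "core file or gdb not available"
fi
```

Working from a core file avoids pausing a live server and allows the output to be reviewed for sensitive data before attaching it to the ticket.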
| Comment by Jacob Williams [ 2021-10-29 ] |
| Comment by Marko Mäkelä [ 2021-10-31 ] |
|
I do not see any InnoDB hang in gdb-thread-apply-all-bt.txt.gz. I see that some InnoDB threads are waiting for something inside malloc (apparently in the jemalloc library). If you find evidence of a hang, please file a new ticket and identify the supposedly hung threads.