[MDEV-25955] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung. Created: 2021-06-18 Updated: 2021-10-28 Resolved: 2021-10-28 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server, Storage Engine - InnoDB |
| Affects Version/s: | 10.3.27 |
| Fix Version/s: | 10.2.37, 10.3.28, 10.4.18, 10.5.9 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Vincent Pelletier | Assignee: | Marko Mäkelä |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | crash, innodb | ||
| Environment: |
Linux x86_64 |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Description |
|
I have a setup with a read/write primary and two replicas which are not being used beyond replication monitoring. All 3 crashed at different times within 2 months of an upgrade from 10.2.23 with very similar symptoms: innodb triggering an abort() (signal 6) because of a 600s lock acquisition timeout, which further devolved into a segfault (signal 11) because of a null-pointer dereference during diagnostic log production for the abort. The upgrade for all processes happened simultaneously (all 10.2 stopped, then all 10.3 started, and no process restart until the crashes). I did not witness the crashes happening on the replicates (we only found out once the replication monitoring notified us of them being unreachable), but I just witnessed a crash on the primary, and I could notice that there were two distinct phases, a first long (until I first noticed the issue) lock sequence, which somehow involved an apparent deadlock. In the attached mariadb-error log, this is the period from 2021-06-16 4:56:21 until 2021-06-16 6:20:24. Then thre is a second phase, eventually showing a similar deadlock, and which did lead to the abort(). This is hte period from 2021-06-16 8:11:08 to the abort() at 2021-06-16 8:23:00. The deadlock appears first in the automatically-logged "show engine innodb status" at 2021-06-16 05:48:30. Everytime, the deadlock involve the same thread id (140189330171648), which at some point is visible as "Main thread ID" - Then, during the second phase, the deadlock reappears at in the 2021-06-16 08:19:06 report, and is present until the last report at 2021-06-16 08:22:46. In all cases, the main thread was: Process ID=33661, Main thread ID=140189330171648, state: enforcing dict cache limit |
| Comments |
| Comment by Vincent Pelletier [ 2021-06-18 ] |
|
I am currently in the process of reading all changed from 10.3.27 to 10.3.29 (latest 10.3 as of this writing), which makes me find I do not know how I could confirm if background analyze was going on at the time of the locking issues. In any case, which we are not at a level of daily crashes, we do have weird optimiser behaviour (like electing a 5M rows index scan on one table instead of a 10k row index scan on another table as we would expect) which did seem to take a while to develop after the version change. |
| Comment by Marko Mäkelä [ 2021-06-18 ] |
|
vpelletier_nxd, there was also |
| Comment by Vincent Pelletier [ 2021-06-18 ] |
|
In my case I believe SELECT statements were running just fine up to the crash, while UPDATE/INSERT/REPLACE (and supposedly DELETE) in the affected table at the time of the locking issue were stuck. I guess the most reliable way to force triggering What I can at least confirm is that these tables were visibly affected by In any case, I ran set global innodb_stats_auto_recalc=0; as recommended on the Debian bug report linked to |
| Comment by Vincent Pelletier [ 2021-06-21 ] |
|
I have a test copy of the database, which was runing on 10.3.27. There, I could reproduce the same symptoms as the original crash by creating some modifiation worload and manually running ANALYZE TABLE <a large table>, up to the abort + segfault. Then, I upgraded this database to 10.3.29 (I believe I cannot use .28 as I need mroonga, and I believe there was some breakage on .28), and I could not reproduce the problem with these steps: ANALYZE TABLE <a large table> now finishes in under a second. Given the similarity of the symptoms I could observe before the upgrade, I believe this bug is a duplicate of Sorry for not finding that report before opening this one. |
| Comment by Vincent Pelletier [ 2021-08-02 ] |
|
I believe this report can be closed as a duplicate of I've been running on 10.3.29 for over a month now without further lock acquisition timeout crash, and without the mysql.innodb_table_stats effects. |
| Comment by Marko Mäkelä [ 2021-10-28 ] |
|
Thank you, I am closing this as a duplicate of |