[MDEV-30767] 10.6.12 regression - InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch Created: 2023-03-01 Updated: 2023-06-27 Resolved: 2023-06-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server, Storage Engine - InnoDB |
| Affects Version/s: | 10.6.12 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | James Reno | Assignee: | Marko Mäkelä |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Description |
|
I have transferred this bug from the Ubuntu bug tracker (Ubuntu release: ubuntu-jammy, 22.04, x86_64/amd64; kernel: 5.15.0-60-generic). It looks like the most recent update to mariadb-server-10.6 (https://bugs.launchpad.net/ubuntu/+source/mariadb-10.3/+bug/2006882) may have introduced a regression that causes lockups, as a result of work completed under one of the following: MDEV-24911 Missing warning before [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.mutex
We have a large Zabbix installation with a history_uint table of more than 398 GB, and ever since this upgrade the MariaDB server has been locking up within 2 to 12 hours with the following error:
[ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch
Downgrading to the previous package version 10.6.11 or 10.6.7 resolves the lockup, and our platform remains stable. I have not been able to get a proper crash dump, because the MariaDB server does not crash; it simply hangs, and sometimes the watchdog auto-restarts it (though not for several hours). |
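As background, innodb_fatal_semaphore_wait_threshold (default 600 seconds) bounds how long InnoDB tolerates a thread waiting on a latch such as dict_sys.latch; when a periodic check finds a longer wait, the server reports the fatal error above and is meant to terminate itself. The Python below is a simplified model of that idea only, not InnoDB's code: `LatchWaitTracker`, `watchdog_check`, and the strict `>` comparison are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LatchWaitTracker:
    """Hypothetical model of a latch that remembers when each still-pending
    wait began (not InnoDB's real data structure)."""
    pending_since: dict[int, float] = field(default_factory=dict)

    def begin_wait(self, thread_id: int, now: float) -> None:
        # Record only the first start time if a thread is already waiting.
        self.pending_since.setdefault(thread_id, now)

    def end_wait(self, thread_id: int) -> None:
        # The thread acquired the latch (or gave up); forget its wait.
        self.pending_since.pop(thread_id, None)

    def oldest_wait(self, now: float) -> float:
        """Seconds the longest-blocked thread has waited (0.0 if none)."""
        if not self.pending_since:
            return 0.0
        return now - min(self.pending_since.values())


def watchdog_check(tracker: LatchWaitTracker, threshold_s: float,
                   now: float) -> bool:
    """Return True when the oldest wait exceeds the threshold, i.e. when
    a real server would report the fatal semaphore-wait error."""
    return tracker.oldest_wait(now) > threshold_s
```

A monitor would call `watchdog_check(tracker, 600, time.monotonic())` periodically. Only the oldest waiter matters: a single thread stuck for longer than the threshold trips the check even if every other waiter is served promptly.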
| Comments |
| Comment by Marko Mäkelä [ 2023-03-02 ] | |
|
The prerequisite to fixing this is a stack trace of all threads, taken during the hang. For that to be useful, you must install the -dbgsym package of the MariaDB server. Then, execute something like the following:
One example of a recently fixed hang is | |
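For readers who need to collect such a trace, one common approach is to attach gdb in batch mode and run `thread apply all backtrace full`. The wrapper below is a minimal sketch, not the command from the comment above. It assumes that gdb and the -dbgsym package are installed, that it runs with enough privileges to attach (typically root), and that you pass the PID of the hung mariadbd process (for example from `pgrep -x mariadbd`). The script name dump_stacks.py and both helper functions are hypothetical.

```python
import shutil
import subprocess
import sys


def gdb_backtrace_command(pid: int) -> list[str]:
    """Build a gdb command line that attaches to `pid`, prints full
    backtraces of all threads, and exits (detaching) in batch mode."""
    return [
        "gdb",
        "-p", str(pid),
        "-batch",
        "-ex", "set pagination off",
        "-ex", "thread apply all backtrace full",
    ]


def capture_backtraces(pid: int, out_path: str) -> int:
    """Run gdb against the hung server, save its output to out_path,
    and return gdb's exit status."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb is not installed")
    with open(out_path, "w") as out:
        result = subprocess.run(gdb_backtrace_command(pid), stdout=out,
                                stderr=subprocess.STDOUT, check=False)
    return result.returncode


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: dump_stacks.py <mariadbd-pid> <output-file>")
    else:
        capture_backtraces(int(sys.argv[1]), sys.argv[2])
```

Attaching gdb pauses every server thread while the backtraces are printed. That is acceptable here because the server is already hung, but it should not be done casually on a healthy production server.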
| Comment by Jacob Williams [ 2023-03-16 ] | |
|
I didn't see this before I reported my similar issue. I produced a full stack trace, but I think it only shows the watchdog. I have a core dump, but I can't share it because it contains sensitive data and is 17 GB. Mine doesn't happen as often: I have seen it 4 times in production across 12 servers, and only one server had it happen twice. Those 3 servers are the biggest ones, though. I do have a few tables larger than 100 GB, but they are partitioned. The biggest unpartitioned tables are about 80 GB. | |
| Comment by Jacob Williams [ 2023-03-16 ] | |
|
I should add that I have not yet tried downgrading to 10.9.4. I was on 10.4.x before upgrading to 10.9.5 and starting to see the crash; 10.9.5 appears to have been released on the same day as 10.6.12. If I downgrade, I won't know for a few weeks whether it is better, since the hangs are so infrequent. I haven't found a way to trigger it. | |
| Comment by Marko Mäkelä [ 2023-05-26 ] | |
|
Unfortunately, the 2023Q2 releases (10.6.13, 10.8.8, 10.9.6, 10.10.4, 10.11.3), which fix | |
| Comment by Marko Mäkelä [ 2023-06-27 ] | |
|
There was an unscheduled release of 10.6.14, 10.9.7, … that included a fix of |