[MDEV-30767] 10.6.12 regression - InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch Created: 2023-03-01  Updated: 2023-06-27  Resolved: 2023-06-27

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.6.12
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: James Reno
Assignee: Marko Mäkelä
Resolution: Incomplete
Votes: 0
Labels: None


 Description   

I have transferred this bug from the Ubuntu Bug Tracker:
https://bugs.launchpad.net/ubuntu/+source/mariadb-10.6/+bug/2008718

Ubuntu-Release: (ubuntu-jammy, 22.04, x86_64/amd64; Kernel: 5.15.0-60-generic)
Affected Package: mariadb-server-10.6 = 10.6.12-0ubuntu0.22.04.1

It looks like the most recent update to mariadb-server-10.6 (https://bugs.launchpad.net/ubuntu/+source/mariadb-10.3/+bug/2006882) may have introduced a regression causing lockups as a result of work completed under one of the following:

MDEV-24911 Missing warning before [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.mutex

MDEV-24258 Merge dict_sys.mutex into dict_sys.latch

MDEV-26827 Make page flushing even faster

We have a large Zabbix installation with a history_uint table of more than 398GB, and ever since this upgrade the MariaDB server has been locking up within 2-12 hours with the following error:

[ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch

Downgrading to a previous package version (10.6.11 or 10.6.7) resolves the lockup, and our platform remains stable.

I have not been able to get a proper crash dump because the MariaDB server does not crash; it simply hangs, and sometimes the watchdog restarts it automatically (though not for several hours).
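To pin down when each hang occurred, the error log can be searched for the fatal message quoted above. The following is only a minimal sketch: the function name is hypothetical, and the log path in the usage line is an assumption that depends on how logging is configured on the host.

```shell
#!/bin/sh
# Sketch: count occurrences of the fatal dict_sys.latch message in a log file.
# count_fatal_latch_waits is a hypothetical helper name, not part of MariaDB.
# Prints the number of matching lines; the exit status is that of grep
# (non-zero when nothing matched or the file could not be read).
count_fatal_latch_waits() {
    grep -c -F -- 'innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch' "$1"
}
```

Usage, assuming the error log is written to /var/log/mysql/error.log: `count_fatal_latch_waits /var/log/mysql/error.log`.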



 Comments   
Comment by Marko Mäkelä [ 2023-03-02 ]

The prerequisite to fixing this is a stack trace of all threads, taken during the hang. For that to be useful, you must install the -dbgsym package of the MariaDB server. Then, execute something like the following:

sudo gdb -ex 'set height 0' -ex 'thread apply all backtrace' -ex 'quit' $(pgrep -x mariadbd)
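Because the hang may occur when nobody is at the terminal, it can help to wrap the command above so that the output is saved to a file. This is a rough sketch rather than a recommended procedure: the function names and the default output directory are hypothetical, it assumes a single mariadbd process, and it uses gdb's `--batch` and `-p` options instead of the interactive invocation.

```shell
#!/bin/sh
# Sketch: save backtraces of all mariadbd threads to a timestamped file.
# All function names and the default directory (/tmp) are assumptions.

# Build the output file name from a directory, a PID and a timestamp.
backtrace_file() {
    printf '%s/mariadbd-%s-%s.txt\n' "$1" "$2" "$3"
}

# Attach gdb in batch mode to the given PID and print all thread backtraces.
dump_backtraces() {
    gdb --batch -p "$1" -ex 'set height 0' -ex 'thread apply all backtrace' 2>&1
}

# Find the running mariadbd (assumed to be the only one) and write its
# backtraces to a new file; prints the file name on success.
capture_hang() {
    out_dir=${1:-/tmp}
    pid=$(pgrep -x mariadbd) || { echo 'mariadbd is not running' >&2; return 1; }
    out=$(backtrace_file "$out_dir" "$pid" "$(date -u +%Y%m%dT%H%M%SZ)")
    dump_backtraces "$pid" > "$out" && echo "$out"
}
```

Run as root during the hang, for example by sourcing the file and calling `capture_hang /var/tmp`. The backtraces are only useful if the -dbgsym package is installed.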

One example of a recently fixed hang is MDEV-30638. MariaDB 10.6.12 fixed many hangs in MDEV-30400, but some cases of MDEV-29835 still remain.

Comment by Jacob Williams [ 2023-03-16 ]

I didn't see this before I reported my similar issue, MDEV-30864.

I produced a full stack trace, but I think it's just the watchdog. I have a core dump, but I can't share it because it contains sensitive data and is 17G.

Mine don't happen as often: I have seen it 4 times in production across 12 servers, and only one server had it happen twice. Those 3 servers are the biggest ones, though. I do have a few 100+GB tables, but they are partitioned. The biggest unpartitioned tables are ~80GB.

Comment by Jacob Williams [ 2023-03-16 ]

I'll add that I have not yet tried downgrading to 10.9.4. I was on 10.4.x before upgrading to 10.9.5 and seeing the crash, but 10.9.5 looks like it was released on the same day as 10.6.12. If I downgrade, I won't know for a few weeks whether it's better, since the hangs are so infrequent. I haven't found a way to trigger it.

Comment by Marko Mäkelä [ 2023-05-26 ]

Unfortunately, the 2023Q2 releases (10.6.13, 10.8.8, 10.9.6, 10.10.4, 10.11.3), which fix MDEV-29835, are affected by a nastier bug MDEV-31234. Therefore, I cannot recommend upgrading to them. That bug should be fixed in the upcoming 11.0.2 and 11.1.1 releases.
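To check whether a given server version is one of the releases named in the comment above, something like the following could be used. It is a minimal sketch: the function name is hypothetical, and it assumes the version string begins with the plain release number, optionally followed by a `-suffix` as in the output of `SELECT VERSION()`.

```shell
#!/bin/sh
# Sketch: succeed if the version is one of the 2023Q2 releases listed above
# as affected by MDEV-31234. The function name is a hypothetical helper.
is_affected_by_mdev_31234() {
    v=${1%%-*}   # strip any suffix such as "-MariaDB-..." after the release number
    case "$v" in
        10.6.13|10.8.8|10.9.6|10.10.4|10.11.3) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage, assuming the `mariadb` client can connect with default credentials: `is_affected_by_mdev_31234 "$(mariadb -N -B -e 'SELECT VERSION()')" && echo affected`.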

Comment by Marko Mäkelä [ 2023-06-27 ]

There was an unscheduled release of 10.6.14, 10.9.7, … that included a fix of MDEV-31234.

Generated at Thu Feb 08 10:18:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.