[MDEV-34641] MariaDB server getting stuck when ntp server switch time in midst of start up - Jira

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.6.16, 10.11.6
Fix Version/s: 10.6, 10.11
Component/s: Server
Labels:
- foundation
Environment:
MLOS is based on CentOS 7.

Linux MLOS-NSM 4.19.245-3.mlos3.x86_64 #1 SMP Wed Jul 19 06:01:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Description

The MariaDB process becomes unresponsive within 1-3 days of starting in an environment where NTP is enabled.

This is seen specifically when the NTP server changes the system time in the middle of MariaDB server startup.

Checking the process list on the DB shows all queries in a stuck state.

Below is the strace output of the MariaDB server process:

The issue is not seen on 10.6.14-MariaDB-log MariaDB Server. The problems started appearing after we upgraded to 10.6.16-MariaDB-log MariaDB Server and 10.11.6-MariaDB-log.

Attachments

- image-2024-07-23-21-09-50-259.png (45 kB, 2024-07-23 15:39)
- image-2024-07-23-21-11-07-279.png (75 kB, 2024-07-23 15:41)
- image-2024-07-23-21-12-30-739.png (587 kB, 2024-07-23 15:42)
- image-2024-07-23-21-14-07-364.png (3 kB, 2024-07-23 15:44)
- mariadb.strace.out (3 kB, 2024-07-24 00:04)
- mariadbd_full_bt_all_threads_29Jun_after_30_sec.txt (745 kB, 2024-07-29 06:23)
- mariadbd_full_bt_all_threads_29Jun.txt (745 kB, 2024-07-29 06:23)
- mariadbd_full_bt_all_threads.txt (745 kB, 2024-07-24 00:04)
- ntp.logs (11 kB, 2024-07-24 00:04)
- processlist.sql (31 kB, 2024-07-24 00:41)

Issue Links

relates to

- MDEV-32861 InnoDB hangs when running out of I/O slots (Closed)
- MDEV-33669 mariabackup --backup hangs (Closed)
- MDEV-33820 Deadlock if system time changes repeatedly during concurrent INSERTs (Open)
- MDEV-33594 Invoking log_free_check() while holding exclusive dictionary latch may block most InnoDB threads for a long time (Confirmed)

Activity

View 7 older comments

Vladislav Vaintroub added a comment - 2024-07-29 08:34

Thread 62 is blocked because all allocated AIO slots are in flight. Every slot is allocated before an asynchronous write and freed in the AIO write completion callback. I do not know whether libaio is to blame, but for lack of a better idea, innodb_use_native_aio=0 would be worth trying.
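
The slot behaviour described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not MariaDB's tpool or InnoDB code: `SlotPool`, `submit_write` and `complete_one` are hypothetical names, and a short timeout stands in for what would be an indefinite wait in the server.

```python
import threading


class SlotPool:
    """Toy fixed-size pool of async I/O slots (hypothetical, not MariaDB code)."""

    def __init__(self, size):
        self._free = threading.Semaphore(size)

    def acquire(self, timeout=None):
        # Blocks while every slot is in flight; returns False on timeout.
        return self._free.acquire(timeout=timeout)

    def release(self):
        # Meant to be called from the write-completion callback.
        self._free.release()


def submit_write(pool, in_flight, page_id, timeout=0.1):
    """Take a slot, then queue the write; the slot stays busy until completion."""
    if not pool.acquire(timeout=timeout):
        return False  # a real submitter would keep waiting here
    in_flight.append(page_id)
    return True


def complete_one(pool, in_flight):
    """Stand-in for the completion callback: finish one write, free its slot."""
    page_id = in_flight.pop(0)
    pool.release()
    return page_id


pool = SlotPool(size=2)
in_flight = []
# Three submissions against two slots, with no completions in between.
results = [submit_write(pool, in_flight, page) for page in (1, 2, 3)]
# One completion frees a slot, so the third write can now be submitted.
complete_one(pool, in_flight)
retry = submit_write(pool, in_flight, 3)
```

If completions never run, every later `submit_write` call in this model fails the same way, which is the shape of the hang seen in thread 62.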

Marko Mäkelä added a comment - 2024-07-29 12:51

The Linux kernel version might be relevant here. I got the impression that Red Hat Enterprise Linux 7 reached its end of life some weeks or months ago. It could be that io_submit or io_getevents on the underlying Linux kernel version gets confused when the system time is being moved. I see that tpool/aio_linux.cc invokes io_getevents without any time parameter. So, if there was a kernel bug, it should be deep inside the kernel, not in the system call interface.

That being said, I remember that ~~MDEV-32861~~ was fixed not too long ago, in MariaDB Server 10.6.17 and 10.11.7. That could be a simple explanation of this hang in 10.6.16 and 10.11.6.

ullasram, can you please test if MariaDB Server 10.6.17 or 10.11.7 would avoid this hang?

Ullas Ramakrishnan added a comment - 2024-07-30 04:57

Thanks @marko for the update. Let me upgrade to the mentioned version and get back to you. Please note that we haven't figured out a way to consistently reproduce the issue. It may take a few days for us to verify that the upgrade works.

Ullas Ramakrishnan added a comment - 2024-08-14 13:14 - edited

@marko The issue is not resolved with the 10.11.7 upgrade. We still see the DB hang on the upgraded setup.

Vladislav Vaintroub added a comment - 2024-08-14 14:02

It looks like it is actually ~~MDEV-33669~~ (despite its description containing "mariabackup", it is not about mariabackup). There are not enough threads to process I/O completion requests, and this condition is "sticky". It is not fixed in 10.11.7, but it is fixed in 10.11.8.
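
For checking a running server against the versions named here, a small helper can compare the reported version string with the first fixed release. This is a minimal sketch: `parse_version` and `has_mdev_33669_fix` are hypothetical names, and the only fix version encoded is the 10.11.8 mentioned in this comment, so other release series return `None` rather than a guess.

```python
def parse_version(text):
    """Turn a string such as '10.11.7-MariaDB-log' into (10, 11, 7)."""
    numeric = text.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


# First 10.11 release reported in this thread to contain the MDEV-33669 fix.
FIRST_FIXED_10_11 = (10, 11, 8)


def has_mdev_33669_fix(version_text):
    """True or False for the 10.11 series; None where this thread gives no data."""
    version = parse_version(version_text)
    if version[:2] != (10, 11):
        return None
    return version >= FIRST_FIXED_10_11


checked = {
    v: has_mdev_33669_fix(v)
    for v in ("10.11.6-MariaDB-log", "10.11.7-MariaDB-log", "10.11.8", "10.6.16-MariaDB-log")
}
```

The version string itself can be obtained from the server with `SELECT VERSION();`.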

People

Assignee: Daniel Black

Reporter: Ullas Ramakrishnan

Votes: 0

Watchers: 6

Dates

Created: 2024-07-23 15:46

Updated: 2025-02-06 07:21
