Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 10.11.7, 11.3.2
- Fix Version/s: None
- Component/s: None
- Labels: None
- Environment: Debian 12 on different hardware, on VMware ESX, and QEMU
Description
This issue has been analyzed by Marko Mäkelä, and for now it looks like it might be a bug in the kernel rather than in MariaDB. Nevertheless, here is a formal report of the problem, in case someone else ever encounters it in the wild.
Rapidly changing the system time (the real-time clock) while running a bunch of INSERT queries deadlocks MariaDB.
If you run the attached script on a fresh install of (in my case) Debian 12 with MariaDB installed, the DB process will sooner or later deadlock, usually after around 10 minutes. You'll see a bunch of hanging INSERT queries in `SHOW PROCESSLIST`, and any other query touching the involved table will also hang forever.
I have reproduced this with MariaDB 10.11.6 as shipped by Debian, as well as with 10.11.7 and 11.3 from mariadb.org's apt repositories for Debian 12. I also installed kernel 6.8 (built by the Ubuntu kernel team) and was able to reproduce the hang there. Setting `innodb_use_native_aio=0` didn't help either. Testing with `libaio` hasn't been done yet, as that would require building from source.
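For anyone re-testing, the AIO setting mentioned above can be applied on Debian roughly like this (a sketch; the drop-in file name is arbitrary):
```sh
# Disable InnoDB's native asynchronous I/O, then restart the server.
# The hang still reproduced with this setting in place.
printf '[mysqld]\ninnodb_use_native_aio = 0\n' \
  | sudo tee /etc/mysql/mariadb.conf.d/99-no-native-aio.cnf
sudo systemctl restart mariadb
```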
This happens on bare-metal hardware as well as on VMware and QEMU VMs, on both Intel and AMD CPUs. So far it reproduces most reliably with at least 4 CPU cores; with one or two cores I had little luck getting it to trigger.
If you run the attached test script, please make sure to use a disposable test environment, so that the script can be run as root and is able to mess with the system time.
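To make the description concrete, here is a minimal sketch of the kind of workload involved. This is not the attached script, and all database/table names here are invented:
```sh
#!/bin/bash
# Sketch of the reproduction idea: concurrent INSERTs while the real-time
# clock is yanked back and forth many times per second.
# Run ONLY in a disposable VM, as root.
mariadb -e "CREATE DATABASE IF NOT EXISTS clocktest"
mariadb -e "CREATE TABLE IF NOT EXISTS clocktest.t
            (id INT AUTO_INCREMENT PRIMARY KEY, v TEXT) ENGINE=InnoDB"

# A handful of INSERT workers (4+ CPU cores seem to be needed, see above).
for i in 1 2 3 4 5 6 7 8; do
  while :; do
    mariadb -e "INSERT INTO clocktest.t (v) VALUES (REPEAT('x', 1024))"
  done &
done

# Step the clock back and forth roughly 100 times a second.
while :; do
  date -s '65 seconds ago' >/dev/null
  sleep 0.01
  date -s '65 seconds' >/dev/null
  sleep 0.01
done
```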
For the curious: "Why would you even change your system clock 100 times a second?" This happened to us in production on one server. It turned out that server was running on an ESX node that accidentally had NTP disabled, while automatic time synchronization with the guest via open-vm-tools was enabled. In addition, systemd-timesyncd was installed in the guest and set to use NTP. The ESX host's clock had drifted by 65 seconds over time, so vmware-tools and timesyncd were constantly fighting over the guest's system clock.
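If you suspect the same misconfiguration, the two competing time sources can be checked from inside the guest roughly like this (a sketch; assumes open-vm-tools and systemd-timesyncd are installed):
```sh
# Host-to-guest time sync via VMware Tools; "Enabled" means the host
# is allowed to step the guest clock.
vmware-toolbox-cmd timesync status

# Whether systemd-timesyncd is also steering the clock via NTP.
timedatectl status
timedatectl timesync-status
```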
Attachments
Issue Links
- relates to: MDEV-34641 MariaDB server getting stuck when ntp server switch time in midst of start up (Open)
For reference, here's a cut-together version of what Marko found when debugging an early version of this hang:
```
grep -A3 syscall btfull.txt | grep ^#1 | sort -u
#1 0x0000558bfcb0bbb8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fa72fc04438) at ./storage/innobase/sync/srw_lock.cc:243
#1 0x0000558bfcb0bbb8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fa72fc04bb8) at ./storage/innobase/sync/srw_lock.cc:243
#1 0x0000558bfcb0bbb8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fa734b1bdd8) at ./storage/innobase/sync/srw_lock.cc:243
#1 0x0000558bfcb0bbb8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fa7487552b8) at ./storage/innobase/sync/srw_lock.cc:243
```
The first one is being waited for by Thread 60, while the block (at address offset -0x18) is held by Thread 15.
Thread 15 is waiting for undo page 57, which could be held by Thread 60.
No, I do not see block 0x7fa734b1bdc0 anywhere else.
The block is write-fixed (WRITE_FIX+1), but apparently no writes are being scheduled:
```
(gdb) set print static-members off
(gdb) p *(buf_page_t*)0x7fa734b1bdc0
$2 = {id_ = {m_id = 57}, hash = 0x7fa72c2037a0, oldest_modification_ = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 1504877260}, <No data fields>}}, lock = {lock = {writer = {
        lock = {<std::__atomic_base<unsigned int>> = {_M_i = 2147483650}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 1}, <No data fields>}}, recursive = 65536,
    writer = {<std::__atomic_base<unsigned long>> = {_M_i = 18446744073709551615}, <No data fields>}}, frame = 0x7fa734b78000 "", zip = {data = 0x0, m_end = 0, m_nonempty = 0, n_blobs = 0, ssize = 0, fix = {
      m = {<std::__atomic_base<unsigned int>> = {_M_i = 2684354561}, <No data fields>}}}, list = {prev = 0x7fa72fa02580, next = 0x7fa72ee01860}, LRU = {prev = 0x7fa72fa02580, next = 0x7fa72fa014a0}, old = 0,
  freed_page_clock = 12424, access_time = {m_counter = {<std::__atomic_base<unsigned int>> = {_M_i = 435568230}, <No data fields>}}}
```
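As a cross-check on the WRITE_FIX+1 reading above: the `fix` word in the dump is 2684354561 = 0xA0000001, i.e. `WRITE_FIX` (0xA0000000) plus a buffer-fix count of 1. A quick shell verification (treating the top three bits as the state, which is an assumption based on this encoding):
```sh
printf '0x%08X\n' 2684354561                            # 0xA0000001
printf 'state=0x%08X\n' $(( 2684354561 & 0xE0000000 ))  # 0xA0000000 = WRITE_FIX
printf 'fix_count=%d\n'  $(( 2684354561 & 0x1FFFFFFF )) # 1
```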
```
(gdb) p write_slots.m_cache.m_pos
$3 = 1022
```
This subsystem is owned by @Vladislav Vaintroub. We have observed what I thought were "fake hangs" when using rr record. Maybe they are real hangs after all.
I’ll try to dump the contents of write_slots, but I need to figure out how to get the relevant information.
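(One way to pull that information out of the hung process, as a sketch only; `write_slots.m_cache.m_pos` is known to work from the session above, but dumping the whole `m_cache` object is an assumption about what gdb can usefully show here:)
```sh
# Attach to the hung server and dump the pending-write bookkeeping.
# Printing all of m_cache is speculative and may be very verbose.
gdb -p "$(pidof mariadbd)" -batch \
    -ex 'p write_slots.m_cache.m_pos' \
    -ex 'p write_slots.m_cache'
```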
I see that Thread 4 (the buf_flush_page_cleaner) is blocked too, so no new page writes can be submitted.
The write_slots cache is full (all 1024 slots are in use).
One more thing that I could check is which block buf_flush_page_cleaner is about to write. I don't quite understand how we can initiate a write fix while the block is not in write_slots. But 1024 others are there, blocking progress.
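(For completeness, a sketch of how that check could be done in the same debugging session; thread number 4 is taken from the comment above and will differ between runs:)
```sh
# Look at what buf_flush_page_cleaner (Thread 4 above) was doing when it
# blocked; the block it is about to write should appear among the local
# variables of the buf_flush_* frames in this backtrace.
gdb -p "$(pidof mariadbd)" -batch -ex 'thread 4' -ex 'bt full'
```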