[MDEV-29349] I/O from MariaDB causes FIFREEZE ioctl system call to hang on NVME devices Created: 2022-08-22  Updated: 2023-09-19

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.8
Fix Version/s: 10.6

Type: Bug Priority: Major
Reporter: Thomas Deutschmann Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

nspawn container running mariadb.org binary distribution 10.6.8-MariaDB-1:10.6.8+maria~bullseye

$ cat /etc/debian_version
11.4

"Host" system is running:

$ cat /etc/debian_version
bookworm/sid

$ uname -r
5.18.0-4-amd64

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7F52 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2500.000
CPU max MHz: 3500.0000
CPU min MHz: 2500.0000
BogoMIPS: 6987.46
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pd
pe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4
_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefe
tch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibp
b stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_l
lc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc
_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_reco
v succor smca sev sev_es

$ nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2
/dev/nvme1n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2
/dev/nvme2n1 S4C8NA0NC0XXXX SAMSUNG MZWLL6T4HMLA-00005 1 6.40 TB / 6.40 TB 512 B + 0 B GPJA3B3Q



 Description   

Hi,

while trying to backup a Dell R7525 system (AMD EPYC 7002 Serie, codename "Rome") using LVM snapshots I noticed that the system will 'freeze' sometimes (not all the times) when creating the snapshot.

First I thought this was related to LVM so I created

https://listman.redhat.com/archives/linux-lvm/2022-July/026228.html
(continued at
https://listman.redhat.com/archives/linux-lvm/2022-August/thread.html#26229)

Long story short:

I was even able to reproduce with fsfreeze, see last strace lines

[...]
14471 1659449870.984635 openat(AT_FDCWD, "/var/lib/machines", O_RDONLY) = 3
14471 1659449870.984658 newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
14471 1659449870.984678 ioctl(3, FIFREEZE

so I started to bisect kernel and found the following bad commit:

md: add support for REQ_NOWAIT
 
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support
for checking whether a given bdev supports handling of REQ_NOWAIT or not.
Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable
it for linear target") added support for REQ_NOWAIT for dm. This uses
a similar approach to incorporate REQ_NOWAIT for md based bios.
 
This patch was tested using t/io_uring tool within FIO. A nvme drive
was partitioned into 2 partitions and a simple raid 0 configuration
/dev/md0 was created.
 
md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
      937423872 blocks super 1.2 512k chunks
 
Before patch:
 
$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
 
Running top while the above runs:
 
$ ps -eL | grep $(pidof io_uring)
 
  38396   38396 pts/2    00:00:00 io_uring
  38396   38397 pts/2    00:00:15 io_uring
  38396   38398 pts/2    00:00:13 iou-wrk-38397
 
We can see iou-wrk-38397 io worker thread created which gets created
when io_uring sees that the underlying device (/dev/md0 in this case)
doesn't support nowait.
 
After patch:
 
$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
 
Running top while the above runs:
 
$ ps -eL | grep $(pidof io_uring)
 
  38341   38341 pts/2    00:10:22 io_uring
  38341   38342 pts/2    00:10:37 io_uring
 
After running this patch, we don't see any io worker thread
being created which indicated that io_uring saw that the
underlying device does support nowait. This is the exact behaviour
noticed on a dm device which also supports nowait.
 
For all the other raid personalities except raid0, we would need
to train pieces which involves make_request fn in order for them

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f51d46d0e7cb5b8494aa534d276a9d8915a2443d

After reverting this commit (and follow up commit 0f9650bd838efe5c52f7e5f40c3204ad59f1964d) v5.18.15 and v5.19 worked for me again.

However, I am seeing the same problem when using the NVME device directly, i.e. when no mdraid is involved.

After I reported this upstream to kernel mailing list, I was asked to run bisect again against the single NVME device. I tried that but I am failing: Bisect will always end with

first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf

...but this doesn't make any sense, right?

Latest kernel is working for me when I just do

diff --git a/Makefile b/Makefile
index 23162e2bdf14..0f344944d828 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 VERSION = 5
-PATCHLEVEL = 18
-SUBLEVEL = 18
+PATCHLEVEL = 15
+SUBLEVEL = 0
 EXTRAVERSION =
 NAME = Superb Owl

For some reason, SUBLEVEL = 99 causes the failure again even for 5.15...

I am currently out of ideas. I was asked to find a different reproducer because maybe mysqld is doing something depending on $KV but I am unable to reproduce with fio yet.

However, using MariaDB will always trigger the problem for:

  1. Do a clean boot (=not recovering from crash).
  2. With mysqld running but without any I/O yet, fsfreeze will work.
  3. After restoring ~150mb SQL file, fsfreeze will suddenly hang after FIFREEZE ioctl system call
  4. Reset the system (power cycle is required)
  5. When the system comes back, mysqld will do recovery (InnoDB: Starting final batch to recover 16658 pages from redo log.) – this is enough to trigger the problem again. I.e. without any additional I/O, fsfreeze would already hang.

The problem also occurs when I stop mysqld before doing the FIFREEZE ioctl system call.

I hope someone has an idea or can help me creating a reproducer not depending on mysqld.



 Comments   
Comment by Marko Mäkelä [ 2022-08-22 ]

I don’t remember us doing any extensive testing with file system snapshots.

Would this be reproducible if you disable the use of O_DIRECT, e.g., innodb_flush_method=fsync to disable the effect of MDEV-24854? Would this reproduce without using io_uring (MDEV-24883), e.g., innodb_use_native_aio=OFF?

Comment by Thomas Deutschmann [ 2022-08-22 ]

Yes, I just found MDEV-26674 which lead me to https://github.com/MariaDB/server/commit/de7db5517de11a58d57d2a41d0bc6f38b6f92dd8 – setting innodb_use_native_aio=OFF will prevent the problem.

This helped me to find https://github.com/axboe/fio/issues/1195 so I now have a working reproducer for fio.

Comment by Daniel Black [ 2022-08-23 ]

FYI I've managed to provide container based reproducers for kernel folks. It allows the entire userspace to be contained to test a kernel with a very small and generic runtime component.

Sorry for the MariaDB runtime check disrupting a productive kernel bisect.

Comment by Thomas Deutschmann [ 2022-08-28 ]

We found the kernel issue and a patch was already merged into mainline: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e053aaf4da56cbf0afb33a0fda4a62188e2c0637

Backport for 5.15 and 5.19 is available but pending, https://www.spinics.net/lists/stable/msg588671.html.

One thing I noticed which maybe could get improved in MariaDB: After finding the bad commits, I reverted them which caused that io_uring was no longer available/working on my system which I noticed when fio failed to use io_uring engine. However, mysqld didn't tell me anything about that: It started normally showing "[Note] InnoDB: Using liburing" in logs but wasn't using asynchronous I/O anymore since the system didn't freeze anymore. Maybe it is possible for MariaDB to detect non-working liburing and log that which could be helpful when you expect that asynchronous I/O is used but isn't for some reasons.

Comment by Daniel Black [ 2022-08-29 ]

There is a message when liburing fails to initialize. It seems this is the only case where aio is disabled. There are some synchronous IO bits of innodb.

Are you after a:

if (!srv_use_native_aio) {
  ib::info() << "using syncIO";
}

Detecting non-working uring, like an end-to-end test of uring after initialization like the case sensitive filesystems test? It wouldn't catch your FIFREEZE case I think, but it might still catch the case of a non-fully implemented fs case. What cases aren't the kernel folks testing is the hard question.

Comment by Thomas Deutschmann [ 2022-09-01 ]

> Detecting non-working uring, like an end-to-end test of uring after initialization like the case sensitive filesystems test?

Yes, something like that.

> It wouldn't catch your FIFREEZE case I think

Right but I think this is impossible. I mean, AIO was working and used, system just failed after that.

I was surprised that fio failed once I "broke" AIO (=io_uring engine wasn't usable anymore for fio) but MariaDB started, told me "Yep, I am using native IO" and run my reproducer without any comment. If I didn't know that AIO was broken due to my changes to the kernel (and to be honest I didn't know that the NOWAIT part was critical and removing would cause AIO to break), I can imagine scenarios where you won't notice and just wonder why system A doesn't behave like system B...

Generated at Thu Feb 08 10:07:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.