[MDEV-29349] I/O from MariaDB causes FIFREEZE ioctl system call to hang on NVME devices - Jira

XML

Word

Printable

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6.8
Fix Version/s: 10.6
Component/s: Storage Engine - InnoDB
Labels:
None
Environment:

Hide
nspawn container running mariadb.org binary distribution 10.6.8-MariaDB-1:10.6.8+maria~bullseye

$ cat /etc/debian_version
11.4

"Host" system is running:

$ cat /etc/debian_version
bookworm/sid

$ uname -r
5.18.0-4-amd64

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7F52 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2500.000
CPU max MHz: 3500.0000
CPU min MHz: 2500.0000
BogoMIPS: 6987.46
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pd
                                 pe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4
                                 _1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefe
                                 tch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibp
                                 b stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_l
                                 lc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc
                                 _scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_reco
                                 v succor smca sev sev_es

$ nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2
/dev/nvme1n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2
/dev/nvme2n1 S4C8NA0NC0XXXX SAMSUNG MZWLL6T4HMLA-00005 1 6.40 TB / 6.40 TB 512 B + 0 B GPJA3B3Q

Show
nspawn container running mariadb.org binary distribution 10.6.8-MariaDB-1:10.6.8+maria~bullseye $ cat /etc/debian_version 11.4 "Host" system is running: $ cat /etc/debian_version bookworm/sid $ uname -r 5.18.0-4-amd64 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 43 bits physical, 48 bits virtual CPU(s): 64 On-line CPU(s) list: 0-63 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7F52 16-Core Processor Stepping: 0 Frequency boost: enabled CPU MHz: 2500.000 CPU max MHz: 3500.0000 CPU min MHz: 2500.0000 BogoMIPS: 6987.46 Virtualization: AMD-V L1d cache: 1 MiB L1i cache: 1 MiB L2 cache: 16 MiB L3 cache: 512 MiB NUMA node0 CPU(s): 0-15,32-47 NUMA node1 CPU(s): 16-31,48-63 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pd                                  pe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4                                  _1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefe                                  tch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibp                                  b stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_l                                  lc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc                                  _scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_reco                                  v succor smca sev sev_es $ nvme list Node SN Model Namespace Usage Format FW Rev --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme0n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2 /dev/nvme1n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2 /dev/nvme2n1 S4C8NA0NC0XXXX SAMSUNG MZWLL6T4HMLA-00005 1 6.40 TB / 6.40 TB 512 B + 0 B GPJA3B3Q

Description

Hi,

while trying to backup a Dell R7525 system (AMD EPYC 7002 Serie, codename "Rome") using LVM snapshots I noticed that the system will 'freeze' sometimes (not all the times) when creating the snapshot.

First I thought this was related to LVM so I created

https://listman.redhat.com/archives/linux-lvm/2022-July/026228.html
(continued at
https://listman.redhat.com/archives/linux-lvm/2022-August/thread.html#26229)

Long story short:

I was even able to reproduce with fsfreeze, see last strace lines

[...]

14471 1659449870.984635 openat(AT_FDCWD, "/var/lib/machines", O_RDONLY) = 3

14471 1659449870.984658 newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0

14471 1659449870.984678 ioctl(3, FIFREEZE

so I started to bisect kernel and found the following bad commit:

md: add support for REQ_NOWAIT

commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support

for checking whether a given bdev supports handling of REQ_NOWAIT or not.

Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable

it for linear target") added support for REQ_NOWAIT for dm. This uses

a similar approach to incorporate REQ_NOWAIT for md based bios.

This patch was tested using t/io_uring tool within FIO. A nvme drive

was partitioned into 2 partitions and a simple raid 0 configuration

/dev/md0 was created.

md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]

      937423872 blocks super 1.2 512k chunks

Before patch:

$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100

Running top while the above runs:

$ ps -eL | grep $(pidof io_uring)

  38396   38396 pts/2    00:00:00 io_uring

  38396   38397 pts/2    00:00:15 io_uring

  38396   38398 pts/2    00:00:13 iou-wrk-38397

We can see iou-wrk-38397 io worker thread created which gets created

when io_uring sees that the underlying device (/dev/md0 in this case)

doesn't support nowait.

After patch:

$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100

Running top while the above runs:

$ ps -eL | grep $(pidof io_uring)

  38341   38341 pts/2    00:10:22 io_uring

  38341   38342 pts/2    00:10:37 io_uring

After running this patch, we don't see any io worker thread

being created which indicated that io_uring saw that the

underlying device does support nowait. This is the exact behaviour

noticed on a dm device which also supports nowait.

For all the other raid personalities except raid0, we would need

to train pieces which involves make_request fn in order for them

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f51d46d0e7cb5b8494aa534d276a9d8915a2443d

After reverting this commit (and follow up commit 0f9650bd838efe5c52f7e5f40c3204ad59f1964d) v5.18.15 and v5.19 worked for me again.

However, I am seeing the same problem when using the NVME device directly, i.e. when no mdraid is involved.

After I reported this upstream to kernel mailing list, I was asked to run bisect again against the single NVME device. I tried that but I am failing: Bisect will always end with

first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf

...but this doesn't make any sense, right?

Latest kernel is working for me when I just do

diff --git a/Makefile b/Makefile

index 23162e2bdf14..0f344944d828 100644

--- a/Makefile

+++ b/Makefile

@@ -1,7 +1,7 @@

 # SPDX-License-Identifier: GPL-2.0

 VERSION = 5

-PATCHLEVEL = 18

-SUBLEVEL = 18

+PATCHLEVEL = 15

+SUBLEVEL = 0

 EXTRAVERSION =

 NAME = Superb Owl

For some reason, SUBLEVEL = 99 causes the failure again even for 5.15...

I am currently out of ideas. I was asked to find a different reproducer because maybe mysqld is doing something depending on $KV but I am unable to reproduce with fio yet.

However, using MariaDB will always trigger the problem for:

Do a clean boot (=not recovering from crash).
With mysqld running but without any I/O yet, fsfreeze will work.
After restoring ~150mb SQL file, fsfreeze will suddenly hang after FIFREEZE ioctl system call
Reset the system (power cycle is required)
When the system comes back, mysqld will do recovery (InnoDB: Starting final batch to recover 16658 pages from redo log.) – this is enough to trigger the problem again. I.e. without any additional I/O, fsfreeze would already hang.

The problem also occurs when I stop mysqld before doing the FIFREEZE ioctl system call.

I hope someone has an idea or can help me creating a reproducer not depending on mysqld.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Thomas Deutschmann

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2022-08-22 15:17

Updated:: 2023-09-19 04:25

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.