Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.6.8
-
None
-
nspawn container running mariadb.org binary distribution 10.6.8-MariaDB-1:10.6.8+maria~bullseye
$ cat /etc/debian_version
11.4
"Host" system is running:
$ cat /etc/debian_version
bookworm/sid
$ uname -r
5.18.0-4-amd64
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7F52 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2500.000
CPU max MHz: 3500.0000
CPU min MHz: 2500.0000
BogoMIPS: 6987.46
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pd
pe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4
_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefe
tch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibp
b stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_l
lc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc
_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_reco
v succor smca sev sev_es
$ nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2
/dev/nvme1n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2
/dev/nvme2n1 S4C8NA0NC0XXXX SAMSUNG MZWLL6T4HMLA-00005 1 6.40 TB / 6.40 TB 512 B + 0 B GPJA3B3Q
nspawn container running mariadb.org binary distribution 10.6.8-MariaDB-1:10.6.8+maria~bullseye $ cat /etc/debian_version 11.4 "Host" system is running: $ cat /etc/debian_version bookworm/sid $ uname -r 5.18.0-4-amd64 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 43 bits physical, 48 bits virtual CPU(s): 64 On-line CPU(s) list: 0-63 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7F52 16-Core Processor Stepping: 0 Frequency boost: enabled CPU MHz: 2500.000 CPU max MHz: 3500.0000 CPU min MHz: 2500.0000 BogoMIPS: 6987.46 Virtualization: AMD-V L1d cache: 1 MiB L1i cache: 1 MiB L2 cache: 16 MiB L3 cache: 512 MiB NUMA node0 CPU(s): 0-15,32-47 NUMA node1 CPU(s): 16-31,48-63 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pd pe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4 _1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefe tch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibp b stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_l lc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc _scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_reco v succor smca sev sev_es $ nvme list Node SN Model Namespace Usage Format FW Rev --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme0n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2 /dev/nvme1n1 S5CXNA0N60XXXX Dell Express Flash PM1725b 3.2TB SFF 1 3.20 TB / 3.20 TB 512 B + 0 B 1.2.2 /dev/nvme2n1 S4C8NA0NC0XXXX SAMSUNG MZWLL6T4HMLA-00005 1 6.40 TB / 6.40 TB 512 B + 0 B GPJA3B3Q
Description
Hi,
while trying to backup a Dell R7525 system (AMD EPYC 7002 Serie, codename "Rome") using LVM snapshots I noticed that the system will 'freeze' sometimes (not all the times) when creating the snapshot.
First I thought this was related to LVM so I created
https://listman.redhat.com/archives/linux-lvm/2022-July/026228.html
(continued at
https://listman.redhat.com/archives/linux-lvm/2022-August/thread.html#26229)
Long story short:
I was even able to reproduce with fsfreeze, see last strace lines
[...]
|
14471 1659449870.984635 openat(AT_FDCWD, "/var/lib/machines", O_RDONLY) = 3
|
14471 1659449870.984658 newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
|
14471 1659449870.984678 ioctl(3, FIFREEZE
|
so I started to bisect kernel and found the following bad commit:
md: add support for REQ_NOWAIT
|
|
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support
|
for checking whether a given bdev supports handling of REQ_NOWAIT or not.
|
Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable
|
it for linear target") added support for REQ_NOWAIT for dm. This uses
|
a similar approach to incorporate REQ_NOWAIT for md based bios.
|
|
This patch was tested using t/io_uring tool within FIO. A nvme drive
|
was partitioned into 2 partitions and a simple raid 0 configuration
|
/dev/md0 was created.
|
|
md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
|
937423872 blocks super 1.2 512k chunks
|
|
Before patch:
|
|
$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
|
|
Running top while the above runs:
|
|
$ ps -eL | grep $(pidof io_uring)
|
|
38396 38396 pts/2 00:00:00 io_uring
|
38396 38397 pts/2 00:00:15 io_uring
|
38396 38398 pts/2 00:00:13 iou-wrk-38397
|
|
We can see iou-wrk-38397 io worker thread created which gets created
|
when io_uring sees that the underlying device (/dev/md0 in this case)
|
doesn't support nowait.
|
|
After patch:
|
|
$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
|
|
Running top while the above runs:
|
|
$ ps -eL | grep $(pidof io_uring)
|
|
38341 38341 pts/2 00:10:22 io_uring
|
38341 38342 pts/2 00:10:37 io_uring
|
|
After running this patch, we don't see any io worker thread
|
being created which indicated that io_uring saw that the
|
underlying device does support nowait. This is the exact behaviour
|
noticed on a dm device which also supports nowait.
|
|
For all the other raid personalities except raid0, we would need
|
to train pieces which involves make_request fn in order for them
|
After reverting this commit (and follow up commit 0f9650bd838efe5c52f7e5f40c3204ad59f1964d) v5.18.15 and v5.19 worked for me again.
However, I am seeing the same problem when using the NVME device directly, i.e. when no mdraid is involved.
After I reported this upstream to kernel mailing list, I was asked to run bisect again against the single NVME device. I tried that but I am failing: Bisect will always end with
first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1
|
...but this doesn't make any sense, right?
Latest kernel is working for me when I just do
diff --git a/Makefile b/Makefile
|
index 23162e2bdf14..0f344944d828 100644
|
--- a/Makefile
|
+++ b/Makefile
|
@@ -1,7 +1,7 @@
|
# SPDX-License-Identifier: GPL-2.0
|
VERSION = 5
|
-PATCHLEVEL = 18
|
-SUBLEVEL = 18
|
+PATCHLEVEL = 15
|
+SUBLEVEL = 0
|
EXTRAVERSION =
|
NAME = Superb Owl |
For some reason, SUBLEVEL = 99 causes the failure again even for 5.15...
I am currently out of ideas. I was asked to find a different reproducer because maybe mysqld is doing something depending on $KV but I am unable to reproduce with fio yet.
However, using MariaDB will always trigger the problem for:
- Do a clean boot (=not recovering from crash).
- With mysqld running but without any I/O yet, fsfreeze will work.
- After restoring ~150mb SQL file, fsfreeze will suddenly hang after FIFREEZE ioctl system call
- Reset the system (power cycle is required)
- When the system comes back, mysqld will do recovery (InnoDB: Starting final batch to recover 16658 pages from redo log.) – this is enough to trigger the problem again. I.e. without any additional I/O, fsfreeze would already hang.
The problem also occurs when I stop mysqld before doing the FIFREEZE ioctl system call.
I hope someone has an idea or can help me creating a reproducer not depending on mysqld.