MariaDB Server / MDEV-29349

I/O from MariaDB causes FIFREEZE ioctl system call to hang on NVME devices


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.8
    • Fix Version/s: 10.6
    • Component/s: None

    Description

      Hi,

      while trying to back up a Dell R7525 system (AMD EPYC 7002 series, codename "Rome") using LVM snapshots, I noticed that the system sometimes (not every time) 'freezes' when creating the snapshot.

      First I thought this was related to LVM, so I created

      https://listman.redhat.com/archives/linux-lvm/2022-July/026228.html
      (continued at
      https://listman.redhat.com/archives/linux-lvm/2022-August/thread.html#26229)

      Long story short:

      I was even able to reproduce it with fsfreeze; see the last strace lines:

      [...]
      14471 1659449870.984635 openat(AT_FDCWD, "/var/lib/machines", O_RDONLY) = 3
      14471 1659449870.984658 newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
      14471 1659449870.984678 ioctl(3, FIFREEZE
      
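What fsfreeze does at this point can be sketched directly. A minimal Python sketch (the `freeze`/`thaw` helper names are mine; the ioctl numbers follow the `_IOWR('X', 119/120, int)` encoding from `<linux/fs.h>`):

```python
# Minimal sketch of what fsfreeze -f / -u do: issue the FIFREEZE /
# FITHAW ioctls on an fd of the mountpoint. FIFREEZE is the call that
# hangs in the strace above. Needs CAP_SYS_ADMIN; use with care.
import fcntl
import os
import sys

# <linux/fs.h>: FIFREEZE = _IOWR('X', 119, int), FITHAW = _IOWR('X', 120, int)
# encoding: (read|write) << 30 | sizeof(int) << 16 | ord('X') << 8 | nr
_IOC_WRITE, _IOC_READ = 1, 2

def _iowr(typ: str, nr: int, size: int) -> int:
    return ((_IOC_READ | _IOC_WRITE) << 30) | (size << 16) | (ord(typ) << 8) | nr

FIFREEZE = _iowr("X", 119, 4)  # 0xC0045877
FITHAW = _iowr("X", 120, 4)    # 0xC0045878

def freeze(mountpoint: str) -> None:
    """Freeze the filesystem mounted at mountpoint (may hang on affected kernels)."""
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
    finally:
        os.close(fd)

def thaw(mountpoint: str) -> None:
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)

if __name__ == "__main__" and len(sys.argv) > 1:
    freeze(sys.argv[1])  # e.g. /var/lib/machines, as in the strace above
    print("frozen; thawing")
    thaw(sys.argv[1])
```

This does exactly what the strace shows: `openat` on the mountpoint followed by `ioctl(fd, FIFREEZE)`.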

      so I started to bisect the kernel and found the following bad commit:

      md: add support for REQ_NOWAIT
       
      commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support
      for checking whether a given bdev supports handling of REQ_NOWAIT or not.
      Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable
      it for linear target") added support for REQ_NOWAIT for dm. This uses
      a similar approach to incorporate REQ_NOWAIT for md based bios.
       
      This patch was tested using t/io_uring tool within FIO. A nvme drive
      was partitioned into 2 partitions and a simple raid 0 configuration
      /dev/md0 was created.
       
      md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
            937423872 blocks super 1.2 512k chunks
       
      Before patch:
       
      $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
       
      Running top while the above runs:
       
      $ ps -eL | grep $(pidof io_uring)
       
        38396   38396 pts/2    00:00:00 io_uring
        38396   38397 pts/2    00:00:15 io_uring
        38396   38398 pts/2    00:00:13 iou-wrk-38397
       
      We can see iou-wrk-38397 io worker thread created which gets created
      when io_uring sees that the underlying device (/dev/md0 in this case)
      doesn't support nowait.
       
      After patch:
       
      $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
       
      Running top while the above runs:
       
      $ ps -eL | grep $(pidof io_uring)
       
        38341   38341 pts/2    00:10:22 io_uring
        38341   38342 pts/2    00:10:37 io_uring
       
      After running this patch, we don't see any io worker thread
      being created which indicated that io_uring saw that the
      underlying device does support nowait. This is the exact behaviour
      noticed on a dm device which also supports nowait.
       
      For all the other raid personalities except raid0, we would need
      to train pieces which involves make_request fn in order for them [...]

      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f51d46d0e7cb5b8494aa534d276a9d8915a2443d

      After reverting this commit (and the follow-up commit 0f9650bd838efe5c52f7e5f40c3204ad59f1964d), v5.18.15 and v5.19 worked for me again.
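As a side note, the iou-wrk observation from the commit message can be checked programmatically rather than by eyeballing ps -eL. A rough sketch (the function name and the `proc` parameter, which exists only to make it testable, are mine):

```python
# Count io_uring worker threads (named "iou-wrk-<tid>") of a process by
# scanning /proc/<pid>/task/*/comm. Per the commit message quoted above,
# such workers appear only when the underlying device does NOT support
# REQ_NOWAIT.
import glob

def count_iou_workers(pid: int, proc: str = "/proc") -> int:
    n = 0
    for comm_path in glob.glob(f"{proc}/{pid}/task/*/comm"):
        with open(comm_path) as f:
            if f.read().strip().startswith("iou-wrk"):
                n += 1
    return n
```

Run against the pid of t/io_uring: a non-zero count before the patch and zero after would match the behaviour described in the commit message.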

      However, I am seeing the same problem when using the NVMe device directly, i.e. when no mdraid is involved.

      After I reported this upstream on the kernel mailing list, I was asked to run the bisect again against the single NVMe device. I tried that, but it fails: the bisect always ends with

      first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1 
      

      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf

      ...but this doesn't make any sense, right?

      The latest kernel is working for me when I just do

      diff --git a/Makefile b/Makefile
      index 23162e2bdf14..0f344944d828 100644
      --- a/Makefile
      +++ b/Makefile
      @@ -1,7 +1,7 @@
       # SPDX-License-Identifier: GPL-2.0
       VERSION = 5
      -PATCHLEVEL = 18
      -SUBLEVEL = 18
      +PATCHLEVEL = 15
      +SUBLEVEL = 0
       EXTRAVERSION =
       NAME = Superb Owl
      

      For some reason, SUBLEVEL = 99 causes the failure again even for 5.15...

      I am currently out of ideas. I was asked to find a different reproducer, because maybe mysqld is doing something that depends on the kernel version ($KV), but I have not been able to reproduce this with fio yet.

      However, using MariaDB will always trigger the problem:

      1. Do a clean boot (i.e. not recovering from a crash).
      2. With mysqld running but without any I/O yet, fsfreeze will work.
      3. After restoring a ~150 MB SQL file, fsfreeze will suddenly hang in the FIFREEZE ioctl system call.
      4. Reset the system (a power cycle is required).
      5. When the system comes back, mysqld will do recovery (InnoDB: Starting final batch to recover 16658 pages from redo log.) – this is enough to trigger the problem again, i.e. without any additional I/O, fsfreeze will already hang.
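Step 3 is where the hang shows up. A small sketch (the helper and the 30-second timeout are mine, not part of the report) that wraps the fsfreeze call so a hung FIFREEZE is detected instead of wedging the calling shell:

```python
# Run a command with a timeout; return False if it did not finish in
# time (e.g. fsfreeze -f stuck in the FIFREEZE ioctl). A non-zero exit
# status still raises CalledProcessError, so real errors stay visible.
import subprocess

def run_with_timeout(cmd: list[str], timeout_s: float) -> bool:
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False

# Hypothetical usage against the affected mountpoint (needs root):
# if not run_with_timeout(["fsfreeze", "-f", "/var/lib/machines"], 30):
#     print("FIFREEZE did not return within 30s - bug reproduced")
```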

      The problem also occurs when I stop mysqld before doing the FIFREEZE ioctl system call.

      I hope someone has an idea or can help me create a reproducer that does not depend on mysqld.


          People

            Assignee: Unassigned
            Reporter: Thomas Deutschmann (whissi)
            Votes: 0
            Watchers: 4

