MariaDB Server / MDEV-27196

guest-agent fs-freeze not working on VM with Debian 11 + MariaDB 10.6 and 10.7

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.5, 10.7.1
    • Fix Version/s: None
    • Component/s: Packaging, Server
    • Labels: None
    • Environment: Virtual Machine (Debian 11.1 x86_64) on Proxmox VE 7.1

    Description

      There is something strange going on between MariaDB and qemu-ga on Debian 11 (virtual machine).

      While MariaDB is running, you cannot back up the running virtual machine (snapshot mode).
      The host sends the "fs-freeze" command to qemu-ga, but it does not complete properly:
      it hangs the entire VM.

      Affected MariaDB versions: 10.6.5 and 10.7.1
      Not affected: 10.5.13 (and 10.6.5 on CentOS 8 Stream)

      My test procedure.
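
      (The reporter's test procedure is not reproduced above. For reference, a rough sketch of driving the same guest-agent freeze path by hand from the Proxmox host; the VM ID 100 is a placeholder, and the qm guest cmd subcommand names are assumed from PVE 7.x.)

      # Sketch: manually exercise the guest-agent freeze/thaw that vzdump performs
      # (run on the Proxmox host; 100 is a placeholder VM ID).
      qm guest cmd 100 fsfreeze-status   # expected to report "thawed"
      qm guest cmd 100 fsfreeze-freeze   # hangs here when the bug triggers
      qm guest cmd 100 fsfreeze-thaw
      qm guest cmd 100 fsfreeze-status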

      Attachments

        Activity

          FingerlessGloves added a comment -

          I too have this issue with my install of MariaDB 10.7 on Debian 11.

          I also have MariaDB 10.5 on Debian 11, which doesn't have this issue, but that 10.5 comes from the Debian repos, not from MariaDB.

          I've also reported the issue to QEMU, but the fix may be within MariaDB's area.
          https://gitlab.com/qemu-project/qemu/-/issues/881

          MBO added a comment -

          We have the same issue on one of our servers after upgrading to MariaDB 10.6. The OS is CentOS 7, running on Proxmox.


          Faustin Lammler added a comment -

          Hi!
          I don't have a Proxmox environment to test with for now, so I tried to reproduce this on KVM/QEMU, and I am not able to reproduce it.

          Here is the test:

            - install a Debian 11 VM;
            - install MariaDB 10.6.11 from the MDBF repo;
            - import the DB from https://github.com/datacharmer/test_db;
            - create a snapshot from virt-manager.

          Some information about the test system (Debian 11):

          ❯ uname -r
          6.0.0-0.deb11.2-amd64
          ❯ virsh --version
          8.0.0
          ❯ qemu-system-x86_64 --version
          QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-11+deb11u2)
          Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
          ❯ dpkg -l | grep libvirt-daemon | awk '{print $1" "$2" "$3}'
          ii libvirt-daemon 8.0.0-1~bpo11+1
          ii libvirt-daemon-config-network 8.0.0-1~bpo11+1
          ii libvirt-daemon-config-nwfilter 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-lxc 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-qemu 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-vbox 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-xen 8.0.0-1~bpo11+1
          ii libvirt-daemon-system 8.0.0-1~bpo11+1
          ii libvirt-daemon-system-systemd 8.0.0-1~bpo11+1
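
          (One caveat with that reproduction attempt, offered as an assumption: libvirt only issues the guest-agent fs-freeze for quiesced snapshots, which a default virt-manager snapshot of a running VM may not request. A sketch of forcing that path explicitly with virsh; the domain name debian11 and snapshot name snap1 are placeholders.)

          # Sketch: exercise the qemu-guest-agent freeze directly, then via a
          # quiesced external snapshot (names are placeholders).
          virsh domfsfreeze debian11
          virsh domfsthaw debian11
          virsh snapshot-create-as debian11 snap1 --disk-only --atomic --quiesce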
          


          FingerlessGloves added a comment -

          @Faustin
          Your QEMU version is older than the one I tested with when I hit the issue: at the time of my testing I was using 6.1.1, while you're using 5.2.0 with a newer 6.x kernel. It could be that the issue doesn't happen on older QEMU versions, or the newer kernel could be the difference.

          Proxmox ships some newer packages on top of its Debian base.

          @MBO
          Can you get your kernel and QEMU versions from when you hit the issue? I assume you're on kernel 5.15 on PVE now; I had the issue on 5.13 when I reported it.
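
          (For anyone gathering the same data points, a quick sketch of the commands involved; pveversion is the Proxmox host tool, and the grep pattern is an assumption about its output format.)

          # Sketch: collect guest kernel and host kernel/QEMU versions.
          uname -r                                        # inside the VM
          pveversion -v | grep -E 'pve-kernel|pve-qemu'   # on the Proxmox host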

          MBO added a comment -

          @FingerlessGloves

          The VM with the issue was running CentOS 7 with kernel 3.10.0-1160.80.1.el7.x86_64.
          The Proxmox machine was running kernel 5.15.39-3, with Proxmox version 7.2-7 and QEMU 6.2.0.

          Hope this information is useful.

          Daniel Black added a comment - edited

          A notable 10.6 difference from 10.5 with regard to storage is that innodb_flush_method now defaults to O_DIRECT, where it previously defaulted to fsync. If you test with the old default, that might be useful information for the qemu folks.

          10.6 also added liburing as the innodb_use_native_aio=1 implementation (where liburing is available as a distro package). I've yet to look at how CentOS 8 Stream packages theirs, but there might be a difference there.

          Since tmpfs was mentioned a few times in the forum, I'll mention https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1020831, but that's pretty much only for the 5.10 Debian kernel, so it won't be the entire story.
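
          (A sketch of reverting to the pre-10.6 I/O defaults for that test; the drop-in path assumes the Debian /etc/mysql layout used by the MariaDB packages, and the mariadb service name is an assumption about the install.)

          # Sketch: switch back to the 10.5-era I/O defaults, then restart.
          printf '%s\n' \
            '[mariadb]' \
            '# 10.6 default is O_DIRECT; 10.5 defaulted to fsync' \
            'innodb_flush_method = fsync' \
            '# disable native AIO (liburing/libaio) for the test' \
            'innodb_use_native_aio = 0' \
            > /etc/mysql/mariadb.conf.d/90-freeze-test.cnf
          systemctl restart mariadb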

          Linus added a comment - edited

          After setting innodb_flush_method=fsync and innodb_use_native_aio=0, the error still occurs.

          (QEMU 7.0 with the i440fx machine type, tested on Debian 11, MariaDB 10.8)

          Backup log:

          INFO: starting new backup job: vzdump --mode snapshot
          INFO: Starting Backup of VM XXXXXX (qemu)
          INFO: Backup started at 2022-11-30 09:43:43
          INFO: status = running
          INFO: include disk 'scsi0' 'local-zfs:vm-XXXXXX-disk-0' 300G
          INFO: backup mode: snapshot
          INFO: ionice priority: 7
          INFO: snapshots found (not included into backup)
          INFO: creating Proxmox Backup Server archive 'vm/XXXXXX/2022-11-30T08:43:43Z'
          INFO: issuing guest-agent 'fs-freeze' command
           
          [TIMEOUT]
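
          (When testing those overrides, it may be worth confirming they actually took effect after the restart; a quick sketch, assuming the mariadb client can connect over the local socket.)

          # Sketch: verify the non-default settings are live before re-running the backup.
          mariadb -e "SELECT @@innodb_flush_method, @@innodb_use_native_aio;"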
          

          Alex added a comment -

          This is not limited to qemu-ga. Just issuing fsfreeze -f / in the VM can also lock up; the fsfreeze command in that case never completes.

          I can reliably reproduce this only with MariaDB actively writing to disk during the fsfreeze invocation, not with other disk-intensive applications and not while MariaDB is idle.

          Using mount namespaces (MariaDB in a separate mount namespace, or just other mount namespaces present on the system) only seems to make the lockup trigger sooner.

          innodb_flush_method = O_DIRECT / O_DSYNC makes no difference. I could NOT (yet) reproduce the issue with innodb_flush_method = fsync.

          I have not tried reproducing it outside of a QEMU VM.

          VM:

          Kernel: Debian 6.1.119-1
          Filesystem: ext4
          MariaDB: 10.11.6-0+deb12u1
          

          Host:

          Kernel: Debian 6.11.10-1
          Filesystem: ext4
          Qemu: 9.1.2+ds-1
          

          To reproduce, run this in the VM while heavily writing to the DB from multiple threads (insert, update, delete). After a while, fsfreeze -f / will lock up.

          while true; do \
            echo "$(date '+%F %T.%N') freezing";
            fsfreeze -f /;
            echo "$(date '+%F %T.%N') frozen";
            sleep 0.1;
            echo "$(date '+%F %T.%N') thawing";
            fsfreeze -u /;
            echo "$(date '+%F %T.%N') thawed";
            sleep 0.1;
            echo;
          done;
          

          I create write load by running the prepare script from percona tpcc.
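
          (As an alternative way to generate heavy multithreaded writes against MariaDB, a sysbench sketch follows; whether it triggers the hang as reliably as the tpcc load is untested here, and the connection parameters, database name and sizes are placeholders.)

          # Sketch: sustained multithreaded write load against MariaDB with sysbench
          # (credentials, database name and sizes are placeholders).
          sysbench oltp_write_only \
            --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
            --mysql-db=sbtest --tables=8 --table-size=1000000 --threads=16 \
            prepare
          sysbench oltp_write_only \
            --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
            --mysql-db=sbtest --tables=8 --table-size=1000000 --threads=16 \
            --time=0 run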

          Could this be MariaDB doing something unexpected or wrong while writing, or maybe even a kernel bug?


          People

            Assignee: Unassigned
            Reporter: Paweł Kośka (NoLIne)
            Votes: 1
            Watchers: 10
