MariaDB Server / MDEV-27196

guest-agent fs-freeze not working on VM with Debian 11 + MariaDB 10.6 and 10.7

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.5, 10.7.1
    • Fix Version/s: None
    • Component/s: Packaging, Server
    • Labels: None
    • Environment: Virtual Machine (Debian 11.1 x86_64) on Proxmox VE 7.1

    Description

      There is something strange going on between MariaDB and qemu-ga on Debian 11 (virtual machine).

      While MariaDB is running, you cannot back up the running virtual machine (snapshot mode).
      The host sends the "fs-freeze" command to qemu-ga, but it does not complete properly:
      it hangs the entire VM.

      Affected MariaDB versions: 10.6.5 and 10.7.1
      Not affected: 10.5.13 (and 10.6.5 on CentOS 8 Stream)

      My test procedure.
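
      (The reporter's test procedure is not reproduced above. For reference, a rough sketch of driving the same guest-agent freeze path by hand from the Proxmox host; the VM ID 100 is a placeholder, and the qm guest cmd subcommand names are assumed from PVE 7.x.)

      # Sketch: manually exercise the guest-agent freeze/thaw that vzdump performs
      # (run on the Proxmox host; 100 is a placeholder VM ID).
      qm guest cmd 100 fsfreeze-status   # expected to report "thawed"
      qm guest cmd 100 fsfreeze-freeze   # hangs here when the bug triggers
      qm guest cmd 100 fsfreeze-thaw
      qm guest cmd 100 fsfreeze-status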

      Attachments

        Activity

          FingerlessGloves added a comment -

          I too have this issue with my install of MariaDB 10.7 on Debian 11.

          I also have MariaDB 10.5 on Debian 11, which doesn't have this issue, but that 10.5 comes from the Debian repos, not from MariaDB.

          I've also reported the issue to QEMU, but the fix may be within MariaDB's area.
          https://gitlab.com/qemu-project/qemu/-/issues/881

          MBO added a comment -

          We have the same issue on one of our servers after upgrading to MariaDB 10.6. The OS is CentOS 7, running on Proxmox.


          Faustin Lammler added a comment -

          Hi!
          I don't have a Proxmox environment to test with for now, so I tried to reproduce this on KVM/QEMU, and I am not able to reproduce it.

          Here is the test:

            - install a Debian 11 VM;
            - install MariaDB 10.6.11 from the MDBF repo;
            - import the DB from https://github.com/datacharmer/test_db;
            - create a snapshot from virt-manager.

          Some information about the test system (Debian 11):

          ❯ uname -r
          6.0.0-0.deb11.2-amd64
          ❯ virsh --version
          8.0.0
          ❯ qemu-system-x86_64 --version
          QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-11+deb11u2)
          Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
          ❯ dpkg -l | grep libvirt-daemon | awk '{print $1" "$2" "$3}'
          ii libvirt-daemon 8.0.0-1~bpo11+1
          ii libvirt-daemon-config-network 8.0.0-1~bpo11+1
          ii libvirt-daemon-config-nwfilter 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-lxc 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-qemu 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-vbox 8.0.0-1~bpo11+1
          ii libvirt-daemon-driver-xen 8.0.0-1~bpo11+1
          ii libvirt-daemon-system 8.0.0-1~bpo11+1
          ii libvirt-daemon-system-systemd 8.0.0-1~bpo11+1
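
          (One caveat with that reproduction attempt, offered as an assumption: libvirt only issues the guest-agent fs-freeze for quiesced snapshots, which a default virt-manager snapshot of a running VM may not request. A sketch of forcing that path explicitly with virsh; the domain name debian11 and snapshot name snap1 are placeholders.)

          # Sketch: exercise the qemu-guest-agent freeze directly, then via a
          # quiesced external snapshot (names are placeholders).
          virsh domfsfreeze debian11
          virsh domfsthaw debian11
          virsh snapshot-create-as debian11 snap1 --disk-only --atomic --quiesce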
          


          FingerlessGloves added a comment -

          @Faustin
          Your QEMU version is older than the one I tested with when I hit the issue: at the time of my testing I was using 6.1.1, while you're using 5.2.0 with a newer 6.x kernel. It could be that the issue doesn't happen on older QEMU versions, or the newer kernel could be the difference.

          Proxmox ships some newer packages on top of its Debian base.

          @MBO
          Can you get your kernel and QEMU versions from when you hit the issue? I assume you're on kernel 5.15 on PVE now; I had the issue on 5.13 when I reported it.
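
          (For anyone gathering the same data points, a quick sketch of the commands involved; pveversion is the Proxmox host tool, and the grep pattern is an assumption about its output format.)

          # Sketch: collect guest kernel and host kernel/QEMU versions.
          uname -r                                        # inside the VM
          pveversion -v | grep -E 'pve-kernel|pve-qemu'   # on the Proxmox host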

          MBO added a comment -

          @FingerlessGloves

          The VM with the issue was running CentOS 7 with kernel 3.10.0-1160.80.1.el7.x86_64.
          The Proxmox machine was running kernel 5.15.39-3, with Proxmox version 7.2-7 and QEMU 6.2.0.

          Hope this information is useful.

          Daniel Black added a comment - edited

          A notable 10.6 difference from 10.5 with regard to storage is that innodb_flush_method now defaults to O_DIRECT, where it previously defaulted to fsync. If you test with the old default, that might be useful information for the qemu folks.

          10.6 also added liburing as the innodb_use_native_aio=1 implementation (where liburing is available as a distro package). I've yet to look at how CentOS 8 Stream packages theirs, but there might be a difference there.

          Since tmpfs was mentioned a few times in the forum, I'll mention https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1020831, but that's pretty much only for the 5.10 Debian kernel, so it won't be the entire story.
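
          (A sketch of reverting to the pre-10.6 I/O defaults for that test; the drop-in path assumes the Debian /etc/mysql layout used by the MariaDB packages, and the mariadb service name is an assumption about the install.)

          # Sketch: switch back to the 10.5-era I/O defaults, then restart.
          printf '%s\n' \
            '[mariadb]' \
            '# 10.6 default is O_DIRECT; 10.5 defaulted to fsync' \
            'innodb_flush_method = fsync' \
            '# disable native AIO (liburing/libaio) for the test' \
            'innodb_use_native_aio = 0' \
            > /etc/mysql/mariadb.conf.d/90-freeze-test.cnf
          systemctl restart mariadb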

          Linus added a comment - edited

          After setting innodb_flush_method=fsync and innodb_use_native_aio=0, the error still occurs.

          (QEMU 7.0 with the i440fx machine type, tested on Debian 11, MariaDB 10.8)

          Backup log:

          INFO: starting new backup job: vzdump --mode snapshot
          INFO: Starting Backup of VM XXXXXX (qemu)
          INFO: Backup started at 2022-11-30 09:43:43
          INFO: status = running
          INFO: include disk 'scsi0' 'local-zfs:vm-XXXXXX-disk-0' 300G
          INFO: backup mode: snapshot
          INFO: ionice priority: 7
          INFO: snapshots found (not included into backup)
          INFO: creating Proxmox Backup Server archive 'vm/XXXXXX/2022-11-30T08:43:43Z'
          INFO: issuing guest-agent 'fs-freeze' command
           
          [TIMEOUT]
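
          (When testing those overrides, it may be worth confirming they actually took effect after the restart; a quick sketch, assuming the mariadb client can connect over the local socket.)

          # Sketch: verify the non-default settings are live before re-running the backup.
          mariadb -e "SELECT @@innodb_flush_method, @@innodb_use_native_aio;"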
          

          Alex added a comment -

          This is not limited to qemu-ga. Just issuing fsfreeze -f / in the VM can also lock up; the fsfreeze command in that case never completes.

          I can reliably reproduce this only with MariaDB actively writing to disk during the fsfreeze invocation, not with other disk-intensive applications and not while MariaDB is idle.

          Using mount namespaces (MariaDB in a separate mount namespace, or just other mount namespaces present on the system) only seems to make the lockup trigger sooner.

          innodb_flush_method = O_DIRECT / O_DSYNC makes no difference. I could NOT (yet) reproduce the issue with innodb_flush_method = fsync.

          I have not tried reproducing it outside of a QEMU VM.

          VM:

          Kernel: Debian 6.1.119-1
          Filesystem: ext4
          MariaDB: 10.11.6-0+deb12u1
          

          Host:

          Kernel: Debian 6.11.10-1
          Filesystem: ext4
          Qemu: 9.1.2+ds-1
          

          To reproduce, run this in the VM while heavily writing to the DB from multiple threads (insert, update, delete). After a while, fsfreeze -f / will lock up.

          while true; do \
            echo "$(date '+%F %T.%N') freezing";
            fsfreeze -f /;
            echo "$(date '+%F %T.%N') frozen";
            sleep 0.1;
            echo "$(date '+%F %T.%N') thawing";
            fsfreeze -u /;
            echo "$(date '+%F %T.%N') thawed";
            sleep 0.1;
            echo;
          done;
          

          I create write load by running the prepare script from percona tpcc.
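
          (As an alternative way to generate heavy multithreaded writes against MariaDB, a sysbench sketch follows; whether it triggers the hang as reliably as the tpcc load is untested here, and the connection parameters, database name and sizes are placeholders.)

          # Sketch: sustained multithreaded write load against MariaDB with sysbench
          # (credentials, database name and sizes are placeholders).
          sysbench oltp_write_only \
            --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
            --mysql-db=sbtest --tables=8 --table-size=1000000 --threads=16 \
            prepare
          sysbench oltp_write_only \
            --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
            --mysql-db=sbtest --tables=8 --table-size=1000000 --threads=16 \
            --time=0 run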

          Could this be MariaDB doing something unexpected or wrong while writing, or maybe even a kernel bug?


          People

            Assignee: Unassigned
            Reporter: Paweł Kośka (NoLIne)
            Votes: 1
            Watchers: 10
