MDEV-21215 (MariaDB Server)

Random InnoDB: fsync() returned 5 using Btrfs with 10.3.17

    Description

      We upgraded from Debian 9 (MariaDB 10.1.38) to Debian 10 (MariaDB 10.3.17). After the upgrade we experienced huge slowdowns (perhaps related to https://jira.mariadb.org/browse/MDEV-16333).

      We tried to speed things up with these settings:

      innodb_flush_method = O_DIRECT_NO_FSYNC
      innodb_use_atomic_writes = 0
      innodb_deadlock_detect = 0

      But at random intervals the server produces these errors:

      2019-12-04 0:24:31 12111440 [ERROR] [FATAL] InnoDB: fsync() returned 5
      191204 0:24:31 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.

      To report this bug, see https://mariadb.com/kb/en/reporting-bugs

      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.

      Server version: 10.3.17-MariaDB-0+deb10u1-log

          Activity

            marko Marko Mäkelä added a comment -

            We got a similar report in MDEV-17482. To get more accurate diagnostics, I added output of the error code.

            The error code 5 should be "Input/output error" (EIO).

            laci, do you see any messages about file system corruption or block device errors in the output of the following commands?

            sudo dmesg
            journalctl -xe

            Also, if applicable, I would recommend checking sudo smartctl -A /dev/sda (assuming that the file system of the InnoDB data directory is located on that device).

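As a quick cross-check of the error code discussed in this comment, the number can be mapped to its symbolic name and message with Python's standard library. This is only an illustrative sketch: the helper name `describe_errno` is hypothetical and is not part of MariaDB or InnoDB.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno value (hypothetical helper)."""
    # errno.errorcode maps numeric values to symbolic names such as 'EIO'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message text for the code.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 5 is the value reported in "InnoDB: fsync() returned 5".
    print(describe_errno(5))
```

The exact message text depends on the platform's C library, so only the symbolic name should be relied on when comparing systems.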
            laci Laszlo Laci added a comment -

            It's a Xen VM, and we see the same problem with other VMs too. The VMs run on different dedicated servers, and none of them has disk errors.


            marko Marko Mäkelä added a comment -

            laci, given that hardware failure has been ruled out, I would primarily point the finger at the file system (btrfs). A quick search turned up a Linux kernel fix for something in fsync() on btrfs. It might not exactly match what you are seeing, because it mentions an assertion failure. If those assertions are not enabled in normal kernel builds, under that scenario you might observe fsync() returning EIO instead.

            I wonder if a different innodb_flush_method could work around it.

            As far as I know, we do not use btrfs in internal testing. I do not remember the fsync() call ever failing in our internal tests.

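To make the failure mode in this comment concrete, here is a minimal sketch of how a program can observe fsync() failing and which error code it reports. It is not InnoDB's implementation; the function name `write_and_sync` and the temporary file are assumptions made for illustration only.

```python
import os
import tempfile


def write_and_sync(path: str, data: bytes) -> int:
    """Write data to path and fsync it; return 0 on success, else the errno."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # A short payload is assumed, so a single write call is sufficient here.
        os.write(fd, data)
        # os.fsync raises OSError if the kernel reports a failure (for example EIO).
        os.fsync(fd)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = write_and_sync(os.path.join(tmp, "probe.dat"), b"probe")
        print("fsync result:", result)
```

On a healthy file system this returns 0. A return value of 5 would correspond to the EIO seen in the report, which InnoDB treats as fatal according to the log excerpt in the description.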
            laci Laszlo Laci added a comment -

            Thank you very much. The Debian buster-backports kernel 5.3.9-2~bpo10+1 contains that patch. I will install it and let it run for 2-3 weeks. I hope it fixes this problem. Early next year, I'll let you know whether the issue has been resolved with that kernel.

            Which filesystems do you use in MariaDB internal testing?


            marko Marko Mäkelä added a comment -

            laci, did the kernel upgrade help?
            I think that we mostly use ext4 and tmpfs (/dev/shm) on GNU/Linux, and NTFS on Microsoft Windows.

            laci Laszlo Laci added a comment -

            The backport kernel seems to have fixed the bug; it hasn't come up since.


