[MDEV-21215] Random InnoDB: fsync() returned 5 using Btrfs with 10.3.17

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not a Bug
Affects Version/s: 10.3.17
Fix Version/s: N/A
Component/s: Platform Debian, Storage Engine - InnoDB
Labels: None
Environment: Debian 10 64 bit Btrfs

Description

We upgraded from Debian 9 (MariaDB 10.1.38) to Debian 10 (MariaDB 10.3.17) and experienced huge slowdowns (perhaps related to https://jira.mariadb.org/browse/MDEV-16333).

We tried to speed things up with these settings:

innodb_flush_method = O_DIRECT_NO_FSYNC
innodb_use_atomic_writes = 0
innodb_deadlock_detect = 0

But randomly it produces these errors:

2019-12-04 0:24:31 12111440 [ERROR] [FATAL] InnoDB: fsync() returned 5
191204 0:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.3.17-MariaDB-0+deb10u1-log

Issue Links

relates to

MDEV-17482 InnoDB fails to say which fatal error fsync() returned

Closed

MDEV-21950 Mariadb (galera) node crashed with an error: [ERROR] [FATAL] InnoDB: fsync() returned 5

Closed

Activity

Marko Mäkelä added a comment - 2019-12-13 14:45

We got a similar report in MDEV-17482. To get more accurate diagnostics, I added output of the error code.

The error code 5 should be "Input/output error" (EIO).
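As a quick way to confirm what a numeric errno such as 5 means on a given machine, the following is a minimal sketch using only the Python standard library. The helper name describe_errno is hypothetical and not part of MariaDB or InnoDB; the message text comes from the local C library, so the exact wording may differ between systems.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (message)' for a numeric errno value.

    describe_errno is a hypothetical helper for illustration only.
    The symbolic name comes from errno.errorcode and the message
    from the platform's strerror(), so the wording is system-dependent.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


# The value reported in the InnoDB fatal error message above.
print(describe_errno(5))
```

On Linux, errno 5 maps to the symbolic name EIO, which is what fsync() reports when the kernel could not write the data to the underlying storage.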

laci, do you see any messages about file system corruption or block device errors in the output of the following commands?

sudo dmesg

journalctl -xe

Also, if applicable, I would recommend checking the output of sudo smartctl -A /dev/sda (assuming that the file system of the InnoDB data directory is located on that device).
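When the output of dmesg or journalctl is long, a small filter can help narrow it down. The sketch below is only an illustration: the function name suspicious_lines and the regular expression are assumptions, and real kernel messages vary by kernel version, driver, and file system, so a clean result does not prove the absence of errors.

```python
import re
from typing import List

# Illustrative patterns only (an assumption, not an exhaustive list):
# generic block-layer "I/O error" messages and Btrfs-prefixed
# error, critical, or warning messages.
SUSPECT = re.compile(r"I/O error|BTRFS (error|critical|warning)", re.IGNORECASE)


def suspicious_lines(log_text: str) -> List[str]:
    """Return the lines of log_text that match one of the SUSPECT patterns."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]
```

One way to use it would be to capture the kernel log with subprocess.run(["dmesg"], capture_output=True, text=True) and pass the resulting stdout string to suspicious_lines; reading the kernel log may require root privileges on some systems.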

Laszlo Laci added a comment - 2019-12-16 09:44

It's a Xen VM, and we see the same problem with other VMs too. The VMs run on different dedicated servers, and none of them has disk errors.

Marko Mäkelä added a comment - 2019-12-19 07:49

laci, given that hardware failure has been ruled out, I would primarily point the finger at the file system (btrfs). A quick search returned a Linux kernel fix for something in fsync() on btrfs. It might not exactly match what you are seeing, because it mentions an assertion failure. If those assertions are not enabled in normal kernel builds, you might observe fsync() returning EIO instead under that scenario.

I wonder if a different innodb_flush_method could work around it.

As far as I know, we do not use btrfs in internal testing. I do not remember the fsync() call ever failing in our internal tests.

Laszlo Laci added a comment - 2019-12-20 11:18

Thank you very much. The Debian buster-backports kernel 5.3.9-2~bpo10+1 contains that patch. I will install it and let it run for 2-3 weeks. I hope it fixes this problem. Early next year, I'll let you know whether the issue has been resolved with that kernel.

Which filesystems do you use in MariaDB internal testing?

Marko Mäkelä added a comment - 2020-01-03 18:08

laci, did the kernel upgrade help?
I think that we mostly use ext4 and tmpfs (/dev/shm) on GNU/Linux, and NTFS on Microsoft Windows.

Laszlo Laci added a comment - 2020-01-06 08:06

The backport kernel seems to have fixed the bug; it hasn't come up since.

People

Assignee: Unassigned

Reporter: Laszlo Laci

Votes: 0

Watchers: 3

Dates

Created: 2019-12-04 10:36

Updated: 2020-03-16 09:17

Resolved: 2020-01-12 23:42

MariaDB Server