[MDEV-21215] Random InnoDB: fsync() returned 5 using Btrfs with 10.3.17 Created: 2019-12-04  Updated: 2020-03-16  Resolved: 2020-01-12

Status: Closed
Project: MariaDB Server
Component/s: Platform Debian, Storage Engine - InnoDB
Affects Version/s: 10.3.17
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Laszlo Laci Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Debian 10 64 bit Btrfs


Issue Links:
Relates
relates to MDEV-17482 InnoDB fails to say which fatal error... Closed
relates to MDEV-21950 Mariadb (galera) node crashed with an... Closed

 Description   

We upgraded from Debian 9 (MariaDB 10.1.38) to Debian 10 (MariaDB 10.3.17) and experienced huge slowdowns (perhaps related to https://jira.mariadb.org/browse/MDEV-16333).

We tried to speed things up with these settings:

innodb_flush_method = O_DIRECT_NO_FSYNC
innodb_use_atomic_writes = 0
innodb_deadlock_detect = 0

But the server now randomly crashes with these errors:

2019-12-04 0:24:31 12111440 [ERROR] [FATAL] InnoDB: fsync() returned 5
191204 0:24:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.3.17-MariaDB-0+deb10u1-log



 Comments   
Comment by Marko Mäkelä [ 2019-12-13 ]

We got a similar report in MDEV-17482. To get more accurate diagnostics, I added output of the error code.

The error code 5 should be EIO, "Input/output error".
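As a quick cross-check of that mapping, here is a minimal Python sketch that decodes a numeric errno value using the standard library. The helper name describe_errno is hypothetical, and the message text comes from the platform's C library, so it may differ between systems.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno value.

    describe_errno is a hypothetical helper for illustration only.
    """
    # errno.errorcode maps numeric values to symbolic names such as 'EIO'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for the code.
    return f"{name}: {os.strerror(code)}"


# Decode the value reported in "InnoDB: fsync() returned 5".
print(describe_errno(5))
```

On Linux the symbolic name for 5 is EIO, which is the value fsync() reports when the kernel could not write the data to the underlying storage.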

laci, do you see any messages about file system corruption or block device errors in the output of the following commands?

sudo dmesg
journalctl -xe

Also, if applicable, I would recommend checking the output of sudo smartctl -A /dev/sda (assuming that the file system of the InnoDB data directory is located on that device).
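Kernel logs can be long, so a small filter may help when looking through the dmesg or journalctl output. The following is a rough Python sketch, not a definitive diagnostic: the function name find_io_errors and the regular expression are assumptions chosen for illustration, and a real system may log relevant problems in other forms.

```python
import re
from typing import Iterable, List

# Assumed patterns: generic "I/O error" lines and btrfs messages at
# error, critical, or warning level. Adjust to what your kernel prints.
SUSPICIOUS = re.compile(r"I/O error|BTRFS (?:error|critical|warning)",
                        re.IGNORECASE)


def find_io_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match the assumed error patterns."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


# Example use with piped input (e.g. sudo dmesg | python3 this_script.py):
#     import sys
#     for hit in find_io_errors(sys.stdin):
#         print(hit)
```

An empty result does not prove that the storage stack is healthy; it only means that none of the assumed patterns appeared in the text that was scanned.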

Comment by Laszlo Laci [ 2019-12-16 ]

It's a Xen VM, and we see the same problem on other VMs too. The VMs run on different dedicated servers, and none of them has disk errors.

Comment by Marko Mäkelä [ 2019-12-19 ]

laci, given that hardware failure has been ruled out, I would primarily point the finger at the file system (btrfs). A quick search returned a Linux kernel fix for a problem in fsync() on btrfs. It might not exactly match what you are seeing, because it mentions an assertion failure. If those assertions are not enabled in normal kernel builds, under that scenario you might observe fsync() returning EIO instead.

I wonder if a different innodb_flush_method could work around it.

As far as I know, we do not use btrfs in internal testing. I do not remember the fsync() call ever failing in our internal tests.

Comment by Laszlo Laci [ 2019-12-20 ]

Thank you very much. The Debian buster-backports kernel 5.3.9-2~bpo10+1 contains that patch. I will install it and let it run for 2-3 weeks. I hope it fixes this problem. Early next year, I'll let you know whether the issue has been resolved with that kernel.

Which filesystems do you use in MariaDB internal testing?

Comment by Marko Mäkelä [ 2020-01-03 ]

laci, did the kernel upgrade help?
I think that we mostly use ext4 and tmpfs (/dev/shm) on GNU/Linux, and NTFS on Microsoft Windows.

Comment by Laszlo Laci [ 2020-01-06 ]

The backport kernel seems to have fixed the bug; it hasn't come up since.

Generated at Thu Feb 08 09:05:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.