[MDEV-28143] Data table corruption/crashing on btrfs Created: 2022-03-21  Updated: 2023-09-25

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.7, 10.7.3
Fix Version/s: 10.6

Type: Bug Priority: Major
Reporter: K Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

debian/devuan


Attachments: Text File up-1.txt, Text File up.txt
Issue Links:
Relates
relates to MDEV-13542 Crashing on a corrupted page is unhel... Closed

 Description   

Creating a database from a SQL dump (or reading an existing table) returns various errors (error 1877, "cannot open table"), or the server simply segfaults when a CHECK TABLE statement is issued.

This has happened on multiple versions of MariaDB and on multiple machines, including during a SQL dump/restore - in which case the offending table was EMPTY on both the source and destination machine.

Attached is a stack trace and log dump which may be of use.



 Comments   
Comment by K [ 2022-03-22 ]

Another crash loading a SQL dump.

Comment by Marko Mäkelä [ 2022-03-22 ]

Which file system and Linux kernel version are you using? The failed page reads might be a duplicate of MDEV-27900.

Comment by K [ 2022-03-22 ]

btrfs mounted w/ nodatacow
5.16.0-3-cloud-amd64 #1 SMP PREEMPT Debian 5.16.11-1 (2022-02-25) x86_64 GNU/Linux

Comment by K [ 2022-03-22 ]

I've been made aware that a fix for this issue was pushed into 10.7.4, so disregard my prior comment.

Comment by K [ 2022-03-23 ]

I'm not sure that MDEV-27900 is the exact same issue as this one - I have a snapshot of an affected DB that I ran on a 5.16.14 kernel (5.16.0-5-cloud-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15) x86_64 GNU/Linux) and the issue seems to persist. I'm going to try with a freshly formatted volume and a SQL dump.

Comment by Marko Mäkelä [ 2022-03-23 ]

Would setting innodb_flush_method=fsync work around the problem? If that is not enough, please also set innodb_use_native_aio=OFF. Both options may reduce performance.
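
For anyone trying this, a minimal sketch of applying the two overrides via a drop-in option file (the file path and service name are the usual Debian defaults, not something specified in this report):

```shell
# Hypothetical workaround config; path and unit name are assumptions.
cat > /etc/mysql/mariadb.conf.d/99-workaround.cnf <<'EOF'
[mysqld]
innodb_flush_method = fsync
innodb_use_native_aio = OFF
EOF
systemctl restart mariadb
```

Neither variable is dynamic, so a server restart is required for them to take effect.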

Comment by K [ 2022-03-24 ]

Okay, loading the SQL dump failed the first time. So, to eliminate btrfs as a confounding factor, I reformatted the drive as XFS and attempted to load the SQL dump into 10.7.3 again; it ran for several hours before hanging in a NON-KILLABLE state (it did not respond to repeated kill -9 calls) while emitting the following kernel error:

INFO: task mysql:12211 blocked for more than 1208 seconds.
Not tainted 5.16.0-5-cloud-amd64 #1 Debian 5.16.14-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:mysql state stack: 0 pid:12211 ppid: 5267 flags:0x0000400
Call Trace:
<TASK>
__schedule+0x30a/0x9f0
? preempt_count_add+0x68/0xa0
schedule+0x4e/0xc0
io_schedule+0x47/0x70
blk_mq_get_tag+0x11a/0x2b0
? do_wait_intr_irq+0xa0/0xa0
__blk_mq_alloc_requests+0x175/0x2d0
blk_mq_submit_bio+0x1c9/0x710
submit_bio_noacct+0x257/0x2a0
btrfs_map_bio+0x18a/0x4a0 [btrfs]
btrfs_submit_data_bio+0x104/0x1e0 [btrfs]
submit_one_bio+0x44/0x70 [btrfs]
extent_readahead+0x3c4/0x3f0 [btrfs]
? __mod_memcg_lruvec_state+0x6e/0xc0
? mod_lruvec_state+0x17/0x30
? workingset_refault+0x152/0x2c0
read_pages+0x84/0x240
page_cache_ra_unbounded+0x1ab/0x260
filemap_get_pages+0xec/0x760
filemap_read+0xbd/0x350
new_sync_read+0x118/0x1a0
vfs_read+0xf1/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fe06441410e
RSP: 002b:00007ffe50654118 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fe06441410e
RDX: 0000000000fff000 RSI: 00007fe06193b29f RDI: 0000000000000004
RBP: 00007ffe50654170 R08: 0000000000fff000 R09: 00007fe062bba700
R10: 0000000000ffeef1 R11: 0000000000000246 R12: 0000000000fff000
R13: 00007fe063af56a8 R14: 0000000000000000 R15: 00007fe06193b028
</TASK>

Please note that while /var/lib/mysql was formatted as XFS, the volume which housed the SQL dump (which I imported via source /path/to/dump.sql) was btrfs, so this apparently happened while reading the dump, not while saving it to the DB data directory.

So I formatted the drive to XFS again, reinstalled the DB, and re-imported the dump, this time with the following settings enabled:
innodb_change_buffering=none
innodb_flush_method=fsync
innodb_use_native_aio=OFF

This permitted the dump to import successfully, and as far as I can tell the data loaded correctly. I still need to run CHECK TABLE on the database, but as there are nearly 700 tables, that will take some time.
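
For checking that many tables in one pass, the bundled mysqlcheck client can issue CHECK TABLE across a whole database or server; the database name and credentials below are placeholders, not taken from this report:

```shell
# Sketch: run CHECK TABLE for every table on the server.
mysqlcheck --check --all-databases -u root -p

# Or restrict the check to just the restored database:
mysqlcheck --check mydb -u root -p
```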

Comment by Marko Mäkelä [ 2022-03-24 ]

vector_gorgoth, thank you. I think that danblack has the best knowledge of the Linux kernel bugs and changes in this area, or whether the default innodb_flush_method=O_DIRECT could cause trouble on XFS (which, like btrfs, supports file system snapshots and copy-on-write). It definitely does cause trouble on btrfs and reiserfs (MDEV-28100).

Disabling the change buffer is a good idea in any case; see MDEV-27734.

One more work-around might be innodb_page_size=4k, if your schema is compatible with that. I do not know it for sure, but I would expect that if your drive has a physical block size of 4096 bytes or if it is an SSD (whose flash translation layer internally performs copy-on-write), then the InnoDB doublewrite buffer could be safely disabled.

I’d like to know whether XFS works for you with both asynchronous I/O and O_DIRECT enabled. We mostly use ext4 in our internal testing, without any problems. The O_DIRECT trouble took me by surprise.
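
If someone wants to try the page-size experiment above, a rough sketch follows. Note that innodb_page_size can only be chosen when the data directory is initialized, so the instance must be rebuilt from a backup or dump; all paths here are assumptions, not from this report:

```shell
# Hypothetical: re-initialize the instance with a 4k page size.
# Back up the existing data directory before doing this.
cat > /etc/mysql/mariadb.conf.d/98-pagesize.cnf <<'EOF'
[mysqld]
innodb_page_size = 4k
# Only disable the doublewrite buffer if the device guarantees
# atomic 4096-byte writes, per the caveat above.
innodb_doublewrite = 0
EOF
mariadb-install-db --user=mysql --datadir=/var/lib/mysql
```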

Comment by K [ 2022-03-25 ]

For various reasons I attempted another dump/restore. As before, the volume containing the SQL dump is btrfs, the MariaDB data volume is XFS, and the server version is 10.7.3 on a 5.16.14 kernel. The three settings I had enabled before are still enabled:
innodb_change_buffering=none
innodb_flush_method=fsync
innodb_use_native_aio=OFF

But this time I got another hang, as before:

INFO: task mysql:43584 blocked for more than 1208 seconds.
Not tainted 5.16.0-5-cloud-amd64 #1 Debian 5.16.14-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:mysql state stack: 0 pid:43584 ppid: 30023 flags:0x00000000
Call Trace:
<TASK>
__schedule+0x30a/0x9f0
? preempt_count_add+0x68/0xa0
schedule+0x4e/0xc0
io_schedule+0x47/0x70
blk_mq_get_tag+0x11a/0x2b0
? do_wait_intr_irq+0xa0/0xa0
__blk_mq_alloc_requests+0x175/0x2d0
blk_mq_submit_bio+0x1c9/0x710
submit_bio_noacct+0x257/0x2a0
btrfs_map_bio+0x18a/0x4a0 [btrfs]
btrfs_submit_data_bio+0x104/0x1e0 [btrfs]
submit_one_bio+0x44/0x70 [btrfs]
extent_readahead+0x3c4/0x3f0 [btrfs]
? __mod_memcg_lruvec_state+0x6e/0xc0
? mod_lruvec_state+0x17/0x30
? workingset_refault+0x152/0x2c0
read_pages+0x84/0x240
page_cache_ra_unbounded+0x1ab/0x260
filemap_get_pages+0xec/0x760
filemap_read+0xbd/0x350
new_sync_read+0x118/0x1a0
vfs_read+0xf1/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7ff5b4f6910e
RSP: 002b:00007ffd76e6be58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007ff5b4f6910e
RDX: 0000000000fff000 RSI: 00007ff5b248e2af RDI: 0000000000000004
RBP: 00007ffd76e6beb0 R08: 0000000000fff000 R09: 00007ff5b3760920
R10: 0000000000fff147 R11: 0000000000000246 R12: 0000000000fff000
R13: 00007ff5b464a6a8 R14: 0000000000000000 R15: 00007ff5b248e028
</TASK>

I'm going to attempt again with innodb_page_size=4k enabled - just for the sake of experimentation.

In the meantime, it appears that something is seriously wrong either with the btrfs driver (the volume itself was freshly created on a brand-new EBS volume immediately before placing the SQL dump on it) or with the way MariaDB reads data when parsing dumps.
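
If it is the btrfs driver, some standard btrfs diagnostics might help narrow it down; the mount point below is a placeholder for wherever the dump volume is mounted:

```shell
# Check the volume holding the SQL dump for device-level problems.
btrfs device stats /mnt/dump        # cumulative per-device I/O error counters
btrfs scrub start -B /mnt/dump      # verify data/metadata checksums (-B waits)
dmesg | grep -i btrfs               # look for driver messages around the hang
```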

Comment by Marko Mäkelä [ 2022-06-22 ]

vector_gorgoth, thank you for the updates. How did the experiment with innodb_page_size=4k work out?

I wonder if a recent development snapshot of 10.6 (or any of the 10.10 preview releases), which contain a fix of MDEV-13542, would avoid a crash in this case. (Sure, the data would still be inaccessible, but the process should hopefully not crash.) If you are going to try this, please be aware that a downgrade from 10.8 or later to an older major version will require special tricks, due to MDEV-14425. We do not test downgrades between major versions ourselves, and we never guarantee that they work.

When it comes to the root cause of this, I would shift the blame to the btrfs implementation in the Linux kernel that you are using. It is possible that the bug has been fixed in a newer kernel.

I am reassigning this to danblack, who is our operating system ‘liaison officer’.

Comment by K [ 2022-06-22 ]

No combination of config options helped reliably - eventually I simply moved everything to XFS.

Generated at Thu Feb 08 09:58:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.