MariaDB Server / MDEV-28143

Data table corruption/crashing on btrfs

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.7, 10.7.3
    • Fix Version/s: 10.6
    • Component/s: None
    • Environment: debian/devuan

    Description

      Creating a database from an SQL dump (or reading an existing table) returns various errors - 1877, "cannot open table" - or the server simply segfaults when a CHECK TABLE statement is issued.

      This has happened on multiple versions of MariaDB and on multiple machines, including during a SQL dump/restore - in which case the offending table was EMPTY on both the source and destination machine.

      Attached is a stack trace and log dump which may be of use.
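      For illustration, the failing sequence looks roughly like the following (the database and table names are placeholders; the real schema is in the attached dump):

      -- placeholder names; the restore is run from the mariadb client
      CREATE DATABASE example_db;
      USE example_db;
      -- restoring the dump intermittently fails with error 1877 or "cannot open table"
      source /path/to/dump.sql
      -- checking a restored (sometimes empty) table can crash the server
      CHECK TABLE example_table;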

      Attachments

        1. up.txt
          4.85 MB
        2. up-1.txt
          55 kB


          Activity

            vector_gorgoth K added a comment - edited

            Okay, loading the SQL dump failed the first time - so, to eliminate BTRFS as a confounding factor, I formatted the drive to XFS and attempted to load the SQL dump again into 10.7.3; it ran for several hours before hanging in a NON-KILLABLE state (it did not respond to repeated kill -9 calls), while emitting the following kernel message:

            INFO: task mysql:12211 blocked for more than 1208 seconds.
            Not tainted 5.16.0-5-cloud-amd64 #1 Debian 5.16.14-1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            task:mysql state stack: 0 pid:12211 ppid: 5267 flags:0x0000400
            Call Trace:
            <TASK>
            __schedule+0x30a/0x9f0
            ? preempt_count_add+0x68/0xa0
            schedule+0x4e/0xc0
            io_schedule+0x47/0x70
            blk_mq_get_tag+0x11a/0x2b0
            ? do_wait_intr_irq+0xa0/0xa0
            __blk_mq_alloc_requests+0x175/0x2d0
            blk_mq_submit_bio+0x1c9/0x710
            submit_bio_noacct+0x257/0x2a0
            btrfs_map_bio+0x18a/0x4a0 [btrfs]
            btrfs_submit_data_bio+0x104/0x1e0 [btrfs]
            submit_one_bio+0x44/0x70 [btrfs]
            extent_readahead+0x3c4/0x3f0 [btrfs]
            ? __mod_memcg_lruvec_state+0x6e/0xc0
            ? mod_lruvec_state+0x17/0x30
            ? workingset_refault+0x152/0x2c0
            read_pages+0x84/0x240
            page_cache_ra_unbounded+0x1ab/0x260
            filemap_get_pages+0xec/0x760
            filemap_read+0xbd/0x350
            new_sync_read+0x118/0x1a0
            vfs_read+0xf1/0x190
            ksys_read+0x5f/0xe0
            do_syscall_64+0x3b/0x90
            entry_SYSCALL_64_after_hwframe+0x44/0xae
            RIP: 0033:0x7fe06441410e
            RSP: 002b:00007ffe50654118 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
            RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fe06441410e
            RDX: 0000000000fff000 RSI: 00007fe06193b29f RDI: 0000000000000004
            RBP: 00007ffe50654170 R08: 0000000000fff000 R09: 00007fe062bba700
            R10: 0000000000ffeef1 R11: 0000000000000246 R12: 0000000000fff000
            R13: 00007fe063af56a8 R14: 0000000000000000 R15: 00007fe06193b028
            </TASK>

            Please note that while /var/lib/mysql was formatted as XFS, the volume which housed the SQL dump (which I imported via source /path/to/dump.sql) was btrfs, so this apparently happened while reading the dump, not while saving it to the DB data directory.

            So I formatted the drive to XFS again, reinstalled the DB, and re-imported the dump, this time with the following settings enabled:
            innodb_change_buffering=none
            innodb_flush_method=fsync
            innodb_use_native_aio=OFF

            This permitted the dump to import successfully; as far as I can tell, the data loaded correctly, although I still need to run CHECK TABLE on the database - and with nearly 700 tables, that will take some time.
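            (If useful, the bundled mariadb-check / mysqlcheck client can run that check over every table in one pass instead of issuing ~700 CHECK TABLE statements by hand; the credentials below are placeholders:)

            # runs CHECK TABLE against every table in every database
            mariadb-check --check --all-databases -u root -p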


            marko Marko Mäkelä added a comment -

            vector_gorgoth, thank you. I think that danblack has the best knowledge of the Linux kernel bugs and changes in this area, or whether the default innodb_flush_method=O_DIRECT could cause trouble on XFS (which, like btrfs, supports file system snapshots and copy-on-write). It definitely does cause trouble on btrfs and reiserfs (MDEV-28100).

            Disabling the change buffer is a good idea in any case; see MDEV-27734.

            One more work-around might be innodb_page_size=4k, if your schema is compatible with that. I do not know it for sure, but I would expect that if your drive has a physical block size of 4096 bytes or if it is an SSD (whose flash translation layer internally performs copy-on-write), then the InnoDB doublewrite buffer could be safely disabled.
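            For reference, that combination spelled out as option-file lines would look roughly like this (the [mysqld] placement is the usual convention; innodb_page_size can only be chosen when a new data directory is initialized, and disabling the doublewrite buffer is only safe under the atomic-write assumption above):

            [mysqld]
            # only takes effect for a freshly initialized data directory
            innodb_page_size=4k
            # only if the storage writes 4096-byte blocks atomically (see above)
            innodb_doublewrite=0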

            I’d like to know whether XFS works for you with both asynchronous I/O and O_DIRECT enabled. We mostly use ext4 in our internal testing, without any problems. The O_DIRECT trouble took me by surprise.
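            For that test, the corresponding option-file lines would be the following (these match the shipped defaults, so simply removing the earlier fsync/AIO overrides should amount to the same thing):

            [mysqld]
            innodb_flush_method=O_DIRECT
            innodb_use_native_aio=ON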

            vector_gorgoth K added a comment -

            For various reasons I attempted another dump/restore - as before, the volume containing the SQL dump is btrfs, the mysql data volume is XFS, and the server version is 10.7.3 on a 5.16.4 kernel. The three settings I had enabled before are still enabled:
            innodb_change_buffering=none
            innodb_flush_method=fsync
            innodb_use_native_aio=OFF

            But this time I got another hang, as before:

            INFO: task mysql:43584 blocked for more than 1208 seconds.
            Not tainted 5.16.0-5-cloud-amd64 #1 Debian 5.16.14-1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            task:mysql state stack: 0 pid:43584 ppid: 30023 flags:0x00000000
            Call Trace:
            <TASK>
            __schedule+0x30a/0x9f0
            ? preempt_count_add+0x68/0xa0
            schedule+0x4e/0xc0
            io_schedule+0x47/0x70
            blk_mq_get_tag+0x11a/0x2b0
            ? do_wait_intr_irq+0xa0/0xa0
            __blk_mq_alloc_requests+0x175/0x2d0
            blk_mq_submit_bio+0x1c9/0x710
            submit_bio_noacct+0x257/0x2a0
            btrfs_map_bio+0x18a/0x4a0 [btrfs]
            btrfs_submit_data_bio+0x104/0x1e0 [btrfs]
            submit_one_bio+0x44/0x70 [btrfs]
            extent_readahead+0x3c4/0x3f0 [btrfs]
            ? __mod_memcg_lruvec_state+0x6e/0xc0
            ? mod_lruvec_state+0x17/0x30
            ? workingset_refault+0x152/0x2c0
            read_pages+0x84/0x240
            page_cache_ra_unbounded+0x1ab/0x260
            filemap_get_pages+0xec/0x760
            filemap_read+0xbd/0x350
            new_sync_read+0x118/0x1a0
            vfs_read+0xf1/0x190
            ksys_read+0x5f/0xe0
            do_syscall_64+0x3b/0x90
            entry_SYSCALL_64_after_hwframe+0x44/0xae
            RIP: 0033:0x7ff5b4f6910e
            RSP: 002b:00007ffd76e6be58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
            RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007ff5b4f6910e
            RDX: 0000000000fff000 RSI: 00007ff5b248e2af RDI: 0000000000000004
            RBP: 00007ffd76e6beb0 R08: 0000000000fff000 R09: 00007ff5b3760920
            R10: 0000000000fff147 R11: 0000000000000246 R12: 0000000000fff000
            R13: 00007ff5b464a6a8 R14: 0000000000000000 R15: 00007ff5b248e028
            </TASK>

            I'm going to attempt again with innodb_page_size=4k enabled - just for the sake of experimentation.

            In the meantime, it appears that something is either seriously wrong with the btrfs driver (the volume itself was freshly created immediately prior to placing the SQL dump on it, on a brand new EBS volume) or with the way MariaDB reads data when parsing dumps.


            marko Marko Mäkelä added a comment -

            vector_gorgoth, thank you for the updates. How did the experiment with innodb_page_size=4k work out?

            I wonder if a recent development snapshot of 10.6 (or any of the 10.10 preview releases), which contain a fix of MDEV-13542, would avoid a crash in this case. (Sure, the data would still be inaccessible, but the process should hopefully not crash.) If you are going to try this, please be aware that a downgrade from 10.8 or later to an older major version will require special tricks, due to MDEV-14425. We do not test downgrades between major versions ourselves, and we never guarantee that they work.

            When it comes to the root cause of this, I would shift the blame to the btrfs implementation in the Linux kernel that you are using. It is possible that the bug has been fixed in a newer kernel.

            I am reassigning this to danblack, who is our operating system ‘liaison officer’.

            vector_gorgoth K added a comment -

            No combination of config options helped reliably - eventually I simply moved everything to XFS.


            People

              Assignee: Unassigned
              Reporter: vector_gorgoth K
              Votes: 0
              Watchers: 4

