MariaDB Server / MDEV-28143

Data table corruption/crashing on btrfs

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.7, 10.7.3
    • Fix Version/s: 10.6
    • Component/s: None
    • Environment: debian/devuan

    Description

      Creating a database from an SQL dump (or reading an existing table) returns various errors - 1877, "cannot open table" - or the server simply segfaults when a CHECK TABLE statement is issued.

      This has happened on multiple versions of MariaDB and on multiple machines, including during a SQL dump/restore - in which case the offending table was EMPTY on both the source and destination machine.

      Attached is a stack trace and log dump which may be of use.
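      For illustration, the failing sequence looks roughly like the following (the database and table names are placeholders; the real schema is in the attached dump):

      -- placeholder names; the restore is run from the mariadb client
      CREATE DATABASE example_db;
      USE example_db;
      -- restoring the dump intermittently fails with error 1877 or "cannot open table"
      source /path/to/dump.sql
      -- checking a restored (sometimes empty) table can crash the server
      CHECK TABLE example_table;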

      Attachments

        1. up.txt
          4.85 MB
        2. up-1.txt
          55 kB


          Activity

            vector_gorgoth K added a comment - edited

            Okay, loading the SQL dump failed the first time - so, to eliminate BTRFS as a confounding factor, I formatted the drive to XFS and attempted to load the SQL dump again into 10.7.3; it ran for several hours before hanging in a NON-KILLABLE state (it did not respond to repeated kill -9 calls), while emitting the following kernel message:

            INFO: task mysql:12211 blocked for more than 1208 seconds.
            Not tainted 5.16.0-5-cloud-amd64 #1 Debian 5.16.14-1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            task:mysql state stack: 0 pid:12211 ppid: 5267 flags:0x0000400
            Call Trace:
            <TASK>
            __schedule+0x30a/0x9f0
            ? preempt_count_add+0x68/0xa0
            schedule+0x4e/0xc0
            io_schedule+0x47/0x70
            blk_mq_get_tag+0x11a/0x2b0
            ? do_wait_intr_irq+0xa0/0xa0
            __blk_mq_alloc_requests+0x175/0x2d0
            blk_mq_submit_bio+0x1c9/0x710
            submit_bio_noacct+0x257/0x2a0
            btrfs_map_bio+0x18a/0x4a0 [btrfs]
            btrfs_submit_data_bio+0x104/0x1e0 [btrfs]
            submit_one_bio+0x44/0x70 [btrfs]
            extent_readahead+0x3c4/0x3f0 [btrfs]
            ? __mod_memcg_lruvec_state+0x6e/0xc0
            ? mod_lruvec_state+0x17/0x30
            ? workingset_refault+0x152/0x2c0
            read_pages+0x84/0x240
            page_cache_ra_unbounded+0x1ab/0x260
            filemap_get_pages+0xec/0x760
            filemap_read+0xbd/0x350
            new_sync_read+0x118/0x1a0
            vfs_read+0xf1/0x190
            ksys_read+0x5f/0xe0
            do_syscall_64+0x3b/0x90
            entry_SYSCALL_64_after_hwframe+0x44/0xae
            RIP: 0033:0x7fe06441410e
            RSP: 002b:00007ffe50654118 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
            RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fe06441410e
            RDX: 0000000000fff000 RSI: 00007fe06193b29f RDI: 0000000000000004
            RBP: 00007ffe50654170 R08: 0000000000fff000 R09: 00007fe062bba700
            R10: 0000000000ffeef1 R11: 0000000000000246 R12: 0000000000fff000
            R13: 00007fe063af56a8 R14: 0000000000000000 R15: 00007fe06193b028
            </TASK>

            Please note that while /var/lib/mysql was formatted as XFS, the volume which housed the SQL dump (which I imported via source /path/to/dump.sql) was btrfs, so this apparently happened while reading the dump, not while saving it to the DB data directory.

            So I formatted the drive to XFS again, reinstalled the DB, and re-imported the dump, this time with the following settings enabled:
            innodb_change_buffering=none
            innodb_flush_method=fsync
            innodb_use_native_aio=OFF

            This permitted the dump to import successfully; as far as I can tell, the data loaded correctly, although I still need to run CHECK TABLE on the database - and with nearly 700 tables, that will take some time.
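            (If useful, the bundled mariadb-check / mysqlcheck client can run that check over every table in one pass instead of issuing ~700 CHECK TABLE statements by hand; the credentials below are placeholders:)

            # runs CHECK TABLE against every table in every database
            mariadb-check --check --all-databases -u root -p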


            marko Marko Mäkelä added a comment -

            vector_gorgoth, thank you. I think that danblack has the best knowledge of the Linux kernel bugs and changes in this area, or whether the default innodb_flush_method=O_DIRECT could cause trouble on XFS (which, like btrfs, supports file system snapshots and copy-on-write). It definitely does cause trouble on btrfs and reiserfs (MDEV-28100).

            Disabling the change buffer is a good idea in any case; see MDEV-27734.

            One more work-around might be innodb_page_size=4k, if your schema is compatible with that. I do not know it for sure, but I would expect that if your drive has a physical block size of 4096 bytes or if it is an SSD (whose flash translation layer internally performs copy-on-write), then the InnoDB doublewrite buffer could be safely disabled.
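            For reference, that combination spelled out as option-file lines would look roughly like this (the [mysqld] placement is the usual convention; innodb_page_size can only be chosen when a new data directory is initialized, and disabling the doublewrite buffer is only safe under the atomic-write assumption above):

            [mysqld]
            # only takes effect for a freshly initialized data directory
            innodb_page_size=4k
            # only if the storage writes 4096-byte blocks atomically (see above)
            innodb_doublewrite=0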

            I’d like to know whether XFS works for you with both asynchronous I/O and O_DIRECT enabled. We mostly use ext4 in our internal testing, without any problems. The O_DIRECT trouble took me by surprise.
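            For that test, the corresponding option-file lines would be the following (these match the shipped defaults, so simply removing the earlier fsync/AIO overrides should amount to the same thing):

            [mysqld]
            innodb_flush_method=O_DIRECT
            innodb_use_native_aio=ON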

            vector_gorgoth K added a comment -

            For various reasons I attempted another dump/restore - as before, the volume containing the SQL dump is btrfs, the mysql data volume is XFS, and the server version is 10.7.3 on a 5.16.4 kernel. The three settings I had enabled before are still enabled:
            innodb_change_buffering=none
            innodb_flush_method=fsync
            innodb_use_native_aio=OFF

            But this time I got another hang, as before:

            INFO: task mysql:43584 blocked for more than 1208 seconds.
            Not tainted 5.16.0-5-cloud-amd64 #1 Debian 5.16.14-1
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            task:mysql state stack: 0 pid:43584 ppid: 30023 flags:0x00000000
            Call Trace:
            <TASK>
            __schedule+0x30a/0x9f0
            ? preempt_count_add+0x68/0xa0
            schedule+0x4e/0xc0
            io_schedule+0x47/0x70
            blk_mq_get_tag+0x11a/0x2b0
            ? do_wait_intr_irq+0xa0/0xa0
            __blk_mq_alloc_requests+0x175/0x2d0
            blk_mq_submit_bio+0x1c9/0x710
            submit_bio_noacct+0x257/0x2a0
            btrfs_map_bio+0x18a/0x4a0 [btrfs]
            btrfs_submit_data_bio+0x104/0x1e0 [btrfs]
            submit_one_bio+0x44/0x70 [btrfs]
            extent_readahead+0x3c4/0x3f0 [btrfs]
            ? __mod_memcg_lruvec_state+0x6e/0xc0
            ? mod_lruvec_state+0x17/0x30
            ? workingset_refault+0x152/0x2c0
            read_pages+0x84/0x240
            page_cache_ra_unbounded+0x1ab/0x260
            filemap_get_pages+0xec/0x760
            filemap_read+0xbd/0x350
            new_sync_read+0x118/0x1a0
            vfs_read+0xf1/0x190
            ksys_read+0x5f/0xe0
            do_syscall_64+0x3b/0x90
            entry_SYSCALL_64_after_hwframe+0x44/0xae
            RIP: 0033:0x7ff5b4f6910e
            RSP: 002b:00007ffd76e6be58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
            RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007ff5b4f6910e
            RDX: 0000000000fff000 RSI: 00007ff5b248e2af RDI: 0000000000000004
            RBP: 00007ffd76e6beb0 R08: 0000000000fff000 R09: 00007ff5b3760920
            R10: 0000000000fff147 R11: 0000000000000246 R12: 0000000000fff000
            R13: 00007ff5b464a6a8 R14: 0000000000000000 R15: 00007ff5b248e028
            </TASK>

            I'm going to attempt again with innodb_page_size=4k enabled - just for the sake of experimentation.

            In the meantime, it appears that something is either seriously wrong with the btrfs driver (the volume itself was freshly created immediately prior to placing the SQL dump on it, on a brand new EBS volume) or with the way MariaDB reads data when parsing dumps.


            marko Marko Mäkelä added a comment -

            vector_gorgoth, thank you for the updates. How did the experiment with innodb_page_size=4k work out?

            I wonder if a recent development snapshot of 10.6 (or any of the 10.10 preview releases), which contain a fix of MDEV-13542, would avoid a crash in this case. (Sure, the data would still be inaccessible, but the process should hopefully not crash.) If you are going to try this, please be aware that a downgrade from 10.8 or later to an older major version will require special tricks, due to MDEV-14425. We do not test downgrades between major versions ourselves, and we never guarantee that they work.

            When it comes to the root cause of this, I would shift the blame to the btrfs implementation in the Linux kernel that you are using. It is possible that the bug has been fixed in a newer kernel.

            I am reassigning this to danblack, who is our operating system ‘liaison officer’.

            vector_gorgoth K added a comment -

            No combination of config options helped reliably - eventually I simply moved everything to XFS.


            People

              Assignee: Unassigned
              Reporter: vector_gorgoth K
              Votes: 0
              Watchers: 4

