Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23599

btrfs coming to Fedora 33 - soon October 2020

Details

    • Task
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Fix
    • N/A
    • N/A
    • None

    Description

      btrfs is coming to Fedora 33
      https://fedoramagazine.org/btrfs-coming-to-fedora-33/

      Copy on Write file-systems normally aren't great with regard to databases, so lets be prepared for when this happens we need to examine:

      • What are performance differences with btrfs compared to ext3/xfs?
      • Are the MariaDB options that can assist with performance (disable double write buffer in default)?
      • Are there packaging options (like a btrfs subvolume for /var/lib/mysql with specific set of optimized parameters)
      • Are separate volumes needed/desired for redo, undo, binlog?
      • Can the aspects of the btrfs volume be auto-detected and some auto-tuning of mariadb to its configuration?
      • Do features like FIDEDUPERANGE need to be explored for Alter table copy table schematics?
      • Is snapshot consistent and suitable as a backup and/or SST mechanism?

      FYI: mschorm axel

      Attachments

        Issue Links

          Activity

            danblack Daniel Black added a comment - Arch linux recommend disabling the COW for /var/lib/mysql

            To improve write performance, one might want to disable copy-on-write on the circular InnoDB write-ahead log file. But, would that ruin the key feature (snapshot-based backups)? Perhaps it would be better to tune innodb_log_file_size so that it works well. Could there be some fcntl() for enabling and disabling the copy-on-write? For the log, it should be sufficient to offer snapshots at each log checkpoint, and allow any other writes to ‘overwrite’ the latest snapshot. For this to work, we might need a parameter ‘recover the log at most to this LSN’ in order to get a fully consistent snapshot. Or we could perform something like FLUSH TABLES WITH READ LOCK to make everything consistent with the latest LSN.

            In https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp there is a suggestion to try the nossd option even on SSD, to reduce the size of individual write requests.

            Some users may be surprised of ‘out of space’ errors that the copy-on-write may increase.

            Side note: InnoDB BLOBs as well as secondary indexes use a copy-on-write mechanism inside the data files.

            marko Marko Mäkelä added a comment - To improve write performance, one might want to disable copy-on-write on the circular InnoDB write-ahead log file. But, would that ruin the key feature (snapshot-based backups)? Perhaps it would be better to tune innodb_log_file_size so that it works well. Could there be some fcntl() for enabling and disabling the copy-on-write? For the log, it should be sufficient to offer snapshots at each log checkpoint, and allow any other writes to ‘overwrite’ the latest snapshot. For this to work, we might need a parameter ‘recover the log at most to this LSN’ in order to get a fully consistent snapshot. Or we could perform something like FLUSH TABLES WITH READ LOCK to make everything consistent with the latest LSN. In https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp there is a suggestion to try the nossd option even on SSD, to reduce the size of individual write requests. Some users may be surprised of ‘out of space’ errors that the copy-on-write may increase. Side note: InnoDB BLOBs as well as secondary indexes use a copy-on-write mechanism inside the data files.

            Are there packaging options (like a btrfs subvolume for /var/lib/mysql with specific set of optimized parameters)

            There may be a good argument for an Aria-default focused package (as an alt) which does exactly this to keep /var/lib/mysql/ enabled with datacow and to keep the transactional log files on nodatacow volumes.

            For the mainline packages though where InnoDB is the expected default, nodatacow should be used across the board unless direct collaboration/coordination between BTRFS and MariaDB developers yields new BTRFS parameters which apply datacow more intelligently (not a knock on datacow here, just the nature of what datacow is trying to do versus what InnoDB is already trying to do).

            All that said, I have run InnoDB successfully on CentOS 7 servers where I installed BTRFS as the only file system and left datacow enabled for a /var/lib/mysql/ subvolume. Which brings us to-

            Is snapshot consistent and suitable as a backup and/or SST mechanism?

            Yes, to the point where the users we do have who are currently using BTRFS with datacow enabled for their MariaDB directory (and who do not shift critical transaction/log files to other subvolumes) are told to NOT use mariabackup and TO USE BTRFS snapshot instead. In short, what mariabackup does is a complicated attempt to replicate what true, ACOW filesystems like BTRFS do inherently.

            And while this does need testing for MariaDB to make formal recommendations to customers who have not already decided they are going to run MariaDB like this, my personal experience suggests that for "lighter" use-cases or use-cases where customers are comfortable over-speccing hardware (ex- better/faster drives, more RAM, and more CPU cores than otherwise needed for the workload), running MariaDB on datacow subvolumes is completely viable and the flexibility gained from snapshots is enormous (it really makes recovery from corruption and other issues a breeze).

            There is actually a really special use-case we should be looking deeper into which is for replication clusters, to have the primary server use nodatacow, but for replicas to use datacow and to be responsible for backups. Many of the techniques we already recommend for conservative parallel replication are applicable towards ways nodatacow primaries can relay groups of events to datacow replicas which should reduce the performance impact of datacow being enabled on those replicas, while the overall benefits gained from datacow being enabled allow those replicas to enable enhanced recovery scenarios. Of note is just that during failover/switchover, a quick remount would need to be done of the replica being promoted so its subvolume switches to nodatacow at that time- this is something MaxScale's MariaDB Monitor could coordinate with the existing capability to execute customer scripts on demotion and promotion.

            There are also possibly use-cases for the datacow replica scenario for Galera and Xpand scale-out. Such nodes would need to be treated differently than regular nodes, but the point of including them in such a cluster would be to enable simple scale-out via btrfs snapshot ... btrfs send ... btrfs receive to efficiently get consistent subvols to new nodes (which can then mount those received snapshots with nodatacow).

            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) added a comment - Are there packaging options (like a btrfs subvolume for /var/lib/mysql with specific set of optimized parameters) There may be a good argument for an Aria-default focused package (as an alt) which does exactly this to keep /var/lib/mysql/ enabled with datacow and to keep the transactional log files on nodatacow volumes. For the mainline packages though where InnoDB is the expected default, nodatacow should be used across the board unless direct collaboration/coordination between BTRFS and MariaDB developers yields new BTRFS parameters which apply datacow more intelligently (not a knock on datacow here, just the nature of what datacow is trying to do versus what InnoDB is already trying to do). All that said, I have run InnoDB successfully on CentOS 7 servers where I installed BTRFS as the only file system and left datacow enabled for a /var/lib/mysql/ subvolume. Which brings us to- Is snapshot consistent and suitable as a backup and/or SST mechanism? Yes, to the point where the users we do have who are currently using BTRFS with datacow enabled for their MariaDB directory (and who do not shift critical transaction/log files to other subvolumes) are told to NOT use mariabackup and TO USE BTRFS snapshot instead. In short, what mariabackup does is a complicated attempt to replicate what true, ACOW filesystems like BTRFS do inherently. And while this does need testing for MariaDB to make formal recommendations to customers who have not already decided they are going to run MariaDB like this, my personal experience suggests that for "lighter" use-cases or use-cases where customers are comfortable over-speccing hardware (ex- better/faster drives, more RAM, and more CPU cores than otherwise needed for the workload), running MariaDB on datacow subvolumes is completely viable and the flexibility gained from snapshots is enormous (it really makes recovery from corruption and other issues a breeze). There is actually a really special use-case we should be looking deeper into which is for replication clusters, to have the primary server use nodatacow , but for replicas to use datacow and to be responsible for backups. Many of the techniques we already recommend for conservative parallel replication are applicable towards ways nodatacow primaries can relay groups of events to datacow replicas which should reduce the performance impact of datacow being enabled on those replicas, while the overall benefits gained from datacow being enabled allow those replicas to enable enhanced recovery scenarios. Of note is just that during failover/switchover, a quick remount would need to be done of the replica being promoted so its subvolume switches to nodatacow at that time- this is something MaxScale's MariaDB Monitor could coordinate with the existing capability to execute customer scripts on demotion and promotion. There are also possibly use-cases for the datacow replica scenario for Galera and Xpand scale-out. Such nodes would need to be treated differently than regular nodes, but the point of including them in such a cluster would be to enable simple scale-out via btrfs snapshot ... btrfs send ... btrfs receive to efficiently get consistent subvols to new nodes (which can then mount those received snapshots with nodatacow ).

            People

              Unassigned Unassigned
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.