MariaDB Server / MDEV-25734

mbstream breaks page compression on XFS

Details

    Description

      InnoDB page compression relies on “hole punching” (sparse files), but this appears to be lost somewhere along the way: the mariabackup SST method takes page-compressed data from the donor and leaves it decompressed on the joiner node. Since mariabackup SST does not copy the files as sparse files, page compression provides no benefit in this case.

      MariaDB [tt]> show create table t1\G
      *************************** 1. row ***************************
             Table: t1
      Create Table: CREATE TABLE `t1` (
        `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        `msg` varchar(100) DEFAULT NULL,
        UNIQUE KEY `id` (`id`)
      ) ENGINE=InnoDB AUTO_INCREMENT=34209256 DEFAULT CHARSET=latin1 `PAGE_COMPRESSED`='ON'
      1 row in set (0.005 sec)
       
      +--------------+------------+------------+
      | TABLE_SCHEMA | TABLE_NAME | ROW_FORMAT |
      +--------------+------------+------------+
      | tt           | t1         | Dynamic    |
      +--------------+------------+------------+
       
      MariaDB [tt]> show global variables like '%compression%';
      +------------------------------------------+------------------+
      | Variable_name                            | Value            |
      +------------------------------------------+------------------+
      | column_compression_threshold             | 100              |
      | column_compression_zlib_level            | 6                |
      | column_compression_zlib_strategy         | DEFAULT_STRATEGY |
      | column_compression_zlib_wrap             | OFF              |
      | innodb_compression_algorithm             | zlib             |
      | innodb_compression_default               | ON               |
      | innodb_compression_failure_threshold_pct | 5                |
      | innodb_compression_level                 | 1                |
      | innodb_compression_pad_pct_max           | 50               |
      +------------------------------------------+------------------+
      9 rows in set (0.001 sec)
      

      Page compression is enabled. The size of the tablespace file is:

      $ ls -lrth /var/lib/mysql/tt/t1.ibd
      2.1G /var/lib/mysql/tt/t1.ibd
       
      $ du -hs --block-size=1 /var/lib/mysql/tt/t1.ibd
      1098792960	/var/lib/mysql/tt/t1.ibd
       
      $ du -hs /var/lib/mysql/tt/t1.ibd
      1.1G	/var/lib/mysql/tt/t1.ibd
      

      Now, let's try mariabackup:

      mariabackup --backup --target-dir=/home/vagrant/backup/ -u root -p'xxxx'
      .
      [00] 2021-05-19 20:08:13 completed OK!
      

      Let's check the size of the t1 tablespace file in the backup:

      $ ls -la /home/vagrant/backup/tt/t1.ibd
      -rw-r-----. 1 root root 2231369728 May 19 20:08 /home/vagrant/backup/tt/t1.ibd
       
      $ du -hs --block-size=1 /home/vagrant/backup/tt/t1.ibd
      2214154240	/home/vagrant/backup/tt/t1.ibd
       
      $ du -hs /home/vagrant/backup/tt/t1.ibd
      2.1G	/home/vagrant/backup/tt/t1.ibd
      

      As per https://dev.mysql.com/doc/refman/5.7/en/innodb-page-compression.html:
      "Preserving page compression when moving a page-compressed tablespace file from one host to another requires a utility that preserves sparse files."

      I think this is related to the xtrabackup report https://jira.percona.com/browse/PXB-1557.
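
      For reference, a minimal sketch (not part of the original report; the file path is just the example above) of how the "apparent size vs. allocated size" comparison done with ls and du can be performed programmatically:

      #include <stdio.h>
      #include <sys/stat.h>

      /* Print the apparent size of a file and the space it actually occupies
         on disk; a page-compressed tablespace whose holes were preserved
         should show a much smaller allocated size. */
      int main(int argc, char **argv)
      {
        const char *path = argc > 1 ? argv[1] : "/var/lib/mysql/tt/t1.ibd";
        struct stat st;
        if (stat(path, &st) != 0)
        {
          perror("stat");
          return 1;
        }
        /* st_blocks is counted in 512-byte units regardless of the block size */
        printf("%s: apparent %lld bytes, allocated %lld bytes\n",
               path, (long long) st.st_size, (long long) st.st_blocks * 512);
        return 0;
      }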


          Activity

            Schulze Christian added a comment -

            mariabackup in 10.5 also can't handle sparse files.
            I didn't test 10.6.


            marko Marko Mäkelä added a comment -

            thiru, you fixed something similar in MDEV-15527 in IMPORT TABLESPACE.


            marko Marko Mäkelä added a comment -

            In 10.6, MDEV-26029 disables sparse files on purpose on thinly provisioned storage (computational storage device), specifically ScaleFlux. We must not break that logic when merging this fix to 10.6.


            thiru Thirunarayanan Balathandayuthapani added a comment -

            MariaDB [(none)]> use test;
            Reading table information for completion of table and column names
            You can turn off this feature to get a quicker startup with -A
             
            Database changed
            MariaDB [test]> show create table t1;
            +-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table                                                                                                                                                                                                           |
            +-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t1    | CREATE TABLE `t1` (
              `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
              `msg` varchar(100) DEFAULT NULL,
              UNIQUE KEY `id` (`id`)
            ) ENGINE=InnoDB AUTO_INCREMENT=96860716 DEFAULT CHARSET=latin1 `page_compressed`=1 |
            +-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            1 row in set (0.000 sec)
             
            MariaDB [test]> select count(*) from t1;
            +----------+
            | count(*) |
            +----------+
            | 67108864 |
            +----------+
            1 row in set (43.579 sec)
             
            MariaDB [test]>  show global variables like '%compression%';
            +------------------------------------------+------------------+
            | Variable_name                            | Value            |
            +------------------------------------------+------------------+
            | column_compression_threshold             | 100              |
            | column_compression_zlib_level            | 6                |
            | column_compression_zlib_strategy         | DEFAULT_STRATEGY |
            | column_compression_zlib_wrap             | OFF              |
            | innodb_compression_algorithm             | zlib             |
            | innodb_compression_default               | ON               |
            | innodb_compression_failure_threshold_pct | 5                |
            | innodb_compression_level                 | 1                |
            | innodb_compression_pad_pct_max           | 50               |
            +------------------------------------------+------------------+
             
            Current file size:
            ===================
             
            thiru@rr:~/server/10.3/brl/data/test$ ls -lsrth ./t1.ibd 
            3.4G -rw-rw---- 1 thiru dev 6.0G Sep  8 12:30 ./t1.ibd
             
            thiru@rr:~/server/10.3/brl/data/test$ du -hs ./t1.ibd 
            3.4G	./t1.ibd
             
             
            Running the mariabackup with the following command:
            ===================================================
            ./mariabackup --target-dir=$PWD/backup --backup --socket=$PWD/../../data/mysql.sock --user=root
             
             
            Checking the file size in target directory:
            ===============================================
             
            thiru@rr:~/server/10.3/brl/extra/mariabackup/backup/test$ ls -lsrth ./t1.ibd 
            3.4G -rw-r----- 1 thiru dev 6.0G Sep  8 12:34 ./t1.ibd
             
            thiru@rr:~/server/10.3/brl/extra/mariabackup/backup/test$ du -hs ./t1.ibd 
            3.4G	./t1.ibd
            

            I don't see any uncompressed table in the target directory.

            thiru@rr:~/server/10.3/brl/data/test$ cat /proc/version
            Linux version 5.4.0-62-generic (buildd@lgw01-amd64-045) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
            

            Please verify which mariabackup command the user ran. Is this issue specific to CentOS? Please also find out which filesystem they were using.

            euglorg Eugene added a comment -

            We hit a similar issue.
            Our setup is Linux 5.10.67-gentoo, with 3 identical servers running exactly the same MariaDB configuration. The MariaDB version is mariadb-10.4.21.
            All our tables have similar definitions with ENGINE=InnoDB ... 'PAGE_COMPRESSED'='ON'.
            We also have compression enabled by default for new tables:

            innodb_compression_algorithm		= snappy
            innodb_compression_default		= ON
            

            and compression settings like:

            MariaDB [(none)]> show global variables like '%compression%';
            +------------------------------------------+------------------+
            | Variable_name                            | Value            |
            +------------------------------------------+------------------+
            | column_compression_threshold             | 100              |
            | column_compression_zlib_level            | 6                |
            | column_compression_zlib_strategy         | DEFAULT_STRATEGY |
            | column_compression_zlib_wrap             | OFF              |
            | innodb_compression_algorithm             | snappy           |
            | innodb_compression_default               | ON               |
            | innodb_compression_failure_threshold_pct | 5                |
            | innodb_compression_level                 | 6                |
            | innodb_compression_pad_pct_max           | 50               |
            +------------------------------------------+------------------+
            

            The issue was noticed when OPTIMIZE and several migrations were performed on a node running standalone: the dataset size shrank significantly. The data was then copied with SST to the other nodes, and the SST took much longer than expected given the connection speed and the dataset size on the source node. The reason is that the dataset size changed dramatically during SST, and the streamed volume appeared to match the uncompressed data size.
            The result is:

            Node3
            ---------------
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/sda8      3.5T  2.5T  1.1T  71% /var/lib/mysql
            ---------------
            Node2
            ---------------
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/sda8      3.5T  2.5T  1.1T  71% /var/lib/mysql
            ---------------
            Node1
            ---------------
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/sda8      3.5T  1.7T  1.9T  49% /var/lib/mysql
            

            On comparing tablespace file sizes, it was found that every .ibd file is bigger on Node3 and Node2 than on Node1; however, the data size in information_schema is identical for the same table on the source server and the replicas, meaning the tablespaces are not compressed on the replicas while they are on the source.
            Also, on examining the backup taken on Node1 with mariabackup, it was found that the tablespaces inside the backup have the same size as on the replicating servers (Node3 and Node2). Say a table has a size of 25GB on the source node: it has a size of 40GB on both replicas and in the backup copy, while the source node still operates on a compressed 25GB table.
            The [SST] section in the MySQL configuration file has no bearing on this:

            [SST]
            compressor='/usr/bin/zstd --fast=3'
            decompressor='/usr/bin/zstd -d'
            

            The mariabackup command used to make a local backup is:
            mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | /usr/bin/zstd --fast -T16 -q -o /mysql_backup.zst
            Again, on uncompressing the archive mysql_backup.zst, it was noticed that the data it contains is 30% bigger than the backup source directory.
            This all means that tablespace data is streamed uncompressed even when performing a local backup, and the same thing happens when using mariabackup for SST, so no compression remains on the replica node.


            muhammad.irfan Muhammad Irfan added a comment -

            thiru It appears on CentOS 7/8, and it happens on the XFS filesystem during mariabackup SST.


            thiru Thirunarayanan Balathandayuthapani added a comment -

            I wrote a sample program that does the create, write and seek system calls:

            #include<unistd.h>
            #include<fcntl.h>
            #include<stdio.h>
            #include<string.h>
             
            #define DATA_SIZE 16384
             
            int main()
            {
              /* Variable to store user content */
              char data[DATA_SIZE];
              int data_length[8]= {16384, 109, 244, 249, 7346, 7078, 7146, 7839};
              /* mode argument added; O_CREAT requires one */
              int fd = open("./f1.txt", O_RDWR | O_CREAT, 0644);
             
              memset(data, 'a', DATA_SIZE);
              for (int i= 0; i < 8; i++)
              {
                int punch_hole_len= DATA_SIZE - data_length[i];
                write(fd, data, data_length[i]);
                if (punch_hole_len)
                    lseek(fd, punch_hole_len, SEEK_CUR);
              }
             
              close(fd);
              printf("File created and saved successfully \n");
              return 0;
            }
            

            On an ext4 filesystem, the output file `f1.txt` gets the following allocated size:

            [~/mariarepo/server/10.7/10.7-sample]$df -T ./
            Filesystem     Type 1K-blocks      Used Available Use% Mounted on
            /dev/nvme0n1p7 ext4 364903200 326044404  20253024  95% /home
             
            [~/mariarepo/server/10.7/10.7-sample]$ls -ls f1.txt 
            64 ----r-x--T 1 thiru thiru 122527 Sep 22 17:46 f1.txt
             
            [~/mariarepo/server/10.7/10.7-sample]$du -hs f1.txt 
            64K	f1.txt
            
            

            On an XFS filesystem:

            [/data/volumes/xfs32m]$df -T ./
            Filesystem     Type 1K-blocks  Used Available Use% Mounted on
            /dev/loop3     xfs      29348  1904     27444   7% /data/volumes/xfs32m
             
            [/data/volumes/xfs32m]$ls -ls ./f1.txt 
            76 -rwx--s--T 1 root root 122527 Sep 22 17:47 ./f1.txt
             
            [/data/volumes/xfs32m]$du -hs ./f1.txt 
            76K	./f1.txt
            

            AFAIK, Mariabackup doesn't do anything wrong logically. Mariabackup preserves punched holes for the page-compressed format (MDEV-13023).
            I think the problem lies within the file system.
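
            A minimal sketch, assuming Linux SEEK_HOLE/SEEK_DATA support (kernel 3.1+), for checking whether the filesystem actually recorded any holes in the f1.txt produced by the test program above:

            #define _GNU_SOURCE           /* for SEEK_HOLE / SEEK_DATA */
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Walk the file and report every hole the filesystem recorded.
               If nothing is printed, the seeks were not turned into sparse regions. */
            int main(void)
            {
              int fd = open("./f1.txt", O_RDONLY);
              if (fd < 0) { perror("open"); return 1; }
              off_t end = lseek(fd, 0, SEEK_END);
              off_t pos = 0;
              while (pos < end)
              {
                off_t hole = lseek(fd, pos, SEEK_HOLE);
                if (hole < 0 || hole >= end)
                  break;                              /* no further holes */
                off_t data = lseek(fd, hole, SEEK_DATA);
                if (data < 0)
                  data = end;                         /* hole extends to EOF */
                printf("hole at %lld..%lld\n", (long long) hole, (long long) data);
                pos = data;
              }
              close(fd);
              return 0;
            }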

            euglorg Eugene added a comment -

            Sorry, but this doesn't explain the case where the same issue happens while streaming data to stdout and compressing it on the fly with mariabackup --stream=xbstream | zstd -o...
            Files inside the archive (which is filesystem-independent) are bigger than they are in the source /var/lib/mysql directory. So mariadb handles them incorrectly while reading and streaming them anywhere, regardless of the filesystem they will eventually be written to.


            thiru Thirunarayanan Balathandayuthapani added a comment -

            euglorg yes, MDEV-13023 doesn't deal with the compressed file while streaming. Thanks for your analysis. I will treat this issue as Mariabackup not punching holes in the file during streaming.


            marko Marko Mäkelä added a comment -

            As far as I understand, the problem is specifically with the Galera snapshot transfer (SST). Either something in the SST script or the mariabackup options that it invokes needs to be changed.

            euglorg Eugene added a comment -

            Marko, sorry, but the problem seems to be not just with SST: it also happens when mariabackup is used for creating a backup with mariabackup --stream=xbstream | zstd -o...; there's an example of the complete command line with all the arguments above. Again, compression in the archived data sent through the pipe as xbstream is also broken. As SST relies on mariabackup, the problem was simply noticed first on SST, but upon investigating backups it became clear that it happens with mariabackup itself.

            wlad Vladislav Vaintroub added a comment - - edited

            euglorg - why do you uncompress the stream? What does that prove?
            Compressing the stream lightly gives you the ability to save on some binary zeros, which are actually, and undeniably, in the stream.
            Uncompressing the compressed stream can be combined, via a pipe, with mbstream -x to actually create a data directory. That step creates the files with the holes. So you have a compressed stream, and a compressed (with holes) data directory.

            That the uncompressed stream does not have holes, yes, we know about that, and thus the suggestion is to compress it. Lightly, so that larger runs of binary zeros are omitted.

            If you would like to point out that mbstream -x does not create holes on some filesystem, I would think there is something wrong with that filesystem. mariabackup makes an effort to create holes portably, with the original Unix APIs from the 1970s: seek() behind the end of the file, and write() there. This popular answer from stackoverflow describes how to programmatically create holes, and this is exactly what mariabackup does.
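
            For illustration, a minimal sketch (file name and sizes invented) of the traditional pattern described above, seeking past the current end of the file and then writing:

            #include <fcntl.h>
            #include <string.h>
            #include <unistd.h>

            /* Write a 4 KiB block, skip 12 KiB by seeking past the end of the
               file, then write the next block.  On most filesystems the skipped
               range becomes a hole and occupies no disk space. */
            int main(void)
            {
              char page[4096];
              memset(page, 'x', sizeof page);

              int fd = open("sparse_demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
              if (fd < 0)
                return 1;

              write(fd, page, sizeof page);     /* data: bytes 0..4095        */
              lseek(fd, 12288, SEEK_CUR);       /* intended hole: 4096..16383 */
              write(fd, page, sizeof page);     /* data: bytes 16384..20479   */

              close(fd);
              return 0;
            }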

            euglorg Eugene added a comment - - edited

            Vladislav, sorry for the misunderstanding; let me explain the problem step by step.

            First, your question is a bit unclear, since there is no mention of stream decompression in the command I provided:

            mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | ...
            

            It is expected that the data received from the pipe will have its pages compressed, isn't it?
            Here the commands to the right of the pipe are omitted: the uncompressed stream is sent through the pipe, while there is no explicit mention of decompressing tablespace data in either the command or the related help page explanations (this has no relation to compressing and uncompressing the stream sent through socat while performing SST, for sure).
            To make the case clear, the full backup-restore command set would be:

            #create backup archive
            mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | /usr/bin/zstd --fast -T16 -q -o /mysql_backup.zst
            #uncompress the archive
            zstd -d -c  /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp
            #prepare the backup
            mariabackup --prepare --innodb-doublewrite --target-dir /mnt/mariabackuptmp
            

            but the second command always produces a dataset with uncompressed tablespace data. Right now we use XFS. If you are sure this is a filesystem problem, would you expect the data to have the same size in case I uncompress the archived backup data onto ext4?
            Please give me a couple of days, I'll try to extract the archived backup onto ext4... If it has the same size as the original data, then the problem may be within the FS, and there should be an explicit note in the documentation that MariaDB cannot keep tablespace page compression when mariabackup is used for SST and the data resides on an XFS partition.
            However, in case zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp (see above how it was created) also produces a dataset with uncompressed tablespaces on ext4 (meaning the command chain mariabackup | mbstream -x creates a dataset that is 30% bigger than expected), this would mean that the data sent through the pipe already has broken compression, as at that point it has not been written to any filesystem.

            If the above logic is wrong, please let me know.


            marko Marko Mäkelä added a comment -

            euglorg, did you try the test program that thiru posted? It produces different results on ext4 and XFS.

            As far as I can tell, the function write_compressed() in extra/mariabackup/ds_local.cc is creating sparse files in the traditional way: by writing and seeking past the current end of the file.

            We might try to convince XFS harder to create sparse files, by invoking a Linux-specific fallocate() system call, to make it closer to the way writes from the InnoDB buffer pool work.
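
            A minimal sketch of the Linux-specific call mentioned above (the file name is invented; this only demonstrates the system call, not where it would be added in ds_local.cc):

            #define _GNU_SOURCE            /* for fallocate() and the FALLOC_FL_* flags */
            #include <fcntl.h>
            #include <linux/falloc.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            int main(void)
            {
              char page[16384];
              memset(page, 'x', sizeof page);

              int fd = open("punch_demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
              if (fd < 0) { perror("open"); return 1; }
              write(fd, page, sizeof page);    /* a fully allocated 16 KiB page */

              /* Explicitly deallocate the tail of the page, as if only the first
                 4 KiB held compressed data.  FALLOC_FL_KEEP_SIZE keeps the
                 apparent file size at 16 KiB. */
              if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            4096, sizeof page - 4096) != 0)
                perror("fallocate(FALLOC_FL_PUNCH_HOLE)");

              close(fd);
              return 0;
            }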


            marko Marko Mäkelä added a comment -

            I found a question that suggests that this is an XFS design feature:
            https://stackoverflow.com/questions/6940516/create-sparse-file-with-alternate-data-and-hole-on-ext3-and-xfs

            This one may be relevant too:
            https://serverfault.com/questions/406069/why-are-my-xfs-filesystems-suddenly-consuming-more-space-and-full-of-sparse-file

            If XFS is on purpose so reluctant to create sparse files with tiny blocks when they are created in the traditional POSIX way (seeking and writing) as opposed to FALLOC_FL_PUNCH_HOLE, maybe it is not an ideal match for the page_compressed tables.


            wlad Vladislav Vaintroub added a comment -

            Here is another link, http://web.archive.org/web/20171010154045/http://oss.sgi.com/archives/xfs/2012-02/msg00413.html, with a discussion of XFS mount options. Disclaimer: I did not try that.


            wlad Vladislav Vaintroub added a comment -

            It is probably helpful to punch holes on XFS explicitly, with FALLOC_FL_PUNCH_HOLE, after determining that the file system is XFS.
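
            A minimal sketch of how such a filesystem check could look on Linux (XFS_SUPER_MAGIC comes from <linux/magic.h>; this is only an illustration, not the proposed patch):

            #include <fcntl.h>
            #include <linux/magic.h>          /* XFS_SUPER_MAGIC */
            #include <stdio.h>
            #include <sys/vfs.h>              /* fstatfs() */
            #include <unistd.h>

            /* Return 1 if fd lives on XFS, 0 if not, -1 on error. */
            static int is_on_xfs(int fd)
            {
              struct statfs sfs;
              if (fstatfs(fd, &sfs) != 0)
                return -1;
              return sfs.f_type == XFS_SUPER_MAGIC;
            }

            int main(void)
            {
              int fd = open(".", O_RDONLY);
              printf("current directory on XFS: %d\n", is_on_xfs(fd));
              close(fd);
              return 0;
            }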

            euglorg Eugene added a comment -

            Marko, I haven't tried that test. Instead, I tried to extract the backup archive onto an ext4 partition, and I can confirm that the issue happens on XFS only.
            The case looks as follows:

            • Original dataset with page compression has size of 1.7T
            • After running `mariabackup --stream=xbstream | /usr/bin/zstd -o /mysql_backup.zst && zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp` where /mnt/mariabackuptmp is XFS partition, dataset inside /mnt/mariabackuptmp has size of 2.5T
            • After running `mariabackup --stream=xbstream | /usr/bin/zstd -o /mysql_backup.zst && zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp` where /mnt/mariabackuptmp is ext4fs partition, dataset inside /mnt/mariabackuptmp has size of 1.7T

            The problem is that XFS has better performance for multithreaded IO, which was the reason for using it. Probably Vladislav is right and XFS requires specific handling if page compression is in use.


            wlad Vladislav Vaintroub added a comment -

            Perhaps fallocate --dig-holes on the backup files could help after decompression, as a workaround.
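
            For illustration, a minimal sketch of the dig-holes idea: scan the file in fixed-size chunks and punch a hole wherever a chunk is entirely zero. This is only an approximation of what util-linux fallocate --dig-holes does, with a hard-coded 4 KiB chunk size:

            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <linux/falloc.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            #define CHUNK 4096

            int main(int argc, char **argv)
            {
              if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

              int fd = open(argv[1], O_RDWR);
              if (fd < 0) { perror("open"); return 1; }

              static const char zeros[CHUNK];   /* zero-initialized reference chunk */
              char buf[CHUNK];
              off_t offset = 0;
              ssize_t n;

              /* Replace every all-zero chunk with an explicit hole. */
              while ((n = pread(fd, buf, CHUNK, offset)) > 0)
              {
                if (n == CHUNK && memcmp(buf, zeros, CHUNK) == 0)
                  fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, CHUNK);
                offset += n;
              }

              close(fd);
              return 0;
            }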

            euglorg Eugene added a comment -

            But this workaround is probably far from a good option for SST; it would make the transfer much slower.

            wlad Vladislav Vaintroub added a comment - - edited

            The xbstream/mbstream format is currently not optimized for holes, hence the suggestion to use lightweight stream compression. Lightweight means it would be capable of detecting and compressing large runs of 0 bytes, yet would not try too hard to compress already compressed data. zlib with level 1 might be a good candidate.

            tanj Guillaume Lefranc added a comment - - edited

            Just as a side note, punching holes with fallocate on XFS is slow:
            https://issues.apache.org/jira/browse/KUDU-2052
            So one possible solution could be to patch mariabackup so that, if XFS is detected, XFS_IOC_UNRESVSP64 is used to punch holes in files.
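
            As a sketch only, assuming the xfsprogs development headers (<xfs/xfs.h>) provide XFS_IOC_UNRESVSP64 and struct xfs_flock64 as documented in xfsctl(3), the ioctl-based variant might look roughly like this:

            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <sys/ioctl.h>
            #include <unistd.h>
            #include <xfs/xfs.h>    /* assumption: XFS_IOC_UNRESVSP64, struct xfs_flock64 */

            /* Deallocate (unreserve) a byte range on XFS using the XFS-specific
               ioctl instead of fallocate(FALLOC_FL_PUNCH_HOLE). */
            static int xfs_punch_hole(int fd, long long start, long long len)
            {
              struct xfs_flock64 fl;
              memset(&fl, 0, sizeof fl);
              fl.l_whence = SEEK_SET;   /* l_start is an absolute file offset */
              fl.l_start  = start;
              fl.l_len    = len;
              if (ioctl(fd, XFS_IOC_UNRESVSP64, &fl) != 0)
              {
                perror("ioctl(XFS_IOC_UNRESVSP64)");
                return -1;
              }
              return 0;
            }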

            wlad Vladislav Vaintroub added a comment - - edited

            We do not do fallocate in backup, tanj. There is the traditional, not specifically Linux, solution: lseek() behind the end of the file, and then write(). The distance between the old end of file and the place we write is supposed to be the hole. The most modern part of the current backup solution is that we set the sparse file attribute on Windows, which has worked since the year 2000 or so.

            The server does fallocate, and maybe it does it slowly, but there is nothing else we do.


            tanj Guillaume Lefranc added a comment -

            @wlad I was referring to Marko's comment:

            We might try to convince XFS harder to create sparse files, by invoking a Linux specific fallocate() system call, to make it closer to the way how writes from the InnoDB buffer pool work.


            wlad Vladislav Vaintroub added a comment -

            It is not entirely up to me, but I do not welcome a workaround for every bug-of-the-day on every possible Linux filesystem. Maybe we eventually just document the behavior of this and that filesystem, and leave it as-is.

            euglorg Eugene added a comment -

            Indeed, a workaround for every 'bug-of-the-day' is a bad idea; I strongly agree with Vladislav!
            But looking at this case, it is not even clear which part should be treated as a bug: XFS behaves correctly from the filesystem point of view, because (a) the correct way to create a sparse file is fallocate() rather than seek-beyond-EOF, and (b) XFS preallocates blocks to minimize fragmentation, so it is behaving correctly when it allocates space that would be a hole on other filesystems. Thus, there's not even a reason to report this as an XFS bug.
            The point is that MariaDB does nothing wrong, it just uses a generic method of punching holes, and XFS does nothing wrong either, but together this causes an unwanted effect with unpredictable disk space utilization after SST.
            For now I'd ask for:

            • adding a note to the documentation about this (known) behavior that is specific to SST with XFS;
            • leaving the behavior unchanged but leaving the bug open (yes, this sounds like "known bug is a feature").
              The reason for leaving the bug open is that there's an open question whether seek-beyond-EOF is in fact the right way to handle hole punching, and whether it should be reevaluated and replaced with other (fallocate?) filesystem-specific methods, not just for XFS but for other filesystems as well... not sure, in fact.

            tanj Guillaume Lefranc added a comment -

            I'd also agree that proper documentation is preferable to a workaround; in this case, documenting that ext4 is preferable as a filesystem "in certain cases".

            Just as a note: I tried it on XFS, but sadly fallocate --dig-holes does not reclaim space.

            euglorg Eugene added a comment - - edited

            I tried fallocate --dig-holes on some files from my dataset (on XFS); there is some effect, but not the difference expected from what is seen on ext4. It probably depends on the real data size, the hole sizes and the extents allocated in every particular case.
            By the way, it would be good to add another note on dealing with holes (regardless of filesystem, probably). The case is this: if you have sparse files containing tablespaces on an SSD drive and issue the 'fstrim' command to trim unused blocks, fstrim locks the files that have holes while trimming them (that's expected), freezing IO on them and thus freezing mariadbd. Once the space is trimmed, the files are "released" and mariadbd resumes normal operation (while fstrim may still be handling unoccupied space on the same drive). On a Galera cluster this triggers flow control and slows down the whole cluster. There is no reason to report this as a separate bug (as this is also expected behavior, caused by the same compression storing data in sparse files), but a warning about this effect might be helpful.


            marko Marko Mäkelä added a comment -

            It looks like there is not much that can be improved outside the file system code in the operating system kernel.

            euglorg, the traditional UNIX way to create sparse files is to seek beyond the end of the file and write. The Linux fallocate() flag FALLOC_FL_PUNCH_HOLE is much newer; I do not think that it existed 20 years ago.

            I mostly agree with the legendary blog post https://smalldatum.blogspot.com/2015/10/wanted-file-system-on-which-innodb.html which is a criticism of the page_compressed format. By outsourcing some complexity to the Linux kernel, we essentially are at the mercy of the kernel. Either we get small annoyances like this one, or potential data corruption like https://lwn.net/Articles/864363/. I have not seen that in the wild, but I remember kernel crashes or file system corruption when Oracle tested the MySQL 5.7 response to the MariaDB 10.1 page_compressed format on a FusionIO device. We do test page_compressed internally, but mostly on the ext4 file system, and I do not think that we have tests that would monitor the actual space allocation.

            To my surprise, there still are users who prefer my filesystem-agnostic and overly complex implementation of ROW_FORMAT=COMPRESSED. I wanted to deprecate and remove write support for it, but due to community feedback, I reverted that decision in MDEV-27736.

            euglorg Eugene added a comment -

            Thank you for your explanation, marko.
            It seems that there's really nothing to be done here (except adding warnings in documentation about the case), at least at the moment.


            marko Marko Mäkelä added a comment -

            euglorg, thank you. Let’s address this in the documentation.

            I would expect the InnoDB page_compressed tables to be able to save space in a traditional file system, such as Linux ext4 or Microsoft NTFS, at the cost of increased fragmentation.

            The documentation might also include an example how to check the actually allocated physical blocks. That could be ls -ls on GNU/Linux.

            On file systems that support snapshots, the allocation of blocks could be quite different, and questions like ‘how much space does a file consume’ become more complicated. We have observed that enabling O_DIRECT (MDEV-24854) caused problems for several Linux file systems.


            tanj Guillaume Lefranc added a comment -

            It seems that Percona has fixed this by adding fallocate support in xbstream; it would be nice if MariaDB backported this to mbstream.
            https://jira.percona.com/browse/PXB-2614

            Regarding my previous comments, fallocate --dig-holes does work with XFS if speculative preallocation is disabled (allocsize=64k in the mount options).


            valerii Valerii Kravchuk added a comment -

            So, do we plan to fix anything in the mariabackup implementation for XFS as part of this bug report?


            marko Marko Mäkelä added a comment -

            I think that tanj’s suggestion is reasonable. This should be relatively simple to implement, but a bit tricky to test.


            marko Marko Mäkelä added a comment -

            This change looks OK to me. This will add some fallocate() system calls, but only when page_compressed tables are being used.


            People

              serg Sergei Golubchik
              muhammad.irfan Muhammad Irfan