Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects versions: 10.2 (EOL), 10.3 (EOL), 10.4 (EOL), 10.5, 10.6
Description
InnoDB page compression relies on "hole punching" (sparse files), but this seems to get lost somewhere along the way: the mariabackup SST method takes page-compressed data from the donor and leaves it decompressed on the joiner node. Since mariabackup SST does not copy the files as sparse, page compression brings no benefit in this case.
MariaDB [tt]> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `msg` varchar(100) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=34209256 DEFAULT CHARSET=latin1 `PAGE_COMPRESSED`='ON'
1 row in set (0.005 sec)
+--------------+------------+------------+
| TABLE_SCHEMA | TABLE_NAME | ROW_FORMAT |
+--------------+------------+------------+
| tt           | t1         | Dynamic    |
+--------------+------------+------------+
MariaDB [tt]> show global variables like '%compression%';
+------------------------------------------+------------------+
| Variable_name                            | Value            |
+------------------------------------------+------------------+
| column_compression_threshold             | 100              |
| column_compression_zlib_level            | 6                |
| column_compression_zlib_strategy         | DEFAULT_STRATEGY |
| column_compression_zlib_wrap             | OFF              |
| innodb_compression_algorithm             | zlib             |
| innodb_compression_default               | ON               |
| innodb_compression_failure_threshold_pct | 5                |
| innodb_compression_level                 | 1                |
| innodb_compression_pad_pct_max           | 50               |
+------------------------------------------+------------------+
9 rows in set (0.001 sec)
Page compression is enabled. The size of the tablespace file is:
$ ls -lrth /var/lib/mysql/tt/t1.ibd
2.1G /var/lib/mysql/tt/t1.ibd

$ du -hs --block-size=1 /var/lib/mysql/tt/t1.ibd
1098792960 /var/lib/mysql/tt/t1.ibd

$ du -hs /var/lib/mysql/tt/t1.ibd
1.1G /var/lib/mysql/tt/t1.ibd
Now, let's try mariabackup:
$ mariabackup --backup --target-dir=/home/vagrant/backup/ -u root -p'xxxx'
...
[00] 2021-05-19 20:08:13 completed OK!
Let's check the size of table t1's tablespace file in the backup:
$ ls -la /home/vagrant/backup/tt/t1.ibd
-rw-r-----. 1 root root 2231369728 May 19 20:08 /home/vagrant/backup/tt/t1.ibd

$ du -hs --block-size=1 /home/vagrant/backup/tt/t1.ibd
2214154240 /home/vagrant/backup/tt/t1.ibd

$ du -hs /home/vagrant/backup/tt/t1.ibd
2.1G /home/vagrant/backup/tt/t1.ibd
As per https://dev.mysql.com/doc/refman/5.7/en/innodb-page-compression.html:
"Preserving page compression when moving a page-compressed tablespace file from one host to another requires a utility that preserves sparse files."
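For illustration, such a "utility that preserves sparse files" essentially has to skip the holes while copying. Below is a minimal, hypothetical sketch (not MariaDB or MySQL code) that does this with the Linux-specific lseek() whence values SEEK_DATA and SEEK_HOLE, available since Linux 3.1; error handling is abbreviated:

#define _GNU_SOURCE         /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }
  int in = open(argv[1], O_RDONLY);
  int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (in < 0 || out < 0) { perror("open"); return 1; }

  off_t end = lseek(in, 0, SEEK_END);
  off_t data = lseek(in, 0, SEEK_DATA);      /* first data extent; -1 if none */
  char buf[1 << 16];

  while (data >= 0 && data < end)
  {
    off_t hole = lseek(in, data, SEEK_HOLE); /* end of this data run */
    lseek(in, data, SEEK_SET);
    lseek(out, data, SEEK_SET);              /* skipping ahead re-creates the hole */
    for (off_t left = hole - data; left > 0; )
    {
      size_t chunk = left < (off_t) sizeof buf ? (size_t) left : sizeof buf;
      ssize_t n = read(in, buf, chunk);
      if (n <= 0 || write(out, buf, n) != n) { perror("copy"); return 1; }
      left -= n;
    }
    data = lseek(in, hole, SEEK_DATA);       /* next data extent; -1 at EOF */
  }
  ftruncate(out, end);                       /* preserve the logical file size */
  close(in);
  close(out);
  return 0;
}

GNU cp --sparse=always and rsync --sparse achieve a similar effect by detecting runs of zero bytes.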
I think this is related to the xtrabackup report https://jira.percona.com/browse/PXB-1557.
Issue Links
relates to:
- MDEV-16129 InnoDB page compression not working as documented (Open)
- MDEV-22839 ROW_FORMAT=COMPRESSED vs PAGE_COMPRESSION=1 - size comparison (Open)
- MDEV-13023 mariabackup does not preserve "holes punched" for tables with page_compressed=1 (Closed)
- MDEV-15527 page_compressed compressed page partially during import tablespace (Closed)
Activity
thiru, you fixed something similar in MDEV-15527 in IMPORT TABLESPACE.
In 10.6, MDEV-26029 disables sparse files on purpose on thinly provisioned storage (computational storage devices), specifically ScaleFlux. We must not break that logic when merging this fix to 10.6.
MariaDB [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [test]> show create table t1;
+-------+------------------------------------------------------------------------------------+
| Table | Create Table                                                                       |
+-------+------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `msg` varchar(100) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=96860716 DEFAULT CHARSET=latin1 `page_compressed`=1 |
+-------+------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

MariaDB [test]> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 67108864 |
+----------+
1 row in set (43.579 sec)

MariaDB [test]> show global variables like '%compression%';
+------------------------------------------+------------------+
| Variable_name                            | Value            |
+------------------------------------------+------------------+
| column_compression_threshold             | 100              |
| column_compression_zlib_level            | 6                |
| column_compression_zlib_strategy         | DEFAULT_STRATEGY |
| column_compression_zlib_wrap             | OFF              |
| innodb_compression_algorithm             | zlib             |
| innodb_compression_default               | ON               |
| innodb_compression_failure_threshold_pct | 5                |
| innodb_compression_level                 | 1                |
| innodb_compression_pad_pct_max           | 50               |
+------------------------------------------+------------------+
Current file size:
===================

thiru@rr:~/server/10.3/brl/data/test$ ls -lsrth ./t1.ibd
3.4G -rw-rw---- 1 thiru dev 6.0G Sep 8 12:30 ./t1.ibd

thiru@rr:~/server/10.3/brl/data/test$ du -hs ./t1.ibd
3.4G ./t1.ibd
Running mariabackup with the following command:
===================================================

./mariabackup --target-dir=$PWD/backup --backup --socket=$PWD/../../data/mysql.sock --user=root
Checking the file size in the target directory:
===============================================

thiru@rr:~/server/10.3/brl/extra/mariabackup/backup/test$ ls -lsrth ./t1.ibd
3.4G -rw-r----- 1 thiru dev 6.0G Sep 8 12:34 ./t1.ibd

thiru@rr:~/server/10.3/brl/extra/mariabackup/backup/test$ du -hs ./t1.ibd
3.4G ./t1.ibd
I don't see any uncompressed tables in the target directory.
thiru@rr:~/server/10.3/brl/data/test$ cat /proc/version
Linux version 5.4.0-62-generic (buildd@lgw01-amd64-045) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
Please verify what mariabackup command the user ran. Is this issue specific to CentOS? Please also find out what filesystem they were using.
We hit a similar issue.
Our setup is Linux 5.10.67-gentoo: 3 identical servers with exactly the same MariaDB configuration, running mariadb-10.4.21.
All our tables have similar definitions with ENGINE=InnoDB ... `PAGE_COMPRESSED`='ON'.
We also have compression enabled by default for new tables:
innodb_compression_algorithm = snappy
innodb_compression_default = ON
and compression settings like:

MariaDB [(none)]> show global variables like '%compression%';
+------------------------------------------+------------------+
| Variable_name                            | Value            |
+------------------------------------------+------------------+
| column_compression_threshold             | 100              |
| column_compression_zlib_level            | 6                |
| column_compression_zlib_strategy         | DEFAULT_STRATEGY |
| column_compression_zlib_wrap             | OFF              |
| innodb_compression_algorithm             | snappy           |
| innodb_compression_default               | ON               |
| innodb_compression_failure_threshold_pct | 5                |
| innodb_compression_level                 | 6                |
| innodb_compression_pad_pct_max           | 50               |
+------------------------------------------+------------------+
We noticed the issue after running OPTIMIZE and several migrations on a node running standalone: the dataset size shrank significantly. The data was then copied to the other nodes via SST, and the SST took much longer than expected given the connection speed and the dataset size on the source node. The reason is that the dataset size grew dramatically during SST: the streamed volume matched the uncompressed data size.
The result is:
Node3
---------------
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       3.5T  2.5T  1.1T  71% /var/lib/mysql
---------------
Node2
---------------
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       3.5T  2.5T  1.1T  71% /var/lib/mysql
---------------
Node1
---------------
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       3.5T  1.7T  1.9T  49% /var/lib/mysql
Comparing tablespace file sizes, we found that each and every .ibd file is bigger on Node3 and Node2 than it was on Node1, even though the data size reported in information_schema is identical for the same table on the source server and the replicas. This means the tablespaces are not compressed on the replicas while they are on the source.
Also, examining a backup taken on Node1 with mariabackup, we found that the tablespaces inside the backup have the very same size as on the replicating servers (Node3 and Node2). For example, a table that is 25 GB on the source node is 40 GB on both replicas and in the backup copy, while the source node still operates on the compressed 25 GB table.
The [SST] section in the MySQL configuration file has no relation to this:
[SST]
compressor='/usr/bin/zstd --fast=3'
decompressor='/usr/bin/zstd -d'
The mariabackup command to make a local backup is:
mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | /usr/bin/zstd --fast -T16 -q -o /mysql_backup.zst
Again, after uncompressing the archive mysql_backup.zst, we noticed that the data it contains is 30% bigger than the backup source directory.
All this means that tablespace data is streamed uncompressed even when performing a local backup, and thus the same happens when mariabackup is used for SST, while no compression happens on the replica node.
thiru, it appears on CentOS 7/8, and it happens on the XFS filesystem during mariabackup SST.
I wrote a sample program that does create, write & seek system calls:
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define DATA_SIZE 16384

int main()
{
  /* Variable to store user content */
  char data[DATA_SIZE];
  int data_length[8]= {16384, 109, 244, 249, 7346, 7078, 7146, 7839};
  /* O_CREAT requires a mode argument; without it the file gets
     undefined permissions (note the odd mode bits in the ls output below) */
  int fd = open("./f1.txt", O_RDWR | O_CREAT, 0644);

  memset(data, 'a', DATA_SIZE);
  for (int i= 0; i < 8; i++)
  {
    int punch_hole_len= DATA_SIZE - data_length[i];
    write(fd, data, data_length[i]);
    /* Seeking past the written data should leave a hole in the file */
    if (punch_hole_len)
      lseek(fd, punch_hole_len, SEEK_CUR);
  }

  close(fd);
  printf("File created and saved successfully\n");
  return 0;
}
On the ext4 filesystem, the output file `f1.txt` has a different allocated block count:
[~/mariarepo/server/10.7/10.7-sample]$ df -T ./
Filesystem     Type 1K-blocks      Used Available Use% Mounted on
/dev/nvme0n1p7 ext4 364903200 326044404  20253024  95% /home

[~/mariarepo/server/10.7/10.7-sample]$ ls -ls f1.txt
64 ----r-x--T 1 thiru thiru 122527 Sep 22 17:46 f1.txt

[~/mariarepo/server/10.7/10.7-sample]$ du -hs f1.txt
64K f1.txt
On the XFS filesystem:

[/data/volumes/xfs32m]$ df -T ./
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/loop3 xfs      29348 1904     27444   7% /data/volumes/xfs32m

[/data/volumes/xfs32m]$ ls -ls ./f1.txt
76 -rwx--s--T 1 root root 122527 Sep 22 17:47 ./f1.txt

[/data/volumes/xfs32m]$ du -hs ./f1.txt
76K ./f1.txt
AFAIK, Mariabackup doesn't do anything wrong logically. Mariabackup preserves punched holes for the compressed format (MDEV-13023).
I think the problem lies within the filesystem.
Sorry, but this doesn't explain the case where the same issue happens while streaming data to stdout and compressing it on the fly with mariabackup --stream=xbstream | zstd -o...
Files inside the archive (which is filesystem-independent) are bigger than they are in the source /var/lib/mysql directory. Thus, MariaDB handles them incorrectly while reading and streaming, regardless of the filesystem they will be written to.
euglorg, yes, MDEV-13023 doesn't deal with compressed files while streaming. Thanks for your analysis. I will treat this issue as: Mariabackup doesn't punch holes in the file during streaming.
As far as I understand, the problem is specifically with the Galera snapshot transfer (SST). Either something in the SST script or the mariabackup options that it invokes needs to be changed.
Marko, sorry, but the problem seems to be not just with SST, as it also happens when mariabackup is used for creating a backup with mariabackup --stream=xbstream | zstd -o...; there is an example of the complete command line with all the arguments above. Again, compression of the archived data sent through the pipe as xbstream is also broken. As SST relies on mariabackup, the problem was first noticed on SST, but upon investigating backups it became clear that it happens with mariabackup itself.
euglorg - why do you uncompress the stream? What does that prove?
Compressing the stream, lightly, gives you the ability to save on the binary zeros, which actually, and undeniably, are in the stream.
Uncompressing the compressed stream can be combined, via a pipe, with mbstream -x, to actually create a data directory. And that step creates files with the holes. So you have a compressed stream, and a compressed/with-holes data directory.
That uncompressed stream does not have holes, yes, we know about that, and thus the suggestion is to compress it. Lightly, so that larger runs of binary zeros are omitted.
If you would like to point out that mbstream -x does not create holes on some filesystem, I would think there is something wrong with that filesystem. mariabackup makes an effort to create holes, portably, with original Unix APIs from the 1970s: lseek() behind the end of file, and write() there. A popular answer on Stack Overflow describes how to programmatically create holes, and this is exactly what mariabackup does.
Vladislav, sorry for the misunderstanding; let me explain the problem step by step.
First, your question is a bit unclear, as there is no mention of stream uncompression in the command I provided:
mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | ...
It is expected that the data received from the pipe will have its pages compressed, isn't it?
Here the commands to the right of the pipe are omitted: an uncompressed stream is sent through the pipe, while there is no explicit mention of uncompressing tablespace data in either the command or the related help page (this has no relation to compressing and uncompressing the stream sent through socat while performing SST, for sure).
To make the case clear, the full backup-restore command set would be:

# create backup archive
mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | /usr/bin/zstd --fast -T16 -q -o /mysql_backup.zst

# uncompress the archive
zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp

# prepare the backup
mariabackup --prepare --innodb-doublewrite --target-dir /mnt/mariabackuptmp
But the second command always produces a dataset with uncompressed tablespace data. We currently use XFS. If you are sure this is a filesystem problem, would you expect the data to have the same size if I uncompressed the archived backup data onto ext4?
Please give me a couple of days; I'll try to extract the archived backup onto ext4... If it has the same size as the original data, then the problem may be within the filesystem, and there should be an explicit note in the documentation that MariaDB can't use tablespace compression when mariabackup is used for SST and the data resides on an XFS partition.
However, if zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp (see above how it was created) also produces a dataset with uncompressed tablespaces on ext4 (meaning the command chain mariabackup | mbstream -x creates a dataset that is 30% bigger than expected), that would mean the data sent through the pipe already has broken compression, as it has not touched any filesystem at that point.
If the above logic is wrong, please let me know.
euglorg, did you try the test program that thiru posted? It is producing different results on ext4 and XFS.
As far as I can tell, the function write_compressed() in extra/mariabackup/ds_local.cc is creating sparse files in the traditional way: by writing and seeking past the current end of the file.
We might try to convince XFS harder to create sparse files by invoking a Linux-specific fallocate() system call, to bring it closer to the way writes from the InnoDB buffer pool work.
I found a question that suggests that this is an XFS design feature:
https://stackoverflow.com/questions/6940516/create-sparse-file-with-alternate-data-and-hole-on-ext3-and-xfs
This one may be relevant too:
https://serverfault.com/questions/406069/why-are-my-xfs-filesystems-suddenly-consuming-more-space-and-full-of-sparse-file
If XFS is by design so reluctant to create sparse files with tiny blocks when they are created in the traditional POSIX way (seeking and writing), as opposed to FALLOC_FL_PUNCH_HOLE, maybe it is not an ideal match for page_compressed tables.
Here is another link, http://web.archive.org/web/20171010154045/http://oss.sgi.com/archives/xfs/2012-02/msg00413.html, with a discussion of XFS mount options. Disclaimer: I did not try that.
It is probably helpful to punch holes on XFS explicitly, with FALLOC_FL_PUNCH_HOLE, after determining that the filesystem is XFS.
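A rough sketch of that idea (a hypothetical standalone demo, not the actual mariabackup change; the offset and length are example values): detect XFS with fstatfs() and deallocate the unused part of a page explicitly with fallocate():

#define _GNU_SOURCE
#include <fcntl.h>          /* open(), fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <linux/magic.h>    /* XFS_SUPER_MAGIC */
#include <sys/vfs.h>        /* fstatfs() */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int fd = open("./f1.txt", O_RDWR);
  if (fd < 0) { perror("open"); return 1; }

  struct statfs sfs;
  if (fstatfs(fd, &sfs) == 0 && sfs.f_type == XFS_SUPER_MAGIC)
    printf("file is on XFS; punching holes explicitly\n");

  /* Deallocate 8192 bytes at offset 16384. KEEP_SIZE is mandatory with
     PUNCH_HOLE, so the logical file size stays unchanged. */
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 16384, 8192))
    perror("fallocate");

  close(fd);
  return 0;
}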
Marko, I haven't tried that test. Instead, I tried extracting the backup archive onto an ext4 partition and can confirm that the issue happens on XFS only.
The case looks as follows:
- The original dataset with page compression has a size of 1.7T.
- After running `mariabackup --stream=xbstream | /usr/bin/zstd -o /mysql_backup.zst && zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp`, where /mnt/mariabackuptmp is an XFS partition, the dataset inside /mnt/mariabackuptmp has a size of 2.5T.
- After running the same command where /mnt/mariabackuptmp is an ext4 partition, the dataset inside /mnt/mariabackuptmp has a size of 1.7T.
The thing is that XFS has better performance for multithreaded I/O, which was the reason for using it. Probably Vladislav is right and XFS requires specific handling when page compression is in use.
Perhaps fallocate --dig-holes on the backup files, after decompression, could help as a workaround.
But this workaround is probably far from a good option for SST transfer; it will make it much slower.
The xbstream/mbstream format is currently not optimized for holes. Thus the suggestion to use lightweight stream compression. Lightweight means it is capable of detecting and compressing large runs of 0 bytes, yet does not try too hard to compress already-compressed data. zlib with level 1 might be a good candidate.
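To illustrate what "lightweight" means here, below is a minimal sketch of level-1 zlib stream compression (essentially zlib's classic zpipe example; compile with -lz). In the pipelines above one would of course just use zstd --fast or gzip -1; this only demonstrates the idea:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384

int main(void)
{
  z_stream strm;
  unsigned char in[CHUNK], out[CHUNK];
  memset(&strm, 0, sizeof strm);
  if (deflateInit(&strm, 1) != Z_OK)  /* level 1: cheap, still squeezes zero runs */
    return 1;

  int flush;
  do
  {
    strm.avail_in = fread(in, 1, CHUNK, stdin);
    flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
    strm.next_in = in;
    do                                /* drain all output for this input chunk */
    {
      strm.avail_out = CHUNK;
      strm.next_out = out;
      deflate(&strm, flush);          /* return value ignored in this sketch */
      fwrite(out, 1, CHUNK - strm.avail_out, stdout);
    } while (strm.avail_out == 0);
  } while (flush != Z_FINISH);

  deflateEnd(&strm);
  return 0;
}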
Just as a side note, punching holes with fallocate on XFS is slow:
https://issues.apache.org/jira/browse/KUDU-2052
So eventually one of the solutions could be to patch mariabackup so that, if XFS is detected, XFS_IOC_UNRESVSP64 is used to punch holes in files.
We do not do fallocate in backup, tanj. There is the traditional, not specifically Linuxy, solution: lseek() behind the end of file, and then write(). The distance between the old end of file and the place we write is supposed to be the hole. The most modern part of the current backup solution is that we set the sparse file attribute on Windows, which has worked since the year 2000 or so.
The server does fallocate, and maybe it does it slowly, but there is nothing else we do.
@wlad I was referring to Marko's comment:
We might try to convince XFS harder to create sparse files, by invoking a Linux specific fallocate() system call, to make it closer to the way how writes from the InnoDB buffer pool work.
It is not entirely up to me, but I do not welcome a workaround for every bug-of-the-day on every possible Linux filesystem. Maybe we eventually just document the behavior of this and that filesystem, and leave it as-is.
Indeed, a workaround for every 'bug-of-the-day' is a bad idea; I strongly agree with Vladislav!
But looking at this case, it is not even clear which part should be treated as a bug: XFS behaves correctly from the filesystem point of view because (a) the correct way to create a sparse file is fallocate() and not seek-beyond-EOF, and (b) XFS preallocates blocks to minimize fragmentation, so it legitimately uses space that would be a hole on other filesystems. Thus, there is no reason to report this as an XFS bug.
The point is that MariaDB does nothing wrong (it just uses a generic method of punching holes), and XFS does nothing wrong either, yet together they cause an unwanted effect: unpredictable disk space utilization after SST.
For now I'd ask for:
- adding a note to the documentation about this (known) behavior, which is specific to SST with XFS;
- leaving the behavior unchanged but keeping the bug open (yes, this sounds like "a known bug is a feature").
The reason for keeping the bug open is that there is a question whether seek-beyond-EOF is in fact the correct method of hole punching: it should perhaps be reevaluated and replaced with other (fallocate?) filesystem-specific methods, and not just for XFS but for other filesystems too... not sure, in fact.
I'd also agree that proper documentation is preferable to a workaround; in this case, stating that ext4 is preferable as a filesystem "in certain cases".
Just as a note: I tried on XFS, but fallocate --dig-holes sadly does not reclaim space.
I tried fallocate --dig-holes on some files from my dataset (on XFS); there is some effect, but not the expected difference seen on ext4. It probably depends on the real data size, the hole sizes, and the extents allocated in each particular case.
By the way, it would be good to add another note on dealing with holes (probably regardless of filesystem). If you have sparse files containing tablespaces on an SSD drive and issue the fstrim command to trim unused blocks, fstrim locks the files that have holes while trimming them (which is expected), freezing I/O on them and thus freezing mariadbd. Once the space is trimmed, the files are "released" and mariadbd resumes normal operation (while fstrim may still be handling unoccupied space on the same drive). On a Galera cluster this triggers flow control and slows down the whole cluster. There is no reason to report this as a separate bug (it is also expected behavior, caused by the same compression storing data in sparse files), but a warning about this effect might be helpful.
It looks like there is not much that can be improved outside the file system code in the operating system kernel.
euglorg, the traditional UNIX way to create sparse files is to seek beyond the end of the file and write. The Linux fallocate() flag FALLOC_FL_PUNCH_HOLE is much newer; I do not think that it existed 20 years ago.
I mostly agree with the legendary blog post https://smalldatum.blogspot.com/2015/10/wanted-file-system-on-which-innodb.html which is a criticism of the page_compressed format. By outsourcing some complexity to the Linux kernel, we essentially are at the mercy of the kernel. Either we get small annoyances like this one, or potential data corruption like https://lwn.net/Articles/864363/. I have not seen that in the wild, but I remember kernel crashes or file system corruption when Oracle tested the MySQL 5.7 response to the MariaDB 10.1 page_compressed format on a FusionIO device. We do test page_compressed internally, but mostly on the ext4 file system, and I do not think that we have tests that would monitor the actual space allocation.
To my surprise, there still are users who prefer my filesystem-agnostic and overly complex implementation of ROW_FORMAT=COMPRESSED. I wanted to deprecate and remove write support for it, but due to community feedback, I reverted that decision in MDEV-27736.
Thank you for your explanation, marko.
It seems that there's really nothing to be done here (except adding warnings about this case to the documentation), at least at the moment.
euglorg, thank you. Let’s address this in the documentation.
I would expect the InnoDB page_compressed tables to be able to save space in a traditional file system, such as Linux ext4 or Microsoft NTFS, at the cost of increased fragmentation.
The documentation might also include an example of how to check the actually allocated physical blocks. That could be ls -ls on GNU/Linux.
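For example, a tiny hypothetical checker using stat(2); on Linux, st_blocks is counted in 512-byte units, so the physically allocated bytes can be compared against the logical st_size:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
  struct stat st;
  if (argc < 2 || stat(argv[1], &st) != 0) { perror("stat"); return 1; }
  /* A page_compressed tablespace should show noticeably fewer
     allocated bytes than its logical size. */
  printf("logical size:    %lld bytes\n", (long long) st.st_size);
  printf("allocated space: %lld bytes\n", (long long) st.st_blocks * 512);
  return 0;
}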
On file systems that support snapshots, the allocation of blocks could be quite different, and questions like ‘how much space does a file consume’ become more complicated. We have observed that enabling O_DIRECT (MDEV-24854) caused problems for several Linux file systems.
It seems that Percona has fixed this by adding fallocate support in xbstream; it would be nice if MariaDB backported this to mbstream.
https://jira.percona.com/browse/PXB-2614
Following up on my previous comments: fallocate --dig-holes works with XFS if speculative preallocation is disabled (allocsize=64k in the mount options).
So, do we plan to fix anything in the mariabackup implementation for XFS within the scope of this bug report?
I think that tanj’s suggestion is reasonable. This should be relatively simple to implement, but a bit tricky to test.
This change looks OK to me. This will add some fallocate() system calls, but only when page_compressed tables are being used.
Also, mariabackup in 10.5 can't handle sparse files.
I didn't test 10.6.