Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects versions: 10.2 (EOL), 10.3 (EOL), 10.4 (EOL), 10.5, 10.6
Description
InnoDB page compression relies on "hole punching" (sparse files), but this seems to get lost somewhere along the way: the mariabackup SST method takes page-compressed data from the donor and leaves it decompressed on the joiner node. Since mariabackup SST does not copy the files as sparse, page compression brings no benefit in this case.
MariaDB [tt]> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `msg` varchar(100) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=34209256 DEFAULT CHARSET=latin1 `PAGE_COMPRESSED`='ON'
1 row in set (0.005 sec)
+--------------+------------+------------+
| TABLE_SCHEMA | TABLE_NAME | ROW_FORMAT |
+--------------+------------+------------+
| tt           | t1         | Dynamic    |
+--------------+------------+------------+
MariaDB [tt]> show global variables like '%compression%';
+------------------------------------------+------------------+
| Variable_name                            | Value            |
+------------------------------------------+------------------+
| column_compression_threshold             | 100              |
| column_compression_zlib_level            | 6                |
| column_compression_zlib_strategy         | DEFAULT_STRATEGY |
| column_compression_zlib_wrap             | OFF              |
| innodb_compression_algorithm             | zlib             |
| innodb_compression_default               | ON               |
| innodb_compression_failure_threshold_pct | 5                |
| innodb_compression_level                 | 1                |
| innodb_compression_pad_pct_max           | 50               |
+------------------------------------------+------------------+
9 rows in set (0.001 sec)
Page compression is enabled. The size of the tablespace file is:
$ ls -lrth /var/lib/mysql/tt/t1.ibd
2.1G /var/lib/mysql/tt/t1.ibd

$ du -hs --block-size=1 /var/lib/mysql/tt/t1.ibd
1098792960 /var/lib/mysql/tt/t1.ibd

$ du -hs /var/lib/mysql/tt/t1.ibd
1.1G /var/lib/mysql/tt/t1.ibd
Now, let's try mariabackup:
$ mariabackup --backup --target-dir=/home/vagrant/backup/ -u root -p'xxxx'
...
[00] 2021-05-19 20:08:13 completed OK!
Let's check the size of table t1's tablespace file in the backup:
$ ls -la /home/vagrant/backup/tt/t1.ibd
-rw-r-----. 1 root root 2231369728 May 19 20:08 /home/vagrant/backup/tt/t1.ibd

$ du -hs --block-size=1 /home/vagrant/backup/tt/t1.ibd
2214154240 /home/vagrant/backup/tt/t1.ibd

$ du -hs /home/vagrant/backup/tt/t1.ibd
2.1G /home/vagrant/backup/tt/t1.ibd
As per https://dev.mysql.com/doc/refman/5.7/en/innodb-page-compression.html:
"Preserving page compression when moving a page-compressed tablespace file from one host to another requires a utility that preserves sparse files."
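For illustration, such a "utility that preserves sparse files" essentially has to skip the holes while copying. Below is a minimal, hypothetical sketch (not MariaDB or MySQL code) that does this with the Linux-specific lseek() whence values SEEK_DATA and SEEK_HOLE, available since Linux 3.1; error handling is abbreviated:

#define _GNU_SOURCE         /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }
  int in = open(argv[1], O_RDONLY);
  int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (in < 0 || out < 0) { perror("open"); return 1; }

  off_t end = lseek(in, 0, SEEK_END);
  off_t data = lseek(in, 0, SEEK_DATA);      /* first data extent; -1 if none */
  char buf[1 << 16];

  while (data >= 0 && data < end)
  {
    off_t hole = lseek(in, data, SEEK_HOLE); /* end of this data run */
    lseek(in, data, SEEK_SET);
    lseek(out, data, SEEK_SET);              /* skipping ahead re-creates the hole */
    for (off_t left = hole - data; left > 0; )
    {
      size_t chunk = left < (off_t) sizeof buf ? (size_t) left : sizeof buf;
      ssize_t n = read(in, buf, chunk);
      if (n <= 0 || write(out, buf, n) != n) { perror("copy"); return 1; }
      left -= n;
    }
    data = lseek(in, hole, SEEK_DATA);       /* next data extent; -1 at EOF */
  }
  ftruncate(out, end);                       /* preserve the logical file size */
  close(in);
  close(out);
  return 0;
}

GNU cp --sparse=always and rsync --sparse achieve a similar effect by detecting runs of zero bytes.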
I think this is related to the xtrabackup report https://jira.percona.com/browse/PXB-1557.
Issue Links
relates to:
- MDEV-16129 InnoDB page compression not working as documented (Open)
- MDEV-22839 ROW_FORMAT=COMPRESSED vs PAGE_COMPRESSION=1 - size comparison (Open)
- MDEV-13023 mariabackup does not preserve "holes punched" for tables with page_compressed=1 (Closed)
- MDEV-15527 page_compressed compressed page partially during import tablespace (Closed)
Activity
thiru, you fixed something similar in MDEV-15527 in IMPORT TABLESPACE.
In 10.6, MDEV-26029 disables sparse files on purpose on thinly provisioned storage (computational storage devices), specifically ScaleFlux. We must not break that logic when merging this fix to 10.6.
MariaDB [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [test]> show create table t1;
+-------+------------------------------------------------------------------------------------+
| Table | Create Table                                                                       |
+-------+------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `msg` varchar(100) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=96860716 DEFAULT CHARSET=latin1 `page_compressed`=1 |
+-------+------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

MariaDB [test]> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 67108864 |
+----------+
1 row in set (43.579 sec)

MariaDB [test]> show global variables like '%compression%';
+------------------------------------------+------------------+
| Variable_name                            | Value            |
+------------------------------------------+------------------+
| column_compression_threshold             | 100              |
| column_compression_zlib_level            | 6                |
| column_compression_zlib_strategy         | DEFAULT_STRATEGY |
| column_compression_zlib_wrap             | OFF              |
| innodb_compression_algorithm             | zlib             |
| innodb_compression_default               | ON               |
| innodb_compression_failure_threshold_pct | 5                |
| innodb_compression_level                 | 1                |
| innodb_compression_pad_pct_max           | 50               |
+------------------------------------------+------------------+
Current file size:
===================

thiru@rr:~/server/10.3/brl/data/test$ ls -lsrth ./t1.ibd
3.4G -rw-rw---- 1 thiru dev 6.0G Sep 8 12:30 ./t1.ibd

thiru@rr:~/server/10.3/brl/data/test$ du -hs ./t1.ibd
3.4G ./t1.ibd
Running mariabackup with the following command:
===================================================

./mariabackup --target-dir=$PWD/backup --backup --socket=$PWD/../../data/mysql.sock --user=root
Checking the file size in the target directory:
===============================================

thiru@rr:~/server/10.3/brl/extra/mariabackup/backup/test$ ls -lsrth ./t1.ibd
3.4G -rw-r----- 1 thiru dev 6.0G Sep 8 12:34 ./t1.ibd

thiru@rr:~/server/10.3/brl/extra/mariabackup/backup/test$ du -hs ./t1.ibd
3.4G ./t1.ibd
I don't see any uncompressed tables in the target directory.
thiru@rr:~/server/10.3/brl/data/test$ cat /proc/version
Linux version 5.4.0-62-generic (buildd@lgw01-amd64-045) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
Please verify what mariabackup command the user ran. Is this issue specific to CentOS? Please also find out what filesystem they were using.
We hit a similar issue.
Our setup is Linux 5.10.67-gentoo: 3 identical servers with exactly the same MariaDB configuration, running mariadb-10.4.21.
All our tables have similar definitions with ENGINE=InnoDB ... `PAGE_COMPRESSED`='ON'.
We also have compression enabled by default for new tables:
innodb_compression_algorithm = snappy
innodb_compression_default = ON
and compression settings like:

MariaDB [(none)]> show global variables like '%compression%';
+------------------------------------------+------------------+
| Variable_name                            | Value            |
+------------------------------------------+------------------+
| column_compression_threshold             | 100              |
| column_compression_zlib_level            | 6                |
| column_compression_zlib_strategy         | DEFAULT_STRATEGY |
| column_compression_zlib_wrap             | OFF              |
| innodb_compression_algorithm             | snappy           |
| innodb_compression_default               | ON               |
| innodb_compression_failure_threshold_pct | 5                |
| innodb_compression_level                 | 6                |
| innodb_compression_pad_pct_max           | 50               |
+------------------------------------------+------------------+
We noticed the issue after running OPTIMIZE and several migrations on a node running standalone: the dataset size shrank significantly. The data was then copied to the other nodes via SST, and the SST took much longer than expected given the connection speed and the dataset size on the source node. The reason is that the dataset size grew dramatically during SST: the streamed volume matched the uncompressed data size.
The result is:
Node3
---------------
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       3.5T  2.5T  1.1T  71% /var/lib/mysql
---------------
Node2
---------------
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       3.5T  2.5T  1.1T  71% /var/lib/mysql
---------------
Node1
---------------
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8       3.5T  1.7T  1.9T  49% /var/lib/mysql
Comparing tablespace file sizes, we found that each and every .ibd file is bigger on Node3 and Node2 than it was on Node1, even though the data size reported in information_schema is identical for the same table on the source server and the replicas. This means the tablespaces are not compressed on the replicas while they are on the source.
Also, examining a backup taken on Node1 with mariabackup, we found that the tablespaces inside the backup have the very same size as on the replicating servers (Node3 and Node2). For example, a table that is 25 GB on the source node is 40 GB on both replicas and in the backup copy, while the source node still operates on the compressed 25 GB table.
The [SST] section in the MySQL configuration file has no relation to this:
[SST]
compressor='/usr/bin/zstd --fast=3'
decompressor='/usr/bin/zstd -d'
The mariabackup command to make a local backup is:
mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | /usr/bin/zstd --fast -T16 -q -o /mysql_backup.zst
Again, after uncompressing the archive mysql_backup.zst, we noticed that the data it contains is 30% bigger than the backup source directory.
All this means that tablespace data is streamed uncompressed even when performing a local backup, and thus the same happens when mariabackup is used for SST, while no compression happens on the replica node.
thiru, it appears on CentOS 7/8, and it happens on the XFS filesystem during mariabackup SST.
I wrote a sample program that does create, write & seek system calls:
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define DATA_SIZE 16384

int main()
{
  /* Variable to store user content */
  char data[DATA_SIZE];
  int data_length[8]= {16384, 109, 244, 249, 7346, 7078, 7146, 7839};
  /* O_CREAT requires a mode argument; without it the file gets
     undefined permissions (note the odd mode bits in the ls output below) */
  int fd = open("./f1.txt", O_RDWR | O_CREAT, 0644);

  memset(data, 'a', DATA_SIZE);
  for (int i= 0; i < 8; i++)
  {
    int punch_hole_len= DATA_SIZE - data_length[i];
    write(fd, data, data_length[i]);
    /* Seeking past the written data should leave a hole in the file */
    if (punch_hole_len)
      lseek(fd, punch_hole_len, SEEK_CUR);
  }

  close(fd);
  printf("File created and saved successfully\n");
  return 0;
}
On the ext4 filesystem, the output file `f1.txt` has a different allocated block count:
[~/mariarepo/server/10.7/10.7-sample]$ df -T ./
Filesystem     Type 1K-blocks      Used Available Use% Mounted on
/dev/nvme0n1p7 ext4 364903200 326044404  20253024  95% /home

[~/mariarepo/server/10.7/10.7-sample]$ ls -ls f1.txt
64 ----r-x--T 1 thiru thiru 122527 Sep 22 17:46 f1.txt

[~/mariarepo/server/10.7/10.7-sample]$ du -hs f1.txt
64K f1.txt
On the XFS filesystem:

[/data/volumes/xfs32m]$ df -T ./
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/loop3 xfs      29348 1904     27444   7% /data/volumes/xfs32m

[/data/volumes/xfs32m]$ ls -ls ./f1.txt
76 -rwx--s--T 1 root root 122527 Sep 22 17:47 ./f1.txt

[/data/volumes/xfs32m]$ du -hs ./f1.txt
76K ./f1.txt
AFAIK, Mariabackup doesn't do anything wrong logically. Mariabackup preserves punched holes for the compressed format (MDEV-13023).
I think the problem lies within the filesystem.
Sorry, but this doesn't explain the case where the same issue happens while streaming data to stdout and compressing it on the fly with mariabackup --stream=xbstream | zstd -o...
Files inside the archive (which is filesystem-independent) are bigger than they are in the source /var/lib/mysql directory. Thus, MariaDB handles them incorrectly while reading and streaming, regardless of the filesystem they will be written to.
euglorg, yes, MDEV-13023 doesn't deal with compressed files while streaming. Thanks for your analysis. I will treat this issue as: Mariabackup doesn't punch holes in the file during streaming.
As far as I understand, the problem is specifically with the Galera snapshot transfer (SST). Either something in the SST script or the mariabackup options that it invokes needs to be changed.
Marko, sorry, but the problem seems to be not just with SST, as it also happens when mariabackup is used for creating a backup with mariabackup --stream=xbstream | zstd -o...; there is an example of the complete command line with all the arguments above. Again, compression of the archived data sent through the pipe as xbstream is also broken. As SST relies on mariabackup, the problem was first noticed on SST, but upon investigating backups it became clear that it happens with mariabackup itself.
euglorg - why do you uncompress the stream? What does that prove?
Compressing the stream, lightly, gives you the ability to save on the binary zeros, which actually, and undeniably, are in the stream.
Uncompressing the compressed stream can be combined, via a pipe, with mbstream -x, to actually create a data directory. And that step creates files with the holes. So you have a compressed stream, and a compressed/with-holes data directory.
That uncompressed stream does not have holes, yes, we know about that, and thus the suggestion is to compress it. Lightly, so that larger runs of binary zeros are omitted.
If you would like to point out that mbstream -x does not create holes on some filesystem, I would think there is something wrong with that filesystem. mariabackup makes an effort to create holes, portably, with original Unix APIs from the 1970s: lseek() behind the end of file, and write() there. A popular answer on Stack Overflow describes how to programmatically create holes, and this is exactly what mariabackup does.
Vladislav, sorry for the misunderstanding; let me explain the problem step by step.
First, your question is a bit unclear, as there is no mention of stream uncompression in the command I provided:
mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | ...
It is expected that the data received from the pipe will have its pages compressed, isn't it?
Here the commands to the right of the pipe are omitted: an uncompressed stream is sent through the pipe, while there is no explicit mention of uncompressing tablespace data in either the command or the related help page (this has no relation to compressing and uncompressing the stream sent through socat while performing SST, for sure).
To make the case clear, the full backup-restore command set would be:

# create backup archive
mariabackup -u root -p --backup --galera-info --stream=xbstream --parallel 16 --use-memory=16G --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql | /usr/bin/zstd --fast -T16 -q -o /mysql_backup.zst

# uncompress the archive
zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp

# prepare the backup
mariabackup --prepare --innodb-doublewrite --target-dir /mnt/mariabackuptmp
But the second command always produces a dataset with uncompressed tablespace data. We currently use XFS. If you are sure this is a filesystem problem, would you expect the data to have the same size if I uncompressed the archived backup data onto ext4?
Please give me a couple of days; I'll try to extract the archived backup onto ext4... If it has the same size as the original data, then the problem may be within the filesystem, and there should be an explicit note in the documentation that MariaDB can't use tablespace compression when mariabackup is used for SST and the data resides on an XFS partition.
However, if zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp (see above how it was created) also produces a dataset with uncompressed tablespaces on ext4 (meaning the command chain mariabackup | mbstream -x creates a dataset that is 30% bigger than expected), that would mean the data sent through the pipe already has broken compression, as it has not touched any filesystem at that point.
If the above logic is wrong, please let me know.
euglorg, did you try the test program that thiru posted? It is producing different results on ext4 and XFS.
As far as I can tell, the function write_compressed() in extra/mariabackup/ds_local.cc is creating sparse files in the traditional way: by writing and seeking past the current end of the file.
We might try to convince XFS harder to create sparse files by invoking a Linux-specific fallocate() system call, to bring it closer to the way writes from the InnoDB buffer pool work.
I found a question that suggests that this is an XFS design feature:
https://stackoverflow.com/questions/6940516/create-sparse-file-with-alternate-data-and-hole-on-ext3-and-xfs
This one may be relevant too:
https://serverfault.com/questions/406069/why-are-my-xfs-filesystems-suddenly-consuming-more-space-and-full-of-sparse-file
If XFS is by design so reluctant to create sparse files with tiny blocks when they are created in the traditional POSIX way (seeking and writing), as opposed to FALLOC_FL_PUNCH_HOLE, maybe it is not an ideal match for page_compressed tables.
Here is another link, http://web.archive.org/web/20171010154045/http://oss.sgi.com/archives/xfs/2012-02/msg00413.html, with a discussion of XFS mount options. Disclaimer: I did not try that.
It is probably helpful to punch holes on XFS explicitly, with FALLOC_FL_PUNCH_HOLE, after determining that the filesystem is XFS.
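A rough sketch of that idea (a hypothetical standalone demo, not the actual mariabackup change; the offset and length are example values): detect XFS with fstatfs() and deallocate the unused part of a page explicitly with fallocate():

#define _GNU_SOURCE
#include <fcntl.h>          /* open(), fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <linux/magic.h>    /* XFS_SUPER_MAGIC */
#include <sys/vfs.h>        /* fstatfs() */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int fd = open("./f1.txt", O_RDWR);
  if (fd < 0) { perror("open"); return 1; }

  struct statfs sfs;
  if (fstatfs(fd, &sfs) == 0 && sfs.f_type == XFS_SUPER_MAGIC)
    printf("file is on XFS; punching holes explicitly\n");

  /* Deallocate 8192 bytes at offset 16384. KEEP_SIZE is mandatory with
     PUNCH_HOLE, so the logical file size stays unchanged. */
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 16384, 8192))
    perror("fallocate");

  close(fd);
  return 0;
}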
Marko, I haven't tried that test. Instead, I tried extracting the backup archive onto an ext4 partition and can confirm that the issue happens on XFS only.
The case looks as follows:
- The original dataset with page compression has a size of 1.7T.
- After running `mariabackup --stream=xbstream | /usr/bin/zstd -o /mysql_backup.zst && zstd -d -c /mysql_backup.zst | mbstream -x -C /mnt/mariabackuptmp`, where /mnt/mariabackuptmp is an XFS partition, the dataset inside /mnt/mariabackuptmp has a size of 2.5T.
- After running the same command where /mnt/mariabackuptmp is an ext4 partition, the dataset inside /mnt/mariabackuptmp has a size of 1.7T.
The thing is that XFS has better performance for multithreaded I/O, which was the reason for using it. Probably Vladislav is right and XFS requires specific handling when page compression is in use.
Perhaps fallocate --dig-holes on the backup files, after decompression, could help as a workaround.
But this workaround is probably far from a good option for SST transfer; it will make it much slower.
The xbstream/mbstream format is currently not optimized for holes. Thus the suggestion to use lightweight stream compression. Lightweight means it is capable of detecting and compressing large runs of 0 bytes, yet does not try too hard to compress already-compressed data. zlib with level 1 might be a good candidate.
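To illustrate what "lightweight" means here, below is a minimal sketch of level-1 zlib stream compression (essentially zlib's classic zpipe example; compile with -lz). In the pipelines above one would of course just use zstd --fast or gzip -1; this only demonstrates the idea:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384

int main(void)
{
  z_stream strm;
  unsigned char in[CHUNK], out[CHUNK];
  memset(&strm, 0, sizeof strm);
  if (deflateInit(&strm, 1) != Z_OK)  /* level 1: cheap, still squeezes zero runs */
    return 1;

  int flush;
  do
  {
    strm.avail_in = fread(in, 1, CHUNK, stdin);
    flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
    strm.next_in = in;
    do                                /* drain all output for this input chunk */
    {
      strm.avail_out = CHUNK;
      strm.next_out = out;
      deflate(&strm, flush);          /* return value ignored in this sketch */
      fwrite(out, 1, CHUNK - strm.avail_out, stdout);
    } while (strm.avail_out == 0);
  } while (flush != Z_FINISH);

  deflateEnd(&strm);
  return 0;
}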
Just as a side note, punching holes with fallocate on XFS is slow:
https://issues.apache.org/jira/browse/KUDU-2052
So eventually one of the solutions could be to patch mariabackup so that, if XFS is detected, XFS_IOC_UNRESVSP64 is used to punch holes in files.
We do not do fallocate in backup, tanj. There is the traditional, not specifically Linuxy, solution: lseek() behind the end of file, and then write(). The distance between the old end of file and the place we write is supposed to be the hole. The most modern part of the current backup solution is that we set the sparse file attribute on Windows, which has worked since the year 2000 or so.
The server does fallocate, and maybe it does it slowly, but there is nothing else we do.
@wlad I was referring to Marko's comment:
We might try to convince XFS harder to create sparse files, by invoking a Linux specific fallocate() system call, to make it closer to the way how writes from the InnoDB buffer pool work.
It is not entirely up to me, but I do not welcome a workaround for every bug-of-the-day on every possible Linux filesystem. Maybe we eventually just document the behavior of this and that filesystem, and leave it as-is.
Indeed, a workaround for every 'bug-of-the-day' is a bad idea; I strongly agree with Vladislav!
But looking at this case, it is not even clear which part should be treated as a bug: XFS behaves correctly from the filesystem point of view because (a) the correct way to create a sparse file is fallocate() and not seek-beyond-EOF, and (b) XFS preallocates blocks to minimize fragmentation, so it legitimately uses space that would be a hole on other filesystems. Thus, there is no reason to report this as an XFS bug.
The point is that MariaDB does nothing wrong (it just uses a generic method of punching holes), and XFS does nothing wrong either, yet together they cause an unwanted effect: unpredictable disk space utilization after SST.
For now I'd ask for:
- adding a note to the documentation about this (known) behavior, which is specific to SST with XFS;
- leaving the behavior unchanged but keeping the bug open (yes, this sounds like "a known bug is a feature").
The reason for keeping the bug open is that there is a question whether seek-beyond-EOF is in fact the correct method of hole punching: it should perhaps be reevaluated and replaced with other (fallocate?) filesystem-specific methods, and not just for XFS but for other filesystems too... not sure, in fact.
I'd also agree that proper documentation is preferable to a workaround; in this case, stating that ext4 is preferable as a filesystem "in certain cases".
Just as a note: I tried on XFS, but fallocate --dig-holes sadly does not reclaim space.
I tried fallocate --dig-holes on some files from my dataset (on XFS); there is some effect, but not the expected difference seen on ext4. It probably depends on the real data size, the hole sizes, and the extents allocated in each particular case.
By the way, it would be good to add another note on dealing with holes (probably regardless of filesystem). If you have sparse files containing tablespaces on an SSD drive and issue the fstrim command to trim unused blocks, fstrim locks the files that have holes while trimming them (which is expected), freezing I/O on them and thus freezing mariadbd. Once the space is trimmed, the files are "released" and mariadbd resumes normal operation (while fstrim may still be handling unoccupied space on the same drive). On a Galera cluster this triggers flow control and slows down the whole cluster. There is no reason to report this as a separate bug (it is also expected behavior, caused by the same compression storing data in sparse files), but a warning about this effect might be helpful.
It looks like there is not much that can be improved outside the file system code in the operating system kernel.
euglorg, the traditional UNIX way to create sparse files is to seek beyond the end of the file and write. The Linux fallocate() flag FALLOC_FL_PUNCH_HOLE is much newer; I do not think that it existed 20 years ago.
I mostly agree with the legendary blog post https://smalldatum.blogspot.com/2015/10/wanted-file-system-on-which-innodb.html which is a criticism of the page_compressed format. By outsourcing some complexity to the Linux kernel, we essentially are at the mercy of the kernel. Either we get small annoyances like this one, or potential data corruption like https://lwn.net/Articles/864363/. I have not seen that in the wild, but I remember kernel crashes or file system corruption when Oracle tested the MySQL 5.7 response to the MariaDB 10.1 page_compressed format on a FusionIO device. We do test page_compressed internally, but mostly on the ext4 file system, and I do not think that we have tests that would monitor the actual space allocation.
To my surprise, there still are users who prefer my filesystem-agnostic and overly complex implementation of ROW_FORMAT=COMPRESSED. I wanted to deprecate and remove write support for it, but due to community feedback, I reverted that decision in MDEV-27736.
Thank you for your explanation, marko.
It seems that there's really nothing to be done here (except adding warnings about this case to the documentation), at least at the moment.
euglorg, thank you. Let’s address this in the documentation.
I would expect the InnoDB page_compressed tables to be able to save space in a traditional file system, such as Linux ext4 or Microsoft NTFS, at the cost of increased fragmentation.
The documentation might also include an example of how to check the actually allocated physical blocks. That could be ls -ls on GNU/Linux.
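For example, a tiny hypothetical checker using stat(2); on Linux, st_blocks is counted in 512-byte units, so the physically allocated bytes can be compared against the logical st_size:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
  struct stat st;
  if (argc < 2 || stat(argv[1], &st) != 0) { perror("stat"); return 1; }
  /* A page_compressed tablespace should show noticeably fewer
     allocated bytes than its logical size. */
  printf("logical size:    %lld bytes\n", (long long) st.st_size);
  printf("allocated space: %lld bytes\n", (long long) st.st_blocks * 512);
  return 0;
}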
On file systems that support snapshots, the allocation of blocks could be quite different, and questions like ‘how much space does a file consume’ become more complicated. We have observed that enabling O_DIRECT (MDEV-24854) caused problems for several Linux file systems.
It seems that Percona has fixed this by adding fallocate support in xbstream; it would be nice if MariaDB backported this to mbstream.
https://jira.percona.com/browse/PXB-2614
Following up on my previous comments: fallocate --dig-holes works with XFS if speculative preallocation is disabled (allocsize=64k in the mount options).
So, do we plan to fix anything in the mariabackup implementation for XFS within the scope of this bug report?
I think that tanj’s suggestion is reasonable. This should be relatively simple to implement, but a bit tricky to test.
This change looks OK to me. This will add some fallocate() system calls, but only when page_compressed tables are being used.
Also, mariabackup in 10.5 can't handle sparse files.
I didn't test 10.6.