[MDEV-22839] ROW_FORMAT=COMPRESSED vs PAGE_COMPRESSION=1 - size comparison Created: 2020-06-08  Updated: 2024-01-02

Status: Open
Project: MariaDB Server
Component/s: Server
Fix Version/s: 10.6

Type: Task Priority: Major
Reporter: Nuno Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-11068 Review which innodb_compression_algor... Closed
relates to MDEV-15528 Avoid writing freed InnoDB pages Closed
relates to MDEV-16129 InnoDB page compression not working a... Open
relates to MDEV-21727 Optimize redo logging for ROW_FORMAT=... Open
relates to MDEV-19780 Remove the TokuDB storage engine Closed
relates to MDEV-24797 Column Compression - ERROR 1265 (0100... Closed
relates to MDEV-25734 mbstream breaks page compression on XFS Closed

 Description   

I have the following tables:

Table1:
row_format=DYNAMIC
Size = 5.87 GB
Index = 3.88 GB
Free = 4 MB
Avg Row Length = 197 B
Rows = 38686411
(Actual file size: 15359541248 bytes)

.
Table2:
row_format=COMPRESSED key_block_size=8
Size = 4.26 GB
Index = 2.72 GB
Free = 3.5 MB
Avg Row Length = 86 B
Rows = 83063985
(Actual file size: 12448694272 bytes)

.
Both tables have some meta data (ints/enums) and a main text field (100% variable length, with user messages/comments).

.
As experiment, for Table1, I made the text column COMPRESSED (using pt-online-schema-change), and converted to:
row_format=DYNAMIC `PAGE_COMPRESSED`=1 `PAGE_COMPRESSION_LEVEL`=6

It became like this:
Size = 6.23 GB
Index = 6.61 GB
Free = 5 MB
Avg Row Length = 173 B
(Actual file size: 13988003840 bytes)

.
For Table2, using normal ALTER TABLE, I converted to:
row_format=DYNAMIC `PAGE_COMPRESSED`=1 `PAGE_COMPRESSION_LEVEL`=9

It became like this:
Size = 12.59 GB
Index = 7.06 GB
Free = 4 MB
Avg Row Length = 159 B
(Actual file size: 21399339008 bytes)

.
These are values from information_schema.

I understand that the actual file sizes don't matter so much (they can be sparse files),
but looking at the Size/Index, the PAGE_COMPRESSED seems to be causing the tables to be bigger, rather than smaller.

Table1 – From no compression at all, to be compressed, the data increased 1GB, and indexes doubled.
(I wanted to compress the text field only, but the innodb default configs made pt-online-schema-change also put PAGE_COMPRESSED)

Table2 – From ROW_FORMAT=COMPRESSED to maximum level of PAGE_COMPRESSION, both data and index increased considerably.

Does it make sense that PAGE_COMPRESSION, even with level 9, is compressing less than ROW_FORMAT=COMPRESSED?



 Comments   
Comment by Marko Mäkelä [ 2020-06-09 ]

Thank you for the input. MDEV-11068 had been filed earlier for something similar.

If your test includes long string columns that could be stored off-page by InnoDB, then ROW_FORMAT=COMPRESSED would gain some advantage, by compressing the singly-linked lists of pages of such columns as one zlib stream. That is similar to what MDEV-11371 implemented at the SQL layer. In page_compressed, each BLOB page in the singly-linked lists would be compressed separately, and we might punch holes on the tail of each page, to free up a little space.

The hole-punching of page_compressed tables could significantly reduce the file system performance. It was originally developed for an SSD manufacturer’s file system NVMFS. You are not the first one to question the usefulness: Wanted: a file system on which InnoDB transparent page compression works was posted in 2015.

Note: the logical file size will not change as a result of the hole-punching. You should monitor the physical file size (excluding any ‘sparse’ blocks), as reported by ls -s.

MDEV-15528 and its follow-up task MDEV-8139 should slightly improve the efficiency of page_compressed tables. When a page is freed inside the tablespace, we would deallocate the page_compressed block from the file system ("punch hole"). Before those fixes, we would not deallocate any garbage pages, for example after DROP INDEX or a large DELETE that causes the number of allocated pages to shrink.

Disclosure: I did the low-level design and the implementation of ROW_FORMAT=COMPRESSED based on a high-level idea of Heikki Tuuri, between 2005 and 2007. I believe that even that format is past its ‘best before’ date. The implementation is also complicating the operation of the buffer pool and crash recovery. Because Heikki wanted to minimize the changes to page headers, we are wasting close to 100 bytes per page. And because the page footer format that I designed uses 14-bit offsets, we can only support innodb_page_size≤16k. I did not even consider extending the format for instant column changes (MDEV-11369, MDEV-15562).

I think that we should gradually deprecate support for ROW_FORMAT=COMPRESSED, retaining read-only support to allow upgrades. This would simplify the buffer pool code further.

When it comes to page_compressed, in my opinion we should consider removing the option altogether, maybe retaining read-only support for a few major releases. Existing data files could be converted with an external tool.

Comment by Marko Mäkelä [ 2020-06-09 ]

In MariaDB 10.5, the write performance to ROW_FORMAT=COMPRESSED tables may degrade due to MDEV-12353, because until MDEV-21727 is addressed we may be writing more than the optimal amount of redo log.

Comment by Nuno [ 2020-06-09 ]

Hello,

Thank you very much for your detailed response, which I appreciate very much.

I see you want to deprecate ROW_FORMAT=COMPRESSED, which I'm ok with, as long as the alternative/successor becomes a better option, in terms of Storage and Performance.

What I don't want is to have my hands tied because:

1) my database is too large which makes it impracticable to convert to the new option, without requiring my website to go offline or tables being 100% locked

2) not being able to upgrade MariaDB because the performance will degrade due to the option I'm using

There is a similar example where something that I thought was implemented as an improvement and an online transaction, actually locked all tables while it was running – see MDEV-18385 for the innodb_defragment option.

But if it will be possible to convert my large ROW_FORMAT=COMPRESSED tables to the new option, online & smoothly without causing an impact on the uptime, that's all good for me

> When it comes to page_compressed, in my opinion we should consider removing the option altogether

I'm not sure what you tried to say here. So PAGE_COMPRESSED will be deprecated too? Or do you mean this will be the default and required for all tables?

Thank you very much.

Comment by Marko Mäkelä [ 2020-06-10 ]

I have filed MDEV-22367 for making ROW_FORMAT=COMPRESSED tables read-only. You could use online ALTER TABLE to rebuild affected tables. Another option would be to retain write support, but with a horrible compression ratio. The motivation of this exercise is to simplify the buffer pool code for improved performance.

I do not see much future for PAGE_COMPRESSED, unless the tests that should be done in the scope of MDEV-11068 surprise me.

I realize that forcing users to use an offline tool to convert data files would not allow a smooth upgrade path. Maybe we should start by throwing deprecation warnings for the page_compressed attribute, or just warn that the attribute will be ignored, and create a normal uncompressed table. Then, after several major releases, we could remove the code altogether.

RocksDB was developed with data compression in mind. I would expect you to get much better results with it. MDEV-16329 in a future release should allow a smooth upgrade path: you could perform online ALTER TABLE to change the storage engine.

Comment by Nuno [ 2020-06-10 ]

Hi Marko,

Ok, I see.
This is a total surprise for me, since in another conversation (MDEV-18586), Geoff was telling me that PAGE_COMPRESSED was a better option than ROW_FORMAT=COMPRESSED (and even the documentation was updated), so now I was in the process of upgrading to PAGE_COMPRESSED, as this was supposed to be the one I should be using. There was no signs that this would be deprecated, and I was preparing to put some downtime on my site for this conversion.

So, in summary, both ROW_FORMAT=COMPRESSED and PAGE_COMPRESSED should not be used. Fair enough.

However, my feedback is that it's a VERY bad idea to make tables read-only in a major update (10.6) like this, without first warning the clients in a previous major version (10.5). This will make many people very upset, especially those who are not able to test on a Test server and expect a smooth upgrade directly in production

Anyway - my understanding then is that I should keep ROW_FORMAT=COMPRESSED where it is already on,
and prepare to alter those tables to RocksDB.

I would like to take this opportunity now that I'm working on an upgrade, and would like to try this on my Test server.
Would you consider this stable enough in MariaDB 10.4? Would you recommend me to convert my existing tables now, or wait for when those JIRAs above are complete?

https://mariadb.com/kb/en/getting-started-with-myrocks/

I'm not familiar at all with RocksDB, but I assume none of the "innodb_*" variables will apply to RocksDB and that I will have to look into the below, am I right?

https://mariadb.com/kb/en/myrocks-system-variables/

And, does it play well with InnoDB tables?
I have mixed compressed/uncompressed tables, that need to JOIN between themselves, etc...

This feels like a forced learning about an engine I'm not familiar with (and risky to put in production, even if it works in the Test env), when InnoDB should have its own stable compression.
I'm not very sure about the idea of having to mix different engines, just to be able to have compressed tables.

Thank you very much.

Comment by Marko Mäkelä [ 2020-07-22 ]

Hi nunop, thank you for the feedback. We have not made any decision regarding compression in InnoDB yet. I do not disagree that PAGE_COMPRESSED could be a better option than ROW_FORMAT=COMPRESSED. The InnoDB code changes related to that are much simpler. The only question is if sparse files work efficiently on your file system.

Your questions on RocksDB would better be asked in a support ticket. I do not have much experience in any other storage engines than InnoDB.

I believe that compression and the ability to efficiently perform searched updates are conflicting goals.

Comment by Ian Gilfillan [ 2021-01-28 ]

marko are the MDEV-11068-related tests you mention above available to view? nunop can you supply the table structure for your examples?

Comment by Marko Mäkelä [ 2021-01-28 ]

greenman, I am not aware of anyone having developed or run benchmarks. MDEV-15528 and MDEV-8139 should have improved page_compressed space usage a little in 10.5, by punching holes over unused (freed) pages. I wonder if MDEV-8069 completely fixed the DROP TABLE issue, or whether fragmentation of page_compressed tables could cause long I/O stalls during DROP TABLE in the operating system kernel.

Alex/AT mentioned in MDEV-19780 that he is using InnoDB compression. I would like to see some performance data, preferrably including some scripts and configuration files that anyone can use to assess various aspects of performance.

Comment by Alex/AT [ 2021-01-28 ]

Hello Marko,

My worried comment in 19780 was about "I do not see much future for PAGE_COMPRESSED". Actually it works in place of TokuDB, although it's a good deal larger and slower, but still much better than uncompressed.

Our configuration is pretty much straightforward (although not 100% typical for database servers): Dell Compellent SC3020 SAN (SSD only) pushed over multiple links of 10G iSCSI, connected to XenServer pool servers where MariaDB 10.5 servers on CentOS 8 reside. The underlying filesystem is basic ext4 without any special tuning applied. MariaDB servers tuning is only related to buffer pool side, flush @ TRX commit (we disable the transaction flush and rely on each second one), log size & buffering. No special tuning otherwise.

The load patterns differ. One is heavy Zabbix installation with lots of history and hosts. Another is mobile usage pre-billing which processes and stores incoming dataflow in short amounts of time. Another is heavy logging storage (400Gb of compressed mail transfer logs). Another is realtime accessed Confluence installation. Another is heavy RADIUS accounting database, mostly written but read in bulk. And some more.

From what I see, I can tell TokuDB was the best storage for almost everything mentioned above (except for Confluence where InnoDB is a must due to system requirements). Uncompressed InnoDB performs very badly especially during read queries, saturating the SAN links. Compressed InnoDB survives well, but still behaves worse - although it is worth mentioning our links are never saturated with both TokuDB and page-compressed InnoDB.

One of the concerns of hole punching is slow file replacement on DDL. Well, this kicks in, but is totally tolerable (took under a minute for 400G mail logging file).

Comment by Alex/AT [ 2021-01-28 ]

Measuring exact performance may be pretty much tough because running synthetics won't do much. If you please can suggest some general guidelines for measuring, I may be able to do it on copies of running systems though.

Comment by Alex/AT [ 2021-01-28 ]

An example of mail logging system (~366G of data, compressed to 88G):

du -hs *
4.0K    db.opt
16K     messages.frm
88G     messages.ibd
 
ls -l
total 91811164
-rw-rw---- 1 mysql mysql           61 May  4  2020 db.opt
-rw-rw---- 1 mysql mysql        13443 Dec 13 07:23 messages.frm
-rw-rw---- 1 mysql mysql 366586363904 Jan 28 17:06 messages.ibd

TokuDB size was around 32G. Write-mostly, a multitude of transactions per second, but reads (searches) may be done in huge bulks, not always properly indexed.

+----------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name             | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| messages |          0 | PRIMARY              |            1 | id               | A         |   105755222 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          0 | uidx_host_type_mid   |            1 | host             | A         |          28 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          0 | uidx_host_type_mid   |            2 | type             | A         |          36 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          0 | uidx_host_type_mid   |            3 | mid              | A         |   105755222 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_gcTo             |            1 | gcTo             | A         |    10575522 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_type             |            1 | type             | A         |           7 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_created          |            1 | created          | A         |    52877611 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_gcFrom           |            1 | gcFrom           | A         |    21151044 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcMessageID      |            1 | gcMessageID      | A         |    52877611 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_host             |            1 | host             | A         |          28 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_updated          |            1 | updated          | A         |    52877611 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_gcClientHost     |            1 | gcClientHost     | A         |     1229711 |     NULL | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcClientIP       |            1 | gcClientIP       | A         |     2579395 |     NULL | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcOriginalTo     |            1 | gcOriginalTo     | A         |       29615 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcDSN            |            1 | gcDSN            | A         |       14295 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcStatus         |            1 | gcStatus         | A         |         712 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcStatusText     |            1 | gcStatusText     | A         |   105755222 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcProgress       |            1 | gcProgress       | A         |          12 |     NULL | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_gcSubject        |            1 | gcSubject        | A         |    26438805 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_last_created     |            1 | last             | A         |    52877611 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_last_created     |            2 | created          | A         |   105755222 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_gcDecodedSubject |            1 | gcDecodedSubject | A         |    26438805 |     1024 | NULL   | YES  | BTREE      |         |               |
| messages |          1 | idx_created_host     |            1 | created          | A         |    52877611 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_created_host     |            2 | host             | A         |   105755222 |     NULL | NULL   |      | BTREE      |         |               |
| messages |          1 | idx_mid              |            1 | mid              | A         |   105755222 |     NULL | NULL   |      | BTREE      |         |               |
+----------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

CREATE TABLE `messages` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `host` varchar(60) NOT NULL,
  `type` varchar(40) NOT NULL,
  `mid` varchar(100) NOT NULL,
  `created` datetime NOT NULL,
  `last` datetime NOT NULL,
  `updated` datetime NOT NULL,
  `messages` longtext NOT NULL,
  `gcProgress` varchar(40) DEFAULT NULL,
  `gcFrom` text DEFAULT NULL,
  `gcSubject` text DEFAULT NULL,
  `gcDecodedSubject` text DEFAULT NULL,
  `gcTo` text DEFAULT NULL,
  `gcOriginalTo` text DEFAULT NULL,
  `gcRelay` text DEFAULT NULL,
  `gcDSN` text DEFAULT NULL,
  `gcStatus` text DEFAULT NULL,
  `gcStatusText` text DEFAULT NULL,
  `gcMessageID` text DEFAULT NULL,
  `gcSize` bigint(20) DEFAULT NULL,
  `gcRecipients` bigint(20) DEFAULT NULL,
  `gcClientHost` varchar(200) DEFAULT NULL,
  `gcClientIP` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uidx_host_type_mid` (`host`,`type`,`mid`),
  KEY `idx_gcTo` (`gcTo`(1024)),
  KEY `idx_type` (`type`),
  KEY `idx_created` (`created`),
  KEY `idx_gcFrom` (`gcFrom`(1024)),
  KEY `idx_gcMessageID` (`gcMessageID`(1024)),
  KEY `idx_host` (`host`),
  KEY `idx_updated` (`updated`),
  KEY `idx_gcClientHost` (`gcClientHost`),
  KEY `idx_gcClientIP` (`gcClientIP`),
  KEY `idx_gcOriginalTo` (`gcOriginalTo`(1024)),
  KEY `idx_gcDSN` (`gcDSN`(1024)),
  KEY `idx_gcStatus` (`gcStatus`(1024)),
  KEY `idx_gcStatusText` (`gcStatusText`(1024)),
  KEY `idx_gcProgress` (`gcProgress`),
  KEY `idx_gcSubject` (`gcSubject`(1024)),
  KEY `idx_last_created` (`last`,`created`),
  KEY `idx_gcDecodedSubject` (`gcDecodedSubject`(1024)),
  KEY `idx_created_host` (`created`,`host`),
  KEY `idx_mid` (`mid`)
) ENGINE=InnoDB AUTO_INCREMENT=283182705 DEFAULT CHARSET=utf8 `PAGE_COMPRESSED`=1

Comment by Alex/AT [ 2021-01-28 ]

To add to it, I tried to apply RocksDB to the task, but it consumes 20G of RAM and dies - literally - it does something with files, moving stuff around levels in logs, but does not arrive in time and starts doing it again and again and again. Did not test with more RAM, current InnoDB instance perfectly lives with 8G in total, 4G for buffer pool, like TokuDB fit okay into its 4G cache. That's to stress RocksDB is not a replacement for 'traditional engine' compression in any way.

Comment by Nuno [ 2021-01-29 ]

Alex/AT Seems TokuDB has been doing well for you. But since it's obsolete and MariaDB has suggested "MyRocks" as the long-term migration path, have you tried MyRocks?

(I honestly never heard about any of these 3 so far, so I have no idea how similar they are, how good they are, etc...)

Comment by Marko Mäkelä [ 2021-08-02 ]

nunop, I just checked your numbers, and it seems to me that what you quote as "Actual file size" in fact is the logical file size, because each number is a multiple of the InnoDB page size.

I do not think that we have any SQL interface for reporting the allocated file size. On GNU/Linux, you can get it reported by the following:

ls -ls table1.ibd

On my system, the first reported number is the allocated size, in units of 1024 bytes. It appears to include the file system overhead (indirect blocks).

axel recently conducted some tests for MDEV-11068, measuring the compression ratio and the write throughput of uncompressed and zlib and LZ4 page_compressed tables. In that test, the winner was uncompressed tables on a thinly provisioned computational storage device (ScaleFlux), both for the compression ratio and the execution speed. With LZ4 on a conventional SSD, the write performance penalty was not too bad, but zlib at innodb_compression_level=1 still compressed the data better (and slower).

nunop, can you repeat the experiment on MariaDB 10.5 or 10.6 and get the data for the actual file size? Would using COMPRESSED columns (MDEV-11371) together with either page_compressed or thinly provisioned storage beat the ROW_FORMAT=COMPRESSED size?

One thing that may be an inherent problem with the page_compressed approach is file system fragmentation, which was mentioned years ago in a blog post. It might be less of a problem nowadays, depending on the file system and operating system kernel version, of course.

Comment by Alex/AT [ 2021-08-02 ]

As this thread still flows and I see LZ4 tests, I want to ask, may there be any chance of adding LZMA (XZ) or ZSTD to the PAGE_COMPRESSED tables in some foreseeable future?

Comment by Alex/AT [ 2021-08-02 ]

I know LZMA is already included but this option is not in RPM builds and requires customly building MariaDB i.e. for LZMA now which is less than convenient.

Comment by Nuno [ 2021-08-02 ]

Nuno, can you repeat the experiment on MariaDB 10.5 or 10.6 and get the data for the actual file size? Would using COMPRESSED columns (MDEV-11371) together with either page_compressed or thinly provisioned storage beat the ROW_FORMAT=COMPRESSED size?

I can no longer test, as I have already migrated the database to a bigger storage (for significantly more monthly expense), one of the reasons being that MariaDB no longer supports ROW_FORMAT=COMPRESSED, so we no longer use that row format.

I did test using COMPRESSED columns, at the time I was testing/rehearsing the migration, but I gave up on the idea because of the error reported below, and no one replied to me at the time:
https://jira.mariadb.org/browse/MDEV-24797

Thank you.

Comment by Marko Mäkelä [ 2021-08-17 ]

nunop, thank you. MDEV-11371 and MDEV-24797 are outside the storage engine layer, so I cannot help with that myself.

MariaDB still supports ROW_FORMAT=COMPRESSED. MDEV-22367 was filed to remove that support, but no decision has been made yet. I think that the minimum prerequisite for the removal would be that a viable replacement has been demonstrated to exist, by running extensive performance tests. I do not remember any for ROW_FORMAT=COMPRESSED, other than some Oracle internal benchmarks that were around 2007 and were not at all flattering.

Alex/AT, I agree that it would be useful to cover also LZMA and ZSTD in MDEV-11068. ZSTD would have to be implemented first. It should not be hard, and code contributions are always welcome.

Comment by Marko Mäkelä [ 2021-09-01 ]

https://lwn.net/Articles/864363/ mentions that Linux 5.15 might finally include fixes for race conditions around hole-punching, which is what page_compressed tables are using. I can imagine that when not using innodb_flush_method=O_DIRECT in both the server and any backup programs, that race condition could result in data corruption. The patchset author comments that even kernel crashes are possible.

Comment by Marko Mäkelä [ 2021-09-22 ]

As noted in MDEV-25734, file systems can behave differently with regard to operations on sparse files. (XFS would allocate more physical space than ext4fs in one case.) Therefore, when testing page_compressed tables (which rely on sparse files), the operating system kernel version and the file system type need to be documented.

Generated at Thu Feb 08 09:17:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.