Details
Type: Task
Status: Closed
Priority: Critical
Resolution: Fixed
Description
Hello,
I used to have all tables in ibdata1. Then I switched to one file per table, but the ibdata1 file kept its size.
I recently switched to MariaDB 10.1 and hoped that the defragmenting function would help to shrink the file, but it didn't work.
I optimised all tables, but the ibdata1 file did not change at all; it is still 197 GB.
MariaDB [(none)]> show status like '%defragment%';
+----------------------------------------+--------+
| Variable_name                          | Value  |
+----------------------------------------+--------+
| Innodb_defragment_compression_failures | 0      |
| Innodb_defragment_failures             | 32872  |
| Innodb_defragment_count                | 113744 |
+----------------------------------------+--------+
3 rows in set (0.01 sec)
MariaDB [(none)]> show variables like '%defragment%';
+--------------------------------------+----------+
| Variable_name                        | Value    |
+--------------------------------------+----------+
| innodb_defragment                    | ON       |
| innodb_defragment_fill_factor        | 0.900000 |
| innodb_defragment_fill_factor_n_recs | 20       |
| innodb_defragment_frequency          | 40       |
| innodb_defragment_n_pages            | 7        |
| innodb_defragment_stats_accuracy     | 0        |
+--------------------------------------+----------+
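For reference, a quick way to verify which tables still reside in the system tablespace (space 0) — a sketch, assuming the INFORMATION_SCHEMA.INNODB_SYS_TABLES view is available and the PROCESS privilege is held:

MariaDB [(none)]> select name from information_schema.innodb_sys_tables where space = 0;

Any rows besides InnoDB-internal entries (such as SYS_FOREIGN) indicate tables that still live inside ibdata1.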
Issue Links
- blocks
  - MDEV-28699 Shrink temporary tablespaces without restart (Closed)
  - MDEV-32452 InnoDB system tablespace is not shrunk on slow shutdown (Closed)
- causes
  - MDEV-34216 Possible corruption when shrinking the system tablespace on innodb_fast_shutdown=0 (Closed)
- is blocked by
  - MDEV-31373 innodb_undo_log_truncate=ON recovery results in a corrupted undo log (Closed)
- is duplicated by
  - MDEV-31462 ibdata1 shrinking (Closed)
- relates to
  - MDEV-19229 Allow innodb_undo_tablespaces to be changed after database creation (Closed)
  - MDEV-28699 Shrink temporary tablespaces without restart (Closed)
  - MDEV-29983 Deprecate innodb_file_per_table (Closed)
  - MDEV-29999 innodb_undo_log_truncate=ON is not crash safe (Closed)
  - MDEV-33112 innodb_undo_log_truncate=ON is blocking page writes (Closed)
  - MDEV-35398 Improve shrinking of system tablespace (Closed)
  - MDEV-35689 InnoDB system tables cannot be optimized or defragmented (Stalled)
  - MDEV-21952 ibdata1 file size growing in MariaDB (Closed)
  - MDEV-29986 Set innodb_undo_tablespaces=3 by default (Closed)
  - MDEV-32622 [mariadb-galera] The undo log is very large, how can there be fewer undo log logs (Closed)
  - MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages (Closed)
Activity
greenman, can you clarify in the documentation that tablespace files (including ibdata1) will not shrink as a result of defragmentation?
Thanks for the reply. Is the data actually moved to the beginning of the ibdata file? Is it difficult to add the possibility to shrink the ibdata1 file?
Documentation now specifically mentions that the file size does not shrink. martina342, the only way to shrink the existing file is to dump and reload the data. The article at https://www.thegeekstuff.com/2016/02/mysql-innodb-file-per-table/ describes your scenario of moving from innodb_file_per_table=0 to innodb_file_per_table=1.
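For the record, a minimal sketch of that dump-and-reload procedure. The paths, the service name, and the mysql_install_db re-initialization step are assumptions about a typical Linux install; verify the dump before deleting anything:

# 1. Take a logical dump of everything.
mysqldump --all-databases --routines --triggers --events > /backup/full.sql
# 2. Stop the server and move the old data directory aside.
systemctl stop mariadb
mv /var/lib/mysql /var/lib/mysql.old
# 3. Re-initialize a fresh data directory; this creates a new, minimal ibdata1.
mysql_install_db --user=mysql --datadir=/var/lib/mysql
# 4. Start the server and reload the dump.
systemctl start mariadb
mysql < /backup/full.sql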
I think that it is a reasonable request to shrink the system tablespace after all tables have been moved out of it (innodb_file_per_table=1), dedicated undo tablespaces have been created (MDEV-19229) and the change buffer has been disabled (MDEV-27734).
We have to be careful about this in order to avoid introducing the equivalent of MDEV-29999. We may have to track changes not only to FSP_SIZE but also to FSP_FREE_LIMIT or something else that reliably tells the last actually used page offset within the file.
Because some code that is shared with undo tablespace truncation is being modified here, I think that we need to pay special attention to test coverage.
Preliminary results of RQG testing on origin/bb-10.11-MDEV-14795 5751dbd168147b1294d0dbde018adcab0fd02cb4 2023-06-08T15:31:09+05:30

1. Make load, SIGKILL the server, restart fails:
[rr 1719523 12457]2023-06-08 04:51:49 0x79d72909b640[rr 1719523 12460] InnoDB: Assertion failure in file /data/Server/bb-10.11-MDEV-14795C/storage/innobase/buf/buf0dblwr.cc line 618
[rr 1719523 12462]InnoDB: Failing assertion: fil_system.sys_space->acquire()
sdp:/data1/results/1686220665/TBR-1933$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio

2. Make load, SIGKILL the server, restart passes, some tables get checked and the server crashes:
[rr 3469512 6800]mysqld: /data/Server/bb-10.11-MDEV-14795C/storage/innobase/include/fil0fil.h:653: bool fil_space_t::set_needs_flush(): Assertion `n & PENDING' failed.
sdp:/data1/results/1686220665/TBR-1934$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio

There were more hits of other assertion failures around crash recovery, but some of them have no rr trace, or excessively large rr traces.

3. Make load, SIGKILL the server, restart, check certain tables with success; the shutdown ends with a server crash:
[rr 2614894 79103]2023-06-08 5:21:23 0 [Note] InnoDB: Starting shutdown...
[rr 2614894 79122]2023-06-08 5:21:23 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/rqg/1686220665/93/1/data/ib_buffer_pool
[rr 2614894 79132]2023-06-08 5:21:23 0 [Note] InnoDB: Restricted to 185 pages due to innodb_buf_pool_dump_pct=25
[rr 2614894 79139]2023-06-08 5:21:23 0 [ERROR] InnoDB: Summed data size 1859, returned by func 30316
[rr 2614894 79160]2023-06-08 5:21:23 0 [Note] InnoDB: Buffer pool(s) dump completed at 230608 5:21:23
[rr 2614894 79169]2023-06-08 5:21:23 0 [ERROR] InnoDB: Apparent corruption in space 0 page 1460 of index `IBUF_DUMMY` of table `IBUF_DUMMY`
[rr 2614894 79173]2023-06-08 5:21:23 0 [ERROR] InnoDB: Summed data size 1859, returned by func 30316
[rr 2614894 79177]2023-06-08 5:21:23 0 [ERROR] InnoDB: Apparent corruption in space 0 page 1460 of index `IBUF_DUMMY` of table `IBUF_DUMMY`
[rr 2614894 79182]mysqld: /data/Server/bb-10.11-MDEV-14795C/storage/innobase/ibuf/ibuf0ibuf.cc:4327: dberr_t ibuf_merge_or_delete_for_page(buf_block_t*, page_id_t, ulint): Assertion `page_validate(block->page.frame, dummy_index)' failed.
sdp:/data1/results/1686220665/TBR-1932$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
Results of testing on origin/bb-10.11-MDEV-14795 873798f9c3052cdeea253f5bb398ba835346a63c 2023-06-12T16:07:47+05:30

1. Some of the problems mentioned above were fixed.
2. Other problems which showed up during testing turned out to occur on
- origin/10.11 582d0cf5b062a9aa30f555f7f6c4f6e775bca408 2023-06-10T11:14:15+03:00
- origin/preview-11.2-preview 79f9e94195dc8a007debd27ec4e8d2c15f094a97 2023-06-13T17:23:05+10:00
too. Some of them still have to be reported.
Preliminary result of testing on origin/bb-10.11-MDEV-14795 e9e34b7b6c1409de26a84bd6bdbca45257b6263b 2023-06-16T18:28:00+05:30

The failure MENT-189 shows up again: mariabackup --backup fails with
InnoDB: Assertion failure in file /data/Server/bb-10.11-MDEV-14795D/extra/mariabackup/xtrabackup.cc line 992
InnoDB: Failing assertion: opt_no_lock
sdp:/data1/results/1686921174/MENT-189_Rebirth$ _RR_TRACE_DIR=./1_clone/rr rr replay --mark-stdio
The reason is currently unknown.
origin/preview-11.2-preview 393bad0621ace7242f55e2cc5a6fb89ea9dfb8c6 2023-06-16T18:12:23+07:00 shows the same problem.
MDEV-29694 in 11.0 removed the InnoDB change buffer. If this change were ever applied to anything that supports the change buffer, then we should also think about deleting the change buffer records and clearing change buffer bitmap bits for the pages that will be trimmed off the end of the system tablespace.
I happened to notice a related-looking test failure in an IA-32 environment:
innodb.log_data_file_size '64k,innodb' w1 [ fail ]
…
2023-06-29 3:40:49 0 [Note] InnoDB: Truncating system tablespaces from 1344 to 320 pages
2023-06-29 03:40:49 0xa96f9b40 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-11.2.0/storage/innobase/fil/fil0fil.cc line 2988
InnoDB: Failing assertion: space->size == check.size
This might be fixed by acquiring fil_system.mutex earlier in the resizing code.
To better understand the maximum number of pages accessed and log records written by the shrinking operation, I came up with the following test:
rm -fr /dev/shm/data /dev/shm/rr && mkdir /dev/shm/data &&
./mariadbd --innodb-page-size=4k --innodb-log-file-size=4m --innodb-buffer-pool-size=4m --datadir /dev/shm/data --innodb-data-file-path=ibdata1:56G --bootstrap < /dev/null &&
_RR_TRACE_DIR=/dev/shm/rr rr record \
./mariadbd --innodb-page-size=4k --innodb-log-file-size=4m --innodb-buffer-pool-size=4m --datadir /dev/shm/data --innodb-data-file-path=ibdata1:12M:autoextend --bootstrap < /dev/null
rr replay /dev/shm/rr/latest-trace
There will be extent descriptor pages every innodb_page_size pages. There can be at most 2³² pages, and with the minimum innodb_page_size=4k we have the maximum number of extent descriptor pages: 2^(32-12)=2²⁰=1,048,576 pages. If these are the only pages that need to be accessed during the shrinking, then for any innodb_page_size, the worst case buffer pool consumption of the mini-transaction should be innodb_buffer_pool_size=4G, when shrinking from the maximum size (16 to 256 TiB depending on the page size) to the minimum size.
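Spelling out the arithmetic behind that bound, in the same notation as above:

extent descriptor page interval:   innodb_page_size = 4 KiB, i.e. one per 2^12 pages
maximum tablespace size:           2^32 pages
extent descriptor pages:           2^32 / 2^12 = 2^20 = 1,048,576
pinning all of them at 4 KiB each: 2^20 * 4 KiB = 4 GiB of buffer pool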
Based on the output of the above test, it looks like no log writing whatsoever is taking place, so the shrinking may not be crash-safe at the moment?
2023-07-04 10:25:30 0 [Note] InnoDB: Initializing buffer pool, total size = 4.000MiB, chunk size = 1.000MiB
2023-07-04 10:25:30 0 [Note] InnoDB: Completed initialization of buffer pool
2023-07-04 10:25:30 0 [Note] InnoDB: End of log at LSN=47046
2023-07-04 10:25:30 0 [Note] InnoDB: 128 rollback segments are active.
2023-07-04 10:25:30 0 [Note] Truncating system tablespaces from 14680064 to 3072 pages
2023-07-04 10:25:36 0 [Note] System tablespace truncated successful
2023-07-04 10:25:36 0 [Note] InnoDB: Setting O_DIRECT on file ./ibtmp1 failed
2023-07-04 10:25:36 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-07-04 10:25:36 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-07-04 10:25:36 0 [Note] InnoDB: log sequence number 47046 (memory-mapped); transaction id 4
I think that we need some additional testing along these lines. For the record, the above truncation only accessed one page of the system tablespace:
Thread 1 hit Breakpoint 2, sql_print_information (format=format@entry=0x560400f53238 "Truncating system tablespaces from %d to %d pages\n") at /mariadb/11/sql/log.cc:9447
9447	{
(rr) continue
Continuing.
2023-07-04 10:25:30 0 [Note] Truncating system tablespaces from 14680064 to 3072 pages

Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517b74) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
2270	{

Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517ae0) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
2270	{

Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517ae0) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
2270	{

Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=2, guess=0x0, mode=10, mtr=0x7ffdae517b80, err=0x7ffdae517b78) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
2270	{

Thread 1 hit Breakpoint 2, sql_print_information (format=format@entry=0x560400f60720 "System tablespace truncated successful") at /mariadb/11/sql/log.cc:9447
9447	{
Actually, the above test did write 12 bytes of log record payload (growing the LSN by 12+5 bytes), consisting of 2 log records:
bytes | explanation |
---|---|
0x24, 0x00, 0x8b, 0x80, 0x0a | EXTENDED, space 0, page 0xc00, TRIM_PAGES (shrinking the system tablespace to 0xc00*0x1000 bytes = 12MiB) |
0x36, 0x00, 0x00, 0x2f, 0x00, 0x0c, 0x00 | WRITE, space 0, page 0, offset 0x2f, bytes 0x000c00, updating the 24 least significant bits of FSP_SIZE |
In fact, I think that the log record size optimization for updating FSP_SIZE must be disabled, so that the file size change may be crash-safe. That is, the second log record should have been one byte longer:
0x37, 0x00, 0x00, 0x2e, 0x00, 0x00, 0x0c, 0x00 | WRITE, space 0, page 0, offset 0x2e (FSP_SIZE), 32 bits |
Here is a script that extends the system tablespace with actual data, so that fil_system.sys_space->free_limit will grow from its minimum value. It works on a debug build of the server:
#!/bin/sh
set -eux
rm -fr /dev/shm/data /dev/shm/rr
mkdir /dev/shm/data
echo "create database test;
use test;
create table t(a int,b char(255) not null default'')
engine=innodb stats_persistent=0;
set unique_checks=0,foreign_key_checks=0;
set global innodb_limit_optimistic_insert_debug=2;
insert into t(a) select * from seq_1_to_1000000;
drop table t;"|
./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4g --datadir /dev/shm/data --innodb-fast-shutdown=0 --innodb-file-per-table=0 --bootstrap
_RR_TRACE_DIR=/dev/shm/rr rr record \
./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4m --datadir /dev/shm/data --bootstrap < /dev/null
I tested today’s revised version with that:
23aadffa0ed8ffb037ebbb5f44ce8ea76c38fe6c
2023-07-07 16:02:49 0 [Note] Truncating system tablespaces from 101376 to 3072 pages
2023-07-07 16:02:49 0 [Note] System tablespace truncated successful
Now the buffer pool usage has been optimized, and the mini-transaction only comprises the tablespace latch and a latch on page 0 of the system tablespace. It will write 125 bytes of redo log for the shrinking, as follows:
record | explanation | notes |
---|---|---|
0x36, 0x0, 0x0, 0x2f, 0x0, 0xc, 0x0 | WRITE 0:0, offset 0x2f: 0x00c00 | still wrong: all 4 bytes of FSP_SIZE must be written for the recv_size logic to work! |
0xb4, 0x1, 0x0, 0xc, 0x0 | WRITE, same page, 1 byte after FSP_SIZE, 3 bytes 0x00c00. | We could use MEMMOVE for this, supposedly copying free_limit from FSP_SIZE |
0xb4, 0x83, 0xba, 0x5, 0x16 | WRITE, same page, a bit further away, 2 bytes 0x0516 | |
0x34, 0x0, 0x0, 0x41, 0x58 | WRITE 0:0, offset 0x41, byte 0x58. | The last 2 records could be reordered, to save 2 bytes. |
0xb3, 0x83, 0xaf, 0x6e | WRITE, same page, 1 byte 0x6e | |
0x34, 0x0, 0x0, 0x41, 0x57 | WRITE 0:0 | rewriting byte 0x41 again! (Could this be the length of some list?) |
0xb3, 0x83, 0xaf, 0xc6 | WRITE, same page, 1 byte 0xc6 | Overwriting the 0x6e from the previous write! |
0x34, 0x0, 0x0, 0x41, 0x56 | yet another WRITE of 0x41 | |
0xb6, 0x83, 0xac, 0x10, 0x0, 0x0, 0xf6 | overwriting the 0xc6 from the previous write! | |
0x34, 0x0, 0x0, 0x41, 0x55 | yet another WRITE of 0x41 | |
0xb6, 0x7, 0x0, 0x0, 0x0, 0x4, 0x66 | WRITE, same page | not overwriting anything previously written |
0xb8, 0x83, 0x9e, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | WRITE, same page | resetting the 4 bytes to FIL_NULL could be shorter with a MEMSET |
0x35, 0x0, 0x0, 0x40, 0x0, 0x8 | yet another WRITE of 0x41 (and 0x40) | |
0xb5, 0x12, 0x0, 0x0, 0x1, 0xa6 | WRITE, same page | looks OK |
0xb8, 0x80, 0xce, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | WRITE, same page | OK, but could use MEMSET |
0x34, 0x0, 0x0, 0x51, 0x1 | WRITE of 0x51 | |
0xc4, 0x84, 0xec, 0x58, 0x0 | MEMSET, same page, 0x58 bytes to 0x00 | |
0x46, 0x0, 0x0, 0x84, 0xe6, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes! |
0x46, 0x0, 0x0, 0x84, 0x8e, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes! |
0x46, 0x0, 0x0, 0x84, 0x36, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes! |
0x24, 0x0, 0x8b, 0x80, 0xa | EXTENDED, page 0xc00, TRIM_PAGES | OK |
If it is guaranteed that we will only ever modify at most 2 extent descriptor pages, then we could keep copies of both those pages and emit the difference, like we do in btr_page_reorganize_low() for B-tree pages.
I will conduct another test where the system tablespace will be shrunk to something above 0x1000 or 0x2000 pages, so that two extent descriptor pages 0 and 0x1000 or 0x2000 will be modified.
Another test:
#!/bin/sh
set -eux
rm -fr /dev/shm/data /dev/shm/rr
mkdir /dev/shm/data
echo "create database test;
use test;
create table t(a int,b char(255) not null default'')
engine=innodb stats_persistent=0;
set unique_checks=0,foreign_key_checks=0;
set global innodb_limit_optimistic_insert_debug=2;
insert into t(a) select * from seq_1_to_100000;
create table u like t;
insert into u(a) select * from seq_1_to_1000000;
drop table u;"|
./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4g --datadir /dev/shm/data --innodb-fast-shutdown=0 --innodb-file-per-table=0 --bootstrap
_RR_TRACE_DIR=/dev/shm/rr rr record \
./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4m --datadir /dev/shm/data --bootstrap < /dev/null

2023-07-07 16:51:57 0 [Note] Truncating system tablespaces from 117760 to 10240 pages
2023-07-07 16:51:58 0 [Note] System tablespace truncated successful
When debugging the rr replay, set a watchpoint on fil_system.sys_space.free_limit to see the mtr contents. In this case, we expect at least the pages 0 and 0x2000 to be modified. Somewhat unexpectedly, the m_memo also contains a latch for page 0x1000, which was not modified. Had the allocation patterns been different, this page might also have been modified. This time, 209 bytes of mtr.m_log were produced.
I am interested in these numbers because I want to make sure that we will not run out of buffer pool when shrinking a very large system tablespace to a very small size, and that we do not exceed the maximum size of a mini-transaction, which I think is something like 4 megabytes. Let me decode the log:
record | explanation | notes |
---|---|---|
0x36, 0x0, 0x0, 0x2f, 0x0, 0x28, 0x0 | updating FSP_SIZE to 0x2800 (10240) | must start at 0x2e |
0xb4, 0x1, 0x0, 0x28, 0x0 | probably updating free_limit | this and the preceding could use a single MEMSET record, copying the same 4 bytes twice |
0xb4, 0xd, 0x0, 0x20, 0x0 | WRITE, same page | |
0x3b, 0x0, 0x9f, 0x80, 0x82, 0xde, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | write, page 0:0x2000, offset 0x35e | could use MEMSET, and better logged after logging all changes to page 0:0 |
0x34, 0x0, 0x0, 0x41, 0x59 | WRITE, page 0:0 | |
0xb2, 0x5, 0xb6 | WRITE, same page | |
0x34, 0x0, 0x0, 0x41, 0x58 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb3, 0x4, 0x4, 0xe | WRITE, same page | overwriting 0x__b6 with 0x040e |
0x34, 0x0, 0x0, 0x41, 0x57 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb2, 0x5, 0x66 | WRITE, same page | overwriting 0x040e with 0x0466 |
0x34, 0x0, 0x0, 0x41, 0x56 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb2, 0x5, 0xbe | WRITE, same page | overwriting 0x0466 with 0x04be |
0x34, 0x0, 0x0, 0x41, 0x55 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb3, 0x4, 0x5, 0x16 | WRITE, same page | overwriting 0x04be with 0x0516 |
0x34, 0x0, 0x0, 0x41, 0x54 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb2, 0x5, 0x6e | WRITE, same page | overwriting 0x0516 with 0x056e |
0x34, 0x0, 0x0, 0x41, 0x53 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb2, 0x5, 0xc6 | WRITE, same page | overwriting 0x056e with 0x05c6 |
0x34, 0x0, 0x0, 0x41, 0x52 | WRITE, page 0:0 | overwriting byte 0x41 |
0xb5, 0x2, 0x30, 0x0, 0x0, 0xf6 | WRITE, same page | overwriting 0x____05c6 with 0x3000f6 |
0x34, 0x0, 0x0, 0x41, 0x51 | WRITE, page 0:0 | overwriting byte 0x41 |
0xc3, 0x0, 0x4, 0xff | MEMSET, same page, 4 bytes of 0xff | this is how FIL_NULL should be written |
0xc3, 0x2, 0x4, 0xff | MEMSET, 2 bytes after the previous one | also writing FIL_NULL |
0x35, 0x0, 0x0, 0x40, 0x0, 0x0 | write, page 0:0 | overwriting bytes 0x40 and 0x41 |
0x39, 0x0, 0x9f, 0x80, 0x80, 0x26, 0x0, 0x0, 0x1, 0xa6 | WRITE, page 0:0x2000, offset 0xa6, 0x000001a6 | also the writes to page 0:0x2000 should be sorted by offset |
0x37, 0x0, 0x0, 0x81, 0x27, 0x0, 0x20, 0x0 | WRITE, page 0:0, offset 0x1a7 | |
0x34, 0x0, 0x0, 0x51, 0x3 | WRITE, offset 0x51 | 2 bytes wasted again due to misordered offset |
0x47, 0x0, 0x9f, 0x80, 0x85, 0x3e, 0x58, 0x0 | MEMSET, page 0:0x2000, offset 0x5be, 0x58 bytes 0x00 | |
0x47, 0x0, 0x9f, 0x80, 0x84, 0xe6, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x47, 0x0, 0x9f, 0x80, 0x84, 0x8e, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x47, 0x0, 0x9f, 0x80, 0x84, 0x36, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x47, 0x0, 0x9f, 0x80, 0x83, 0xde, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x47, 0x0, 0x9f, 0x80, 0x83, 0x86, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x47, 0x0, 0x9f, 0x80, 0x83, 0x2e, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x47, 0x0, 0x9f, 0x80, 0x82, 0xd6, 0x58, 0x0 | MEMSET, page 0:0x2000 | 3 bytes wasted again |
0x24, 0x0, 0xa7, 0x80, 0xa | EXTENDED, page 0x2800, TRIM_PAGES | OK |
This is not too bad, but I would like to improve this to have a known worst case. If I understood correctly, the worst case should be as follows:
number of pages modified | 2 (page 0 and the last surviving extent descriptor) |
---|---|
number of additional pages accessed | at most 1 at a time (one extent descriptor page) |
amount of redo log written | 2*innodb_page_size (to rewrite both extent descriptor pages) |
I don’t think that there currently is any clear limit on the amount of redo log written. The easiest approach would be to keep copies of the 1 or 2 modified pages and optimize the log writes by comparing the pages. Why 2 pages? We assume that no extent descriptor page before the last surviving one can contain pointers to extent descriptor pages that would be discarded by shrinking the tablespace. I hope that thiru can add some debug assertions to prove this, or point to existing debug assertions for checking this.
One more thing that we discussed is a way to avoid truncation and expansion on every startup. I would not want to add any start-up parameter for that. I think that a good rule would be: if the size of the system tablespace could be shrunk by more than 1/8 (12.5%) during startup, then it will be shrunk. Currently this check is missing: we always try to shrink, no matter what. It could be useful to add some DBUG_EXECUTE_IF to enable the "always shrink" logic for testing.
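As a worked instance of that rule, reusing the page counts from the IA-32 test quoted earlier (the second pair of numbers is made up for illustration):

1344 pages shrinkable to 320 pages: (1344 - 320) / 1344 ≈ 76% > 12.5%, so shrink at startup.
1344 pages shrinkable to 1300 pages: 44 / 1344 ≈ 3.3% < 12.5%, so leave the file alone.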
Even though the extent descriptor page 0x1000 was not modified in my above test, thiru says that if the page allocation patterns had been different, this could be needed.
The actual worst-case limit of the number of pages latched (and possibly modified) is truncated_size_in_pages/innodb_page_size.
The worst case would be shrinking an innodb_page_size=4k tablespace so that the last possible descriptor page 0xfffff000 is removed. All 1,048,575 descriptor pages (every 0x1000 pages) from 0x00000000 to 0xffffe000 would have to be accessed. If we held latches on all these pages at the same time, we would need 4GiB of buffer pool, which is more than 1,000 times the minimum innodb_buffer_pool_size. So, we should avoid doing that.
The typical amount of redo log written for one page should be much smaller than the page size, hopefully something like 50 bytes. If the maximum mini-transaction size is 4MiB, that should be enough for modifying some 80,000 extent descriptor pages.
I do not think that the worst case would involve modifying all 1,048,575 descriptor pages. There is only a limited number of list nodes in a descriptor page, let’s call it N. So, at most N pages can have pointers to the last descriptor page and may have to be modified. If we shrink the tablespace so that the M last descriptor pages ending at 0xfffff000 are modified, then we may have to modify N*M surviving descriptor pages. What is the maximum possible M*N?
Because there are 256 extent descriptor list elements (XDES_FLST_NODE) per extent descriptor page, shrinking the system tablespace by 1 extent will involve modifying at most 256 other extent descriptor pages. Shrinking by 2 extents may require modifying up to 2*256=512 extent descriptor pages.
Close to the worst case should be to shrink a tablespace with innodb_page_size=4k from the maximum number of extents (1,048,576) by something like 3,584 extents to 1,044,992 extents. If everything is laid out in the pessimal way, this would involve modifying 3584*256=917504 extent descriptor pages, requiring 3.5 GiB of buffer pool. If the maximum mini-transaction log size is 2MiB (before MDEV-14425, it might be even less), we would only have some 2.28 bytes per page available for logging the modifications to each page. That is clearly not enough, since the minimum size of the record header is 3 bytes (4 bytes for logging changes to any other extent descriptor page than the first one).
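Spelling out the arithmetic of that pessimal layout:

descriptor pages to modify: 3,584 * 256 = 917,504
buffer pool to pin them:    917,504 * 4 KiB = 3.5 GiB
log budget per page:        2 MiB / 917,504 ≈ 2.29 bytes
minimum record header:      3 bytes (4 bytes for any page other than the first)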
We clearly must implement some logic to determine whether the shrinking operation is possible at all (whether we have enough buffer pool and log space available). It can succeed only if the system tablespace is being shrunk by a small enough amount, or to a small enough final size. We must validate everything first, before committing to shrink the tablespace.
Furthermore, to avoid any bugs or extra overhead at startup related to this, it would be good to make this feature opt-in, that is, only attempt shrinking the system tablespace if innodb_data_file_path contains an extra attribute :autoshrink to enable the attempt.
I think that implementing the following will address the biggest concerns. I verified that the first two items of the list were done today:
- Write all 32 bits of FSP_SIZE at 0x2e.
- Make this opt-in by requiring an :autoshrink attribute to be added to innodb_data_file_path (see the example after this list).
- Check that we will not run out of buffer pool.
- Check that we will not write too much redo log: the maximum mini-transaction size in the MDEV-14425 format should be 2MiB, including the 5 bytes of terminator and checksum.
- If either limit is exceeded, revert any changes to the buffer pool and discard the mini-transaction log.
- For test purposes, introduce some debug parameter or DBUG_EXECUTE_IF to make the maximum mini-transaction size smaller, say, 100 to 1000 bytes.
- Add debug code to invoke flst_validate() on all the extent descriptor lists in the system tablespace, before and after attempting truncation.
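For illustration, opting in would then look like this (mirroring the startup option quoted in the RQG test report below; the sizes are only examples):

./mariadbd --innodb-data-file-path=ibdata1:12M:autoextend:autoshrink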
Optimizing the redo log size is optional and might be done later, if there is real customer demand for it. I would anticipate other demands for follow-up improvements first, such as these:
- Defragmenting or optimizing the system tables, to ensure that all data will be stored near the start of the system tablespace.
- Repairing a system tablespace to minimal size by copying the actually needed contents, excluding things like the SYS_TABLESPACES and SYS_DATAFILES tables that existed between MariaDB Server 10.0 and 10.5 until MDEV-22343 removed them.
origin/bb-11.2-MDEV-14795_1 1779cb1af979cad1abe9d5c3ed4d37b6503175aa 2023-07-11T20:25:49+05:30 + some patch (build with debug+asan+optimization Og) behaved relatively well in RQG testing. I observed many bad effects, but they occurred or still occur in preview-11.2, in 11.2 b8c039fad16adf9cbc6dac2e34cdeb772b9065a0 2023-07-07T09:05:33+10:00, and sometimes in other trees too.

The main testing campaign consisted of ~9900 tests which all used the server startup option "--mysqld=--innodb_data_file_path=ibdata1:1M:autoextend:autoshrink".

Statistics from the remaining test results:
smallest system tablespace ever shrunk: 1024 pages
biggest system tablespace ever shrunk: 41152 pages
biggest shrink operation: ~16000 pages truncated
smallest shrink operation: ~250 pages truncated
origin/bb-11.2-MDEV-14795_1 6575d969aef4d06fb529f467f6c9d3ca60b0bb53 2023-07-17T10:43:15+05:30 (build type RelWithDebInfo) performed well in RQG testing. No new problems.
Statistics from the remaining test results:
smallest system tablespace ever shrunk: 1024 pages
biggest system tablespace ever shrunk: 16640 pages
biggest shrink operation: ~16000 pages truncated
smallest shrink operation: ~832 pages truncated
I assume that the difference from the statistics of the previous run is caused by the different source tree.
A build with debug+asan+optimization Og showed numbers similar to those of the RelWithDebInfo build.
I posted some review comments, mostly about the formatting of messages, and code that is not covered by mtr tests. I think that most code should be covered if you create a debug-instrumented big_test that will use innodb_page_size=4k and extend the system tablespace to slightly more than 16 MiB (4096*4096 bytes). With innodb_limit_optimistic_insert_debug=2 (like my previously posted script does) it should be relatively fast. I think that the mandatory MSAN builder should run such a test.
Tomorrow, I will review the logic to "roll back" a mini-transaction when the size constraints are exceeded.
The general logic looks good to me. I think that this needs one more review round after the code coverage in the regression test suite has been improved.
The code coverage looks acceptable to me now. I made some minor suggestions on improving the out-of-memory detection. This should be OK to push after addressing the last round of my comments.
origin/bb-11.2-MDEV-14795 f9003c73a15d892423845a2df96c8bd73bbe96ca 2023-08-01T19:43:04+05:30 performed well in RQG testing. The failures observed occur on other trees too, are already in JIRA, or are not caused by MDEV-14795.
Together with MDEV-19229, this allows the storage space consumed by the InnoDB system tablespace to be reclaimed.
The defragmentation that is mentioned in the Description never shrunk any files; it only tried to move some B-tree pages within a data file in an attempt to improve performance on HDDs. That code was removed in MDEV-30545.
Does this change mean that there will be a delay in startup for particularly large ibdata1 files?
i.e. I've got a Galera cluster where ibdata1 has over time grown to > 300GB, and it definitely doesn't need to be this big with `file_per_table` having been set since database creation.
psumner, shrinking the system tablespace involves:
- traversing every allocation bitmap page (pages 0, innodb_page_size, 2*innodb_page_size…) within the old size,
- adjusting the bitmap pages within the shrunk size,
- durably writing log for the change, and
- shrinking the last system tablespace file with ftruncate().
When using the default innodb_page_size=16k, there will be an allocation bitmap page every 16384*16384 bytes (256 megabytes, or a quarter gigabyte). Shrinking a 300GiB file would involve traversing 1,200 bitmap pages. In an extreme case, the file could be shrunk to 256 megabytes or less, that is, we would write log covering just the first page of the file. In such an extreme case, the time should be dominated by the file system operation. Truncating or deleting a heavily fragmented file can take considerable time; how much depends on the type of storage, the file system, and the operating system kernel version.
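The arithmetic behind the 1,200-page figure:

one bitmap page covers 16384 pages * 16384 bytes/page = 256 MiB of data file
300 GiB / 256 MiB per bitmap page = 1,200 bitmap pages to traverse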
By popular demand, I filed MDEV-32452 so that this cleanup can be executed on shutdown rather than startup.
I don't think it has ever been claimed that ibdata1 would shrink as the result of defragmentation; but I find that our current documentation is not at all straightforward in regard to this question, so I'm assigning it to jplindst to clarify, especially in the KB.