MariaDB Server
MDEV-14795

InnoDB system tablespace cannot be shrunk

Details

    Description

      Hello,
      I used to have all tables in ibdata1, then I switched to one file per table, but the ibdata1 file kept its size.
      I recently switched to MariaDB 10.1 and hoped that the defragmenting function would help to shrink the file, but it didn't work.
      I optimized all tables, but the ibdata1 file did not change at all; it is still 197 GB.

      MariaDB [(none)]> show status like '%defragment%';
      +----------------------------------------+--------+
      | Variable_name                          | Value  |
      +----------------------------------------+--------+
      | Innodb_defragment_compression_failures | 0      |
      | Innodb_defragment_failures             | 32872  |
      | Innodb_defragment_count                | 113744 |
      +----------------------------------------+--------+
      3 rows in set (0.01 sec)
      

      MariaDB [(none)]> show variables like '%defragment%';
      +--------------------------------------+----------+
      | Variable_name                        | Value    |
      +--------------------------------------+----------+
      | innodb_defragment                    | ON       |
      | innodb_defragment_fill_factor        | 0.900000 |
      | innodb_defragment_fill_factor_n_recs | 20       |
      | innodb_defragment_frequency          | 40       |
      | innodb_defragment_n_pages            | 7        |
      | innodb_defragment_stats_accuracy     | 0        |
      +--------------------------------------+----------+
      

      Activity

            I don't think it has ever been claimed that the ibdata1 would shrink as the result of defragmentation; but I find that our current documentation is not at all straightforward in regard to this question, so I'm assigning it to jplindst to clarify, especially in the KB.

            elenst Elena Stepanova added a comment -

            greenman Can you clarify in the documentation that tablespace files (including ibdata1) will not shrink as a result of defragmentation?

            jplindst Jan Lindström (Inactive) added a comment -
            martina342 martina342 added a comment -

            Thanks for the reply. Is the data actually moved to the beginning of the ibdata file? Is it difficult to add the possibility to shrink the ibdata1 file?

            greenman Ian Gilfillan added a comment -

            Documentation now specifically mentions that the file size does not shrink. martina342, the only way to shrink the existing file is by dumping and reloading. The article at https://www.thegeekstuff.com/2016/02/mysql-innodb-file-per-table/ seems to describe your scenario of moving from innodb_file_per_table=0 to innodb_file_per_table=1.


            I think that it is a reasonable request to shrink the system tablespace after all tables have been moved out of it (innodb_file_per_table=1), dedicated undo tablespaces have been created (MDEV-19229) and the change buffer has been disabled (MDEV-27734).

            marko Marko Mäkelä added a comment -

            We have to be careful about this in order to avoid introducing the equivalent of MDEV-29999. We may have to track changes not only to FSP_SIZE but also to FSP_FREE_LIMIT or something else that reliably tells the last actually used page offset within the file.

            marko Marko Mäkelä added a comment -

            Because some code that is shared with undo tablespace truncation is being modified here, I think that we need to pay special attention to test coverage.

            marko Marko Mäkelä added a comment -
            mleich Matthias Leich added a comment - - edited

            Preliminary results of RQG testing on origin/bb-10.11-MDEV-14795 5751dbd168147b1294d0dbde018adcab0fd02cb4 2023-06-08T15:31:09+05:30
             
            1. make load, SIGKILL the server, restart fails
            [rr 1719523 12457]2023-06-08 04:51:49 0x79d72909b640[rr 1719523 12460]  InnoDB: Assertion failure in file /data/Server/bb-10.11-MDEV-14795C/storage/innobase/buf/buf0dblwr.cc line 618
            [rr 1719523 12462]InnoDB: Failing assertion: fil_system.sys_space->acquire()
            sdp:/data1/results/1686220665/TBR-1933$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
             
            2. make load, SIGKILL the server, restart passes, some tables get checked and the server crashes
            [rr 3469512 6800]mysqld: /data/Server/bb-10.11-MDEV-14795C/storage/innobase/include/fil0fil.h:653: bool fil_space_t::set_needs_flush(): Assertion `n & PENDING' failed.
            sdp:/data1/results/1686220665/TBR-1934$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
             
             
            There were more hits of other asserts around crash recovery,
            but some of them have no rr trace, or considerably bigger rr traces.
             
            3. make load, SIGKILL the server, restart, check certain tables with success, shutdown which ends with server crash
             [rr 2614894 79103]2023-06-08  5:21:23 0 [Note] InnoDB: Starting shutdown...
             [rr 2614894 79122]2023-06-08  5:21:23 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/rqg/1686220665/93/1/data/ib_buffer_pool
             [rr 2614894 79132]2023-06-08  5:21:23 0 [Note] InnoDB: Restricted to 185 pages due to innodb_buf_pool_dump_pct=25
             [rr 2614894 79139]2023-06-08  5:21:23 0 [ERROR] InnoDB: Summed data size 1859, returned by func 30316
             [rr 2614894 79160]2023-06-08  5:21:23 0 [Note] InnoDB: Buffer pool(s) dump completed at 230608  5:21:23
             [rr 2614894 79169]2023-06-08  5:21:23 0 [ERROR] InnoDB: Apparent corruption in space 0 page 1460 of index `IBUF_DUMMY` of table `IBUF_DUMMY`
             [rr 2614894 79173]2023-06-08  5:21:23 0 [ERROR] InnoDB: Summed data size 1859, returned by func 30316
             [rr 2614894 79177]2023-06-08  5:21:23 0 [ERROR] InnoDB: Apparent corruption in space 0 page 1460 of index `IBUF_DUMMY` of table `IBUF_DUMMY`
             [rr 2614894 79182]mysqld: /data/Server/bb-10.11-MDEV-14795C/storage/innobase/ibuf/ibuf0ibuf.cc:4327: dberr_t ibuf_merge_or_delete_for_page(buf_block_t*, page_id_t, ulint): Assertion `page_validate(block->page.frame, dummy_index)' failed.
              sdp:/data1/results/1686220665/TBR-1932$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
            
            


            Results of testing on origin/bb-10.11-MDEV-14795 873798f9c3052cdeea253f5bb398ba835346a63c 2023-06-12T16:07:47+05:30
            1. Some of the problems mentioned above were fixed.
            2. Other problems which showed up during testing turned out to be on
               - origin/10.11 582d0cf5b062a9aa30f555f7f6c4f6e775bca408 2023-06-10T11:14:15+03:00
               - origin/preview-11.2-preview 79f9e94195dc8a007debd27ec4e8d2c15f094a97 2023-06-13T17:23:05+10:00
               too. Some of them have yet to be reported.
            

            mleich Matthias Leich added a comment -
            mleich Matthias Leich added a comment - - edited

            Preliminary result of testing on origin/bb-10.11-MDEV-14795 e9e34b7b6c1409de26a84bd6bdbca45257b6263b 2023-06-16T18:28:00+05:30
            The failure MENT-189
                mariabackup --backup fails with
                InnoDB: Assertion failure in file /data/Server/bb-10.11-MDEV-14795D/extra/mariabackup/xtrabackup.cc line 992
                InnoDB: Failing assertion: opt_no_lock
            shows up again.
            sdp:/data1/results/1686921174/MENT-189_Rebirth$ _RR_TRACE_DIR=./1_clone/rr rr replay --mark-stdio
            The reason is currently unknown.
            origin/preview-11.2-preview 393bad0621ace7242f55e2cc5a6fb89ea9dfb8c6 2023-06-16T18:12:23+07:00 shows the same problem.
            


            MDEV-29694 in 11.0 removed the InnoDB change buffer. If this change were ever applied to anything that supports the change buffer, then we should also think about deleting the change buffer records and clearing change buffer bitmap bits for the pages that will be trimmed off the end of the system tablespace.

            marko Marko Mäkelä added a comment -

            I happened to notice a related-looking test failure in an IA-32 environment:

            innodb.log_data_file_size '64k,innodb'   w1 [ fail ]
            2023-06-29  3:40:49 0 [Note] InnoDB: Truncating system tablespaces from 1344 to 320 pages
            2023-06-29 03:40:49 0xa96f9b40  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-11.2.0/storage/innobase/fil/fil0fil.cc line 2988
            InnoDB: Failing assertion: space->size == check.size
            

            This might be fixed by acquiring fil_system.mutex earlier in the resizing code.
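
            The assertion above compares two size snapshots that are taken in separate steps. A generic sketch (plain Python, with made-up names; not the InnoDB code) of why holding one mutex across both the resize and the consistency check gives an atomic snapshot, whereas taking it only around each step could observe a half-done resize:

            ```python
            # Generic check-then-act sketch: `size` is the cached tablespace size
            # and `file_pages` stands in for what a stat() of the file reports.
            # Both the truncation and the consistency check take the same mutex,
            # so the check can never see one value updated but not the other.
            import threading

            class Space:
                def __init__(self, size):
                    self.mutex = threading.Lock()
                    self.size = size        # cached size in pages
                    self.file_pages = size  # on-disk size in pages

                def truncate(self, new_size):
                    with self.mutex:        # update both views atomically
                        self.size = new_size
                        self.file_pages = new_size

                def check(self):
                    with self.mutex:        # snapshot both views atomically
                        return self.size == self.file_pages

            space = Space(1344)
            t = threading.Thread(target=space.truncate, args=(320,))
            t.start()
            ok = all(space.check() for _ in range(10000))
            t.join()
            print(ok)  # True: the check never observes a torn resize
            ```

            With the mutex taken only around each individual read, a concurrent truncation could land between reading the two values and make the equality check fail, which is the race pattern the earlier mutex acquisition would close.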

            marko Marko Mäkelä added a comment -

            To better understand the maximum number of pages accessed and log records written by the shrinking operation, I came up with the following test:

            rm -fr /dev/shm/data /dev/shm/rr && mkdir /dev/shm/data &&
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4m --innodb-buffer-pool-size=4m --datadir /dev/shm/data --innodb-data-file-path=ibdata1:56G --bootstrap < /dev/null &&
            _RR_TRACE_DIR=/dev/shm/rr rr record \
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4m --innodb-buffer-pool-size=4m --datadir /dev/shm/data --innodb-data-file-path=ibdata1:12M:autoextend --bootstrap < /dev/null
            rr replay /dev/shm/rr/latest-trace
            

            There will be extent descriptor pages every innodb_page_size pages. There can be at most 2³² pages, and with the minimum innodb_page_size=4k we have the maximum number of extent descriptor pages: 2^(32-12)=2²⁰=1,048,576 pages. If these are the only pages that need to be accessed during the shrinking, then for any innodb_page_size, the worst case buffer pool consumption of the mini-transaction should be innodb_buffer_pool_size=4G, when shrinking from the maximum size (16 to 256 TiB depending on the page size) to the minimum size.
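
            The arithmetic above can be checked with a few lines (assuming, as stated, one extent descriptor page every innodb_page_size pages and a 32-bit page number limit):

            ```python
            # Back-of-the-envelope check of the XDES page math from the comment
            # above: at most 2^32 pages per tablespace, and one extent descriptor
            # page every innodb_page_size pages.

            def max_xdes_pages(page_size: int) -> int:
                """Extent descriptor pages in a maximally sized tablespace."""
                max_pages = 2 ** 32            # pages use a 32-bit page number
                return max_pages // page_size  # one XDES page per page_size pages

            n = max_xdes_pages(4096)           # minimum innodb_page_size=4k
            print(n)                           # 2^(32-12) = 1048576 XDES pages
            print(n * 4096)                    # worst case: 4 GiB of buffer pool
            ```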

            Based on the output of the above test, it looks like no log writing whatsoever is currently taking place, so the shrinking may not be crash-safe at the moment?

            2023-07-04 10:25:30 0 [Note] InnoDB: Initializing buffer pool, total size = 4.000MiB, chunk size = 1.000MiB
            2023-07-04 10:25:30 0 [Note] InnoDB: Completed initialization of buffer pool
            2023-07-04 10:25:30 0 [Note] InnoDB: End of log at LSN=47046
            2023-07-04 10:25:30 0 [Note] InnoDB: 128 rollback segments are active.
            2023-07-04 10:25:30 0 [Note] Truncating system tablespaces from 14680064 to 3072 pages
             
            2023-07-04 10:25:36 0 [Note] System tablespace truncated successful
            2023-07-04 10:25:36 0 [Note] InnoDB: Setting O_DIRECT on file ./ibtmp1 failed
            2023-07-04 10:25:36 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
            2023-07-04 10:25:36 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
            2023-07-04 10:25:36 0 [Note] InnoDB: log sequence number 47046 (memory-mapped); transaction id 4
            

            I think that we need some additional testing along these lines. For the record, the above truncation only accessed one page of the system tablespace:

            Thread 1 hit Breakpoint 2, sql_print_information (format=format@entry=0x560400f53238 "Truncating system tablespaces from %d to %d pages\n") at /mariadb/11/sql/log.cc:9447
            9447	{
            (rr) continue
            Continuing.
            2023-07-04 10:25:30 0 [Note] Truncating system tablespaces from 14680064 to 3072 pages
             
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517b74) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517ae0) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517ae0) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=2, guess=0x0, mode=10, mtr=0x7ffdae517b80, err=0x7ffdae517b78) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 2, sql_print_information (format=format@entry=0x560400f60720 "System tablespace truncated successful") at /mariadb/11/sql/log.cc:9447
            9447	{
            

            marko Marko Mäkelä added a comment -

            Actually, the above test did write 12 bytes of log record payload (growing the LSN by 12+5 bytes), consisting of 2 log records:

            bytes | explanation
            0x24, 0x00, 0x8b, 0x80, 0x0a | EXTENDED, space 0, page 0xc00, TRIM_PAGES (shrinking the system tablespace to 0xc00*0x1000 bytes = 12MiB)
            0x36, 0x00, 0x00, 0x2f, 0x00, 0x0c, 0x00 | WRITE, space 0, page 0, offset 0x2f, bytes 0x000c00, updating the 24 least significant bits of FSP_SIZE

            In fact, I think that the log record size optimization for updating FSP_SIZE must be disabled, so that the file size change may be crash-safe. That is, the second log record should have been one byte longer:

            0x37, 0x00, 0x00, 0x2e, 0x00, 0x00, 0x0c, 0x00 | WRITE, space 0, page 0, offset 0x2e (FSP_SIZE), 32 bits
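
            A toy decoder, reconstructed only from the records quoted above (high nibble 0x3 = WRITE, low nibble = number of bytes that follow, then space id, page number, offset, payload), can illustrate the 24-bit vs 32-bit difference. This is not the full mlog encoding — real records use variable-length compressed integers — just enough for these small examples:

            ```python
            # Toy decoder for the standalone WRITE records shown above.
            # Assumed layout (inferred from the quoted bytes only): type byte,
            # then 1-byte space id, 1-byte page number, 1-byte offset, payload.

            def decode_write(record: bytes):
                assert record[0] >> 4 == 0x3, "not a WRITE record"
                length = record[0] & 0xF       # bytes following the type byte
                assert len(record) == 1 + length
                space, page, offset = record[1], record[2], record[3]
                return space, page, offset, record[4:]

            # 24-bit optimized record: only the 3 low bytes, starting at 0x2f.
            print(decode_write(bytes([0x36, 0x00, 0x00, 0x2F, 0x00, 0x0C, 0x00])))
            # Full 32-bit write starting at FSP_SIZE itself (offset 0x2e).
            print(decode_write(bytes([0x37, 0x00, 0x00, 0x2E, 0x00, 0x00, 0x0C, 0x00])))
            ```

            Both payloads decode to 0xc00 = 3072 pages = 12MiB at a 4KiB page size; the difference is only whether the most significant byte of the 4-byte FSP_SIZE field is covered by the record.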
            marko Marko Mäkelä added a comment -

            Here is a script that extends the system tablespace with actual data, so that fil_system.sys_space->free_limit will grow from its minimum value. It works on a debug build of the server:

            #!/bin/sh
            set -eux
            rm -fr /dev/shm/data /dev/shm/rr
            mkdir /dev/shm/data
            echo "create database test;
            use test;
            create table t(a int,b char(255) not null default'')
            engine=innodb stats_persistent=0;
            set unique_checks=0,foreign_key_checks=0;
            set global innodb_limit_optimistic_insert_debug=2;
            insert into t(a) select * from seq_1_to_1000000;
            drop table t;"|
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4g --datadir /dev/shm/data --innodb-fast-shutdown=0 --innodb-file-per-table=0 --bootstrap
            _RR_TRACE_DIR=/dev/shm/rr rr record \
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4m --datadir /dev/shm/data --bootstrap < /dev/null
            

            I tested today’s revised version with that:

            23aadffa0ed8ffb037ebbb5f44ce8ea76c38fe6c

            2023-07-07 16:02:49 0 [Note] Truncating system tablespaces from 101376 to 3072 pages
            2023-07-07 16:02:49 0 [Note] System tablespace truncated successful
            

            Now the buffer pool usage has been optimized, and the mini-transaction only comprises the tablespace latch and a latch on page 0 of the system tablespace. It will write 125 bytes of redo log for the shrinking, as follows:

            record | explanation | notes
            0x36, 0x0, 0x0, 0x2f, 0x0, 0xc, 0x0 | WRITE 0:0, offset 0x2f: 0x00c00 | still wrong: all 4 bytes of FSP_SIZE must be written for the recv_size logic to work!
            0xb4, 0x1, 0x0, 0xc, 0x0 | WRITE, same page, 1 byte after FSP_SIZE, 3 bytes 0x00c00 | We could use MEMMOVE for this, supposedly copying free_limit from FSP_SIZE
            0xb4, 0x83, 0xba, 0x5, 0x16 | WRITE, same page, a bit further away, 2 bytes 0x0516 |
            0x34, 0x0, 0x0, 0x41, 0x58 | WRITE 0:0, offset 0x41, byte 0x58 | The last 2 records could be reordered, to save 2 bytes.
            0xb3, 0x83, 0xaf, 0x6e | WRITE, same page, 1 byte 0x6e |
            0x34, 0x0, 0x0, 0x41, 0x57 | WRITE 0:0 | rewriting byte 0x41 again! (Could this be the length of some list?)
            0xb3, 0x83, 0xaf, 0xc6 | WRITE, same page, 1 byte 0xc6 | Overwriting the 0x6e from the previous write!
            0x34, 0x0, 0x0, 0x41, 0x56 | | yet another WRITE of 0x41
            0xb6, 0x83, 0xac, 0x10, 0x0, 0x0, 0xf6 | | overwriting the 0xc6 from the previous write!
            0x34, 0x0, 0x0, 0x41, 0x55 | | yet another WRITE of 0x41
            0xb6, 0x7, 0x0, 0x0, 0x0, 0x4, 0x66 | WRITE, same page | not overwriting anything previously written
            0xb8, 0x83, 0x9e, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | WRITE, same page | resetting the 4 bytes to FIL_NULL; could be shorter with a MEMSET
            0x35, 0x0, 0x0, 0x40, 0x0, 0x8 | | yet another WRITE of 0x41 (and 0x40)
            0xb5, 0x12, 0x0, 0x0, 0x1, 0xa6 | WRITE, same page | looks OK
            0xb8, 0x80, 0xce, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | WRITE, same page | OK, but could use MEMSET
            0x34, 0x0, 0x0, 0x51, 0x1 | WRITE of 0x51 |
            0xc4, 0x84, 0xec, 0x58, 0x0 | MEMSET, same page, 0x58 bytes to 0x00 |
            0x46, 0x0, 0x0, 0x84, 0xe6, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes!
            0x46, 0x0, 0x0, 0x84, 0x8e, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes!
            0x46, 0x0, 0x0, 0x84, 0x36, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes!
            0x24, 0x0, 0x8b, 0x80, 0xa | EXTENDED, page 0xc00, TRIM_PAGES | OK

            If we are ever going to modify at most 2 extent pages, then we could keep copies of both those pages and emit the difference, like we do in btr_page_reorganize_low() for B-tree pages.
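
            The "keep copies and emit the difference" idea can be sketched generically: compare the old and new page images and emit (offset, bytes) runs for the changed regions only, instead of one record per field write. This is illustrative Python, not the InnoDB implementation:

            ```python
            # Generic page-image diff: yield (offset, replacement_bytes) for
            # every maximal run of changed bytes between two equal-length images.
            # Each run would become a single WRITE-style record.

            def diff_runs(old: bytes, new: bytes):
                assert len(old) == len(new)
                i, n = 0, len(old)
                while i < n:
                    if old[i] == new[i]:
                        i += 1
                        continue
                    start = i
                    while i < n and old[i] != new[i]:
                        i += 1
                    yield start, new[start:i]

            old_page = bytes([0, 1, 2, 3, 4, 5, 6, 7])
            new_page = bytes([0, 9, 9, 3, 4, 5, 8, 7])
            print(list(diff_runs(old_page, new_page)))  # [(1, b'\t\t'), (6, b'\x08')]
            ```

            Applied to the table above, this would collapse the repeated rewrites of the same bytes (such as offset 0x41) into one final run per region.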

            I will conduct another test where the system tablespace will be shrunk to something above 0x1000 or 0x2000 pages, so that two extent descriptor pages 0 and 0x1000 or 0x2000 will be modified.

            marko Marko Mäkelä added a comment -

            Another test:

            #!/bin/sh
            set -eux
            rm -fr /dev/shm/data /dev/shm/rr
            mkdir /dev/shm/data
            echo "create database test;
            use test;
            create table t(a int,b char(255) not null default'')
            engine=innodb stats_persistent=0;
            set unique_checks=0,foreign_key_checks=0;
            set global innodb_limit_optimistic_insert_debug=2;
            insert into t(a) select * from seq_1_to_100000;
            create table u like t;
            insert into u(a) select * from seq_1_to_1000000;
            drop table u;"|
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4g --datadir /dev/shm/data --innodb-fast-shutdown=0 --innodb-file-per-table=0 --bootstrap
            _RR_TRACE_DIR=/dev/shm/rr rr record \
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4m --datadir /dev/shm/data --bootstrap < /dev/null
            

            2023-07-07 16:51:57 0 [Note] Truncating system tablespaces from 117760 to 10240 pages
            2023-07-07 16:51:58 0 [Note] System tablespace truncated successful
            

            When debugging the rr replay, set watch fil_system.sys_space.free_limit to see the mtr contents. In this case, we expect at least pages 0 and 0x2000 to be modified. Somewhat unexpectedly, the m_memo also contains a latch for page 0x1000, which was not modified. Had the allocation patterns been different, this page might have been modified as well. There are now 209 bytes of mtr.m_log produced.

            I am interested in these, because I want to make sure that we will not run out of buffer pool when shrinking a very large system tablespace to a very small size, and that we do not exceed the maximum size of a mini-transaction, which I think is something like 4 megabytes. Let me decode the log:

            record explanation notes
            0x36, 0x0, 0x0, 0x2f, 0x0, 0x28, 0x0 updating FSP_SIZE to 0x2800 (10240) must start at 0x2e
            0xb4, 0x1, 0x0, 0x28, 0x0 probably updating free_limit this and the preceding could use a single MEMSET record, copying the same 4 bytes twice
            0xb4, 0xd, 0x0, 0x20, 0x0 WRITE, same page  
            0x3b, 0x0, 0x9f, 0x80, 0x82, 0xde, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 write, page 0:0x2000, offset 0x35e could use MEMSET, and better logged after logging all changes to page 0:0
            0x34, 0x0, 0x0, 0x41, 0x59 WRITE, page 0:0  
            0xb2, 0x5, 0xb6 WRITE, same page  
            0x34, 0x0, 0x0, 0x41, 0x58 WRITE, page 0:0 overwriting byte 0x41
            0xb3, 0x4, 0x4, 0xe WRITE, same page overwriting 0x__b6 with 0x040e
            0x34, 0x0, 0x0, 0x41, 0x57 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0x66 WRITE, same page overwriting 0x040e with 0x0466
            0x34, 0x0, 0x0, 0x41, 0x56 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0xbe WRITE, same page overwriting 0x0466 with 0x04be
            0x34, 0x0, 0x0, 0x41, 0x55 WRITE, page 0:0 overwriting byte 0x41
            0xb3, 0x4, 0x5, 0x16 WRITE, same page overwriting 0x04be with 0x0516
            0x34, 0x0, 0x0, 0x41, 0x54 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0x6e WRITE, same page overwriting 0x0516 with 0x056e
            0x34, 0x0, 0x0, 0x41, 0x53 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0xc6 WRITE, same page overwriting 0x056e with 0x05c6
            0x34, 0x0, 0x0, 0x41, 0x52 WRITE, page 0:0 overwriting byte 0x41
            0xb5, 0x2, 0x30, 0x0, 0x0, 0xf6 WRITE, same page overwriting 0x____05c6 with 0x3000f6
            0x34, 0x0, 0x0, 0x41, 0x51 WRITE, page 0:0 overwriting byte 0x41
            0xc3, 0x0, 0x4, 0xff MEMSET, same page, 4 bytes of 0xff this is how FIL_NULL should be written
            0xc3, 0x2, 0x4, 0xff MEMSET, 2 bytes after the previous one also writing FIL_NULL
            0x35, 0x0, 0x0, 0x40, 0x0, 0x0 write, page 0:0 overwriting bytes 0x40 and 0x41
            0x39, 0x0, 0x9f, 0x80, 0x80, 0x26, 0x0, 0x0, 0x1, 0xa6 WRITE, page 0:0x2000, offset 0xa6, 0x000001a6 also the writes to page 0:0x2000 should be sorted by offset
            0x37, 0x0, 0x0, 0x81, 0x27, 0x0, 0x20, 0x0 WRITE, page 0:0, offset 0x1a7
            0x34, 0x0, 0x0, 0x51, 0x3 WRITE, offset 0x51 2 bytes wasted again due to misordered offset
            0x47, 0x0, 0x9f, 0x80, 0x85, 0x3e, 0x58, 0x0 MEMSET, page 0:0x2000, offset 0x5be, 0x58 bytes 0x00  
            0x47, 0x0, 0x9f, 0x80, 0x84, 0xe6, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x84, 0x8e, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x84, 0x36, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x83, 0xde, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x83, 0x86, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x83, 0x2e, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x82, 0xd6, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x24, 0x0, 0xa7, 0x80, 0xa EXTENDED, page 0x2800, TRIM_PAGES OK
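            As an aside, the first byte of each record above can be classified mechanically. The following shell sketch is only my illustration: the assumed field layout (bit 0x80 = "same page as the previous record", bits 0x70 = record type, low nibble = remaining record length) is inferred from the annotations in the tables above, not quoted from the server sources.

```shell
# Sketch: split the first byte of a log record into the fields implied
# by the annotations above. Assumed layout: bit 0x80 = same page as the
# previous record; bits 0x70 = type (0x20 EXTENDED, 0x30 WRITE,
# 0x40 MEMSET, 0x50 MEMMOVE); low 4 bits = remaining record length.
decode() {
  b=$(( $1 ))
  same=$(( (b >> 7) & 1 ))
  t=$(( b & 0x70 ))
  name=OTHER
  if [ $t -eq $(( 0x20 )) ]; then name=EXTENDED
  elif [ $t -eq $(( 0x30 )) ]; then name=WRITE
  elif [ $t -eq $(( 0x40 )) ]; then name=MEMSET
  elif [ $t -eq $(( 0x50 )) ]; then name=MEMMOVE
  fi
  echo "$name same_page=$same len=$(( b & 0xf ))"
}

decode 0x36   # WRITE same_page=0 len=6
decode 0xb4   # WRITE same_page=1 len=4
decode 0xc3   # MEMSET same_page=1 len=3
decode 0x24   # EXTENDED same_page=0 len=4
```

            The decoded values agree with the per-record annotations in the tables, e.g. 0x36 introducing a 7-byte WRITE record and 0xb4 a same-page WRITE.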

            This is not too bad, but I would like to improve this to have a known worst case. If I understood correctly, the worst case should be as follows:

            number of pages modified 2 (page 0 and the last surviving extent descriptor)
            number of additional pages accessed at most 1 at a time (one extent descriptor page)
            amount of redo log written 2*innodb_page_size (to rewrite both extent descriptor pages)

            I don’t think that there currently is any clear limit on the amount of redo log written. The easiest approach would be to keep copies of the 1 or 2 modified pages and optimize the log writes by comparing the pages. Why 2 pages? We assume that no extent descriptor page before the last surviving one can contain pointers to extent descriptor pages that would be discarded by shrinking the tablespace. I hope that thiru can add some debug assertions to prove this, or point to existing debug assertions that check it.

            One more thing that we discussed is how to avoid truncation and expansion on every startup. I would not want to add any start-up parameter for that. I think that a good rule would be: if the size of the system tablespace can be shrunk by more than 1/8 (12.5%) during startup, then it will be shrunk. Currently this check is missing: we always try to shrink, no matter what. It could be useful to add some DBUG_EXECUTE_IF to enable the "always shrink" logic for testing.
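            A minimal sketch of that 1/8 rule in shell arithmetic, using the page counts from the test above (the threshold itself is only a proposal at this point, not implemented behaviour):

```shell
# Proposed heuristic (sketch): shrink only when more than 1/8 of the
# current size would be reclaimed. Sizes are in pages, as in the message
# "Truncating system tablespaces from 117760 to 10240 pages".
current=117760
target=10240
if [ $(( (current - target) * 8 )) -gt "$current" ]; then
  echo "shrink: reclaiming $(( current - target )) pages"
else
  echo "keep current size of $current pages"
fi
```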

            marko Marko Mäkelä added a comment -

            Even though the extent descriptor page 0x1000 was not modified in my above test, thiru says that if the page allocation patterns had been different, this could be needed.

            The actual worst-case limit of the number of pages latched (and possibly modified) is truncated_size_in_pages/innodb_page_size.

            The worst case would be shrinking an innodb_page_size=4k tablespace so that the last possible descriptor page 0xfffff000 is removed. All 1,048,575 descriptor pages (one every 0x1000 pages) from 0x00000000 to 0xffffe000 would have to be accessed. If we held latches on all these pages at the same time, we would need 4GiB of buffer pool, which is more than 1,000 times the minimum innodb_buffer_pool_size. So, we should avoid doing that.
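            The arithmetic behind those figures can be checked quickly (all numbers are taken from the paragraph above):

```shell
# One descriptor page every 0x1000 pages of 4 KiB each; the last
# surviving one is at page 0xffffe000, the removed one at 0xfffff000.
pages=$(( 0xfffff000 / 0x1000 ))   # descriptor pages from 0 to 0xffffe000
bytes=$(( pages * 4096 ))          # buffer pool needed to latch them all
echo "$pages descriptor pages, $bytes bytes of buffer pool"
```

            That is 1,048,575 descriptor pages and just under 4 GiB of buffer pool, matching the estimate above.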

            The typical amount of redo log written for one page should be much smaller than the page size, hopefully something like 50 bytes. If the maximum mini-transaction size is 4MiB, that should be enough for modifying some 80,000 extent descriptor pages.

            I do not think that the worst case would involve modifying all 1,048,575 descriptor pages. There is only a limited number of list nodes in a descriptor page, let’s call it N. So, at most N pages can have pointers to the last descriptor page and may have to be modified. If we shrink the tablespace so that the M last descriptor pages ending at 0xfffff000 are modified, then we may have to modify N*M surviving descriptor pages. What is the maximum possible M*N?

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä added a comment - - edited

            Because there are 256 extent descriptor list elements (XDES_FLST_NODE) per extent descriptor page, shrinking the system tablespace by 1 extent will involve modifying at most 256 other extent descriptor pages. Shrinking by 2 extents may require modifying up to 2*256=512 extent descriptor pages.

            Close to the worst case should be to shrink a tablespace with innodb_page_size=4k from the maximum number of extents (1,048,576) by something like 3,584 extents to 1,044,992 extents. If everything is laid out in the pessimal way, this would involve modifying 3584*256=917504 extent descriptor pages, requiring 3.5 GiB of buffer pool. If the maximum mini-transaction log size is 2MiB (before MDEV-14425, it might be even less), we would only have some 2.28 bytes per page available for logging the modifications to each page. That is clearly not enough, since the minimum size of the record header is 3 bytes (4 bytes for logging changes to any other extent descriptor page than the first one).
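            Those worst-case numbers check out with shell arithmetic (all figures from the paragraph above):

```shell
pages=$(( 3584 * 256 ))        # surviving descriptor pages to modify
pool=$(( pages * 4096 ))       # buffer pool needed, in bytes (3.5 GiB)
budget=$(( 2 * 1024 * 1024 ))  # 2 MiB mini-transaction log budget
# Integer division; the exact quotient is about 2.28 bytes per page.
echo "$pages pages, $pool bytes of pool, $(( budget / pages )) byte(s)/page"
```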

            We clearly must implement some logic to determine whether the shrinking operation is possible at all (whether we have enough buffer pool and log space available). It can succeed only if the system tablespace is shrunk by a small enough amount, or to a small enough final size. We must validate everything first, before committing to shrink the tablespace.

            Furthermore, to avoid any bugs or extra overhead at startup related to this, it would be good to make this feature opt-in, that is, only attempt shrinking the system tablespace if innodb_data_file_path contains an extra attribute :autoshrink to enable the attempt.


            I think that implementing the following will address the biggest concerns. I verified that the first two items on my list were done today:

            • Write all 32 bits of FSP_SIZE at 0x2e.
            • Make this opt-in by requiring an :autoshrink attribute to be added to innodb_data_file_path.
            • Check that we will not run out of buffer pool.
            • Check that we will not write too much redo log: the maximum mini-transaction size in the MDEV-14425 format should be 2MiB, including the 5 bytes of terminator and checksum.
            • If either limit is exceeded, revert any changes to the buffer pool and discard the mini-transaction log.
            • For test purposes, introduce some debug parameter or DBUG_EXECUTE_IF to make the maximum mini-transaction size smaller, say, 100 to 1000 bytes.
            • Add debug code to invoke flst_validate() on all the extent descriptor lists in the system tablespace, before and after attempting truncation.

            Optimizing the redo log size is optional and might be done later, if there is real customer demand for it. I would anticipate other demands for follow-up improvements first, such as these:

            • Defragmenting or optimizing the system tables, to ensure that all data will be stored near the start of the system tablespace.
            • Repairing a system tablespace to minimal size by copying the actual needed contents, excluding things like the SYS_TABLESPACES or SYS_DATAFILES tables that existed between MariaDB Server 10.0 and 10.5 until MDEV-22343 removed them.
            marko Marko Mäkelä added a comment -
            mleich Matthias Leich added a comment - - edited

            origin/bb-11.2-MDEV-14795_1 1779cb1af979cad1abe9d5c3ed4d37b6503175aa 2023-07-11T20:25:49+05:30 + some patch (build with debug+asan+optimization Og)
            behaved relatively well in RQG testing. I observed many bad effects, but they occurred/occur in preview-11.2,
            11.2 b8c039fad16adf9cbc6dac2e34cdeb772b9065a0 2023-07-07T09:05:33+10:00,
            and sometimes in other trees too.
             
            The main testing campaign consisted of ~9900 tests which all used the server
            startup option "--mysqld=--innodb_data_file_path=ibdata1:1M:autoextend:autoshrink".
            Statistics from the remaining test results:
            smallest system tablespace ever shrunk: 1024 pages
            biggest system tablespace ever shrunk: 41152 pages
            biggest shrink operation: ~16000 pages truncated
            smallest shrink operation: ~250 pages truncated
            
            


            Complete patch is in bb-11.2-MDEV-14795_1

            thiru Thirunarayanan Balathandayuthapani added a comment -
            mleich Matthias Leich added a comment - - edited

            origin/bb-11.2-MDEV-14795_1 6575d969aef4d06fb529f467f6c9d3ca60b0bb53 2023-07-17T10:43:15+05:30 (build type RelWithDebInfo)
            performed well in RQG testing. No new problems.

            Statistics from the remaining test results:
            smallest system tablespace ever shrunk: 1024 pages
            biggest system tablespace ever shrunk: 16640 pages
            biggest shrink operation: ~16000 pages truncated
            smallest shrink operation: ~832 pages truncated
            I assume that the difference from the statistics of the previous run
            is caused by the different source tree.
            A build with debug+asan+optimization Og showed numbers similar to the
            RelWithDebInfo build.


            I posted some review comments, mostly about the formatting of messages, and code that is not covered by mtr tests. I think that most code should be covered if you create a debug-instrumented big_test that will use innodb_page_size=4k and extend the system tablespace to slightly more than 16 MiB (4096*4096 bytes). With innodb_limit_optimistic_insert_debug=2 (like my previously posted script does) it should be relatively fast. I think that the mandatory MSAN builder should run such a test.

            Tomorrow, I will review the logic to "roll back" a mini-transaction when the size constraints are exceeded.

            marko Marko Mäkelä added a comment -

            The general logic looks good to me. I think that this needs one more review round after the code coverage in the regression test suite has been improved.

            marko Marko Mäkelä added a comment -

            The code coverage looks acceptable to me now. I made some minor suggestions on improving the out-of-memory detection. This should be OK to push after addressing the last round of my comments.

            marko Marko Mäkelä added a comment -

            origin/bb-11.2-MDEV-14795 f9003c73a15d892423845a2df96c8bd73bbe96ca 2023-08-01T19:43:04+05:30
            performed well in RQG testing. The failures observed occur on other trees too, are in JIRA or are not caused by MDEV-14795.
            

            mleich Matthias Leich added a comment -

            Together with MDEV-19229, this allows the storage space consumed by the InnoDB system tablespace to be reclaimed.

            The defragmentation that is mentioned in the Description never shrunk any files; it only tried to move some B-tree pages within a data file in an attempt to improve performance on HDDs. That code was removed in MDEV-30545.

            marko Marko Mäkelä added a comment -
            psumner Phil Sumner added a comment -

            Does this change mean that there will be a delay in startup for particularly large ibdata1 files?

            i.e. I've got a Galera cluster where ibdata1 has over time grown to > 300GB, and it definitely doesn't need to be this big with `file_per_table` having been set since database creation.


            psumner, shrinking the system tablespace involves:

            1. traversing every allocation bitmap page (pages 0, innodb_page_size, 2*innodb_page_size…) within the old size,
            2. adjusting the bitmap pages within the shrunk size,
            3. durably writing log for the change, and
            4. shrinking the last system tablespace file by ftruncate()

            When using the default innodb_page_size=16k, there will be an allocation bitmap page every 16384*16384 bytes (256 MiB, or a quarter gigabyte). Shrinking a 300 GiB file would involve traversing 1,200 bitmap pages. In an extreme case, the file could be shrunk to 256 MiB or less, that is, we would write log covering just the first page of the file. In such an extreme case, the time should be dominated by the file system operation. Truncating or deleting a heavily fragmented file can take considerable time; how much depends on the type of storage, the file system, and the operating system kernel version.
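            A quick shell check of those figures (page size and file size as in the paragraph above; requires a shell with 64-bit arithmetic):

```shell
# One allocation bitmap page covers 16384 pages of 16384 bytes each.
region=$(( 16384 * 16384 ))            # bytes covered per bitmap page (256 MiB)
file=$(( 300 * 1024 * 1024 * 1024 ))   # a 300 GiB ibdata1
echo "one bitmap page per $region bytes; $(( file / region )) pages to traverse"
```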

            By popular demand, I filed MDEV-32452 so that this cleanup can be executed on shutdown rather than startup.

            marko Marko Mäkelä added a comment -

            People

              thiru Thirunarayanan Balathandayuthapani
              martina342 martina342
              Votes: 2
              Watchers: 15

