MariaDB Server
MDEV-14795

InnoDB system tablespace cannot be shrunk

Details

    Description

      Hello,
      I used to have all tables in ibdata1, then I switched to one file per table, but the ibdata1 file kept its size.
      I recently switched to MariaDB 10.1 and hoped that the defragmenting function would help to shrink the file, but it didn't work.
      I optimized all tables, but the ibdata1 file did not change at all; it is still 197 GB.

      MariaDB [(none)]> show status like '%defragment%';
      +----------------------------------------+--------+
      | Variable_name                          | Value  |
      +----------------------------------------+--------+
      | Innodb_defragment_compression_failures | 0      |
      | Innodb_defragment_failures             | 32872  |
      | Innodb_defragment_count                | 113744 |
      +----------------------------------------+--------+
      3 rows in set (0.01 sec)
      

      MariaDB [(none)]> show variables like '%defragment%';
      +--------------------------------------+----------+
      | Variable_name                        | Value    |
      +--------------------------------------+----------+
      | innodb_defragment                    | ON       |
      | innodb_defragment_fill_factor        | 0.900000 |
      | innodb_defragment_fill_factor_n_recs | 20       |
      | innodb_defragment_frequency          | 40       |
      | innodb_defragment_n_pages            | 7        |
      | innodb_defragment_stats_accuracy     | 0        |
      +--------------------------------------+----------+
      

      Activity

            I don't think it has ever been claimed that the ibdata1 would shrink as the result of defragmentation; but I find that our current documentation is not at all straightforward in regard to this question, so I'm assigning it to jplindst to clarify, especially in the KB.

            elenst Elena Stepanova added a comment -

            greenman Can you clarify in the documentation that tablespace files (including ibdata1) will not shrink as a result of defragmentation?

            jplindst Jan Lindström (Inactive) added a comment -
            martina342 martina342 added a comment -

            Thanks for the reply. Is the data actually moved to the beginning of the ibdata file? Is it difficult to add the possibility to shrink the ibdata1 file?

            greenman Ian Gilfillan added a comment -

            Documentation now specifically mentions that the file size does not shrink. martina342, the only way to shrink the existing file is by dumping and reloading. The article at https://www.thegeekstuff.com/2016/02/mysql-innodb-file-per-table/ seems to describe your scenario of moving from innodb_file_per_table=0 to innodb_file_per_table=1.


            I think that it is a reasonable request to shrink the system tablespace after all tables have been moved out of it (innodb_file_per_table=1), dedicated undo tablespaces have been created (MDEV-19229) and the change buffer has been disabled (MDEV-27734).

            marko Marko Mäkelä added a comment -

            We have to be careful about this in order to avoid introducing the equivalent of MDEV-29999. We may have to track changes not only to FSP_SIZE but also to FSP_FREE_LIMIT or something else that reliably tells the last actually used page offset within the file.

            marko Marko Mäkelä added a comment -

            Because some code that is shared with undo tablespace truncation is being modified here, I think that we need to pay special attention to test coverage.

            marko Marko Mäkelä added a comment -
            mleich Matthias Leich added a comment - - edited

            Preliminary results of RQG testing on origin/bb-10.11-MDEV-14795 5751dbd168147b1294d0dbde018adcab0fd02cb4 2023-06-08T15:31:09+05:30
             
            1. make load, SIGKILL the server, restart fails
            [rr 1719523 12457]2023-06-08 04:51:49 0x79d72909b640[rr 1719523 12460]  InnoDB: Assertion failure in file /data/Server/bb-10.11-MDEV-14795C/storage/innobase/buf/buf0dblwr.cc line 618
            [rr 1719523 12462]InnoDB: Failing assertion: fil_system.sys_space->acquire()
            sdp:/data1/results/1686220665/TBR-1933$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
             
            2. make load, SIGKILL the server, restart passes, some tables get checked and the server crashes
            [rr 3469512 6800]mysqld: /data/Server/bb-10.11-MDEV-14795C/storage/innobase/include/fil0fil.h:653: bool fil_space_t::set_needs_flush(): Assertion `n & PENDING' failed.
            sdp:/data1/results/1686220665/TBR-1934$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
             
             
            There were more hits of other asserts around crash recovery,
            but some of them have no rr trace, or considerably bigger rr traces.
             
            3. make load, SIGKILL the server, restart, check certain tables with success, shutdown which ends with server crash
             [rr 2614894 79103]2023-06-08  5:21:23 0 [Note] InnoDB: Starting shutdown...
             [rr 2614894 79122]2023-06-08  5:21:23 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/rqg/1686220665/93/1/data/ib_buffer_pool
             [rr 2614894 79132]2023-06-08  5:21:23 0 [Note] InnoDB: Restricted to 185 pages due to innodb_buf_pool_dump_pct=25
             [rr 2614894 79139]2023-06-08  5:21:23 0 [ERROR] InnoDB: Summed data size 1859, returned by func 30316
             [rr 2614894 79160]2023-06-08  5:21:23 0 [Note] InnoDB: Buffer pool(s) dump completed at 230608  5:21:23
             [rr 2614894 79169]2023-06-08  5:21:23 0 [ERROR] InnoDB: Apparent corruption in space 0 page 1460 of index `IBUF_DUMMY` of table `IBUF_DUMMY`
             [rr 2614894 79173]2023-06-08  5:21:23 0 [ERROR] InnoDB: Summed data size 1859, returned by func 30316
             [rr 2614894 79177]2023-06-08  5:21:23 0 [ERROR] InnoDB: Apparent corruption in space 0 page 1460 of index `IBUF_DUMMY` of table `IBUF_DUMMY`
             [rr 2614894 79182]mysqld: /data/Server/bb-10.11-MDEV-14795C/storage/innobase/ibuf/ibuf0ibuf.cc:4327: dberr_t ibuf_merge_or_delete_for_page(buf_block_t*, page_id_t, ulint): Assertion `page_validate(block->page.frame, dummy_index)' failed.
              sdp:/data1/results/1686220665/TBR-1932$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
            
            


            Results of testing on origin/bb-10.11-MDEV-14795 873798f9c3052cdeea253f5bb398ba835346a63c 2023-06-12T16:07:47+05:30
            1. Some of the problems mentioned above were fixed.
            2. Other problems which showed up during testing turned out to be on
               - origin/10.11 582d0cf5b062a9aa30f555f7f6c4f6e775bca408 2023-06-10T11:14:15+03:00
               - origin/preview-11.2-preview 79f9e94195dc8a007debd27ec4e8d2c15f094a97 2023-06-13T17:23:05+10:00
               too. Some of them have yet to be reported.
            

            mleich Matthias Leich added a comment -
            mleich Matthias Leich added a comment - - edited

            Preliminary result of testing on origin/bb-10.11-MDEV-14795 e9e34b7b6c1409de26a84bd6bdbca45257b6263b 2023-06-16T18:28:00+05:30
            The failure MENT-189
                mariabackup --backup fails with
                InnoDB: Assertion failure in file /data/Server/bb-10.11-MDEV-14795D/extra/mariabackup/xtrabackup.cc line 992
                InnoDB: Failing assertion: opt_no_lock
            shows up again.
            sdp:/data1/results/1686921174/MENT-189_Rebirth$ _RR_TRACE_DIR=./1_clone/rr rr replay --mark-stdio
            The reason is currently unknown.
            origin/preview-11.2-preview 393bad0621ace7242f55e2cc5a6fb89ea9dfb8c6 2023-06-16T18:12:23+07:00 shows the same problem.
            


            MDEV-29694 in 11.0 removed the InnoDB change buffer. If this change were ever applied to anything that supports the change buffer, then we should also think about deleting the change buffer records and clearing change buffer bitmap bits for the pages that will be trimmed off the end of the system tablespace.

            marko Marko Mäkelä added a comment -

            I happened to notice a related-looking test failure in an IA-32 environment:

            innodb.log_data_file_size '64k,innodb'   w1 [ fail ]
            2023-06-29  3:40:49 0 [Note] InnoDB: Truncating system tablespaces from 1344 to 320 pages
            2023-06-29 03:40:49 0xa96f9b40  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-11.2.0/storage/innobase/fil/fil0fil.cc line 2988
            InnoDB: Failing assertion: space->size == check.size
            

            This might be fixed by acquiring fil_system.mutex earlier in the resizing code.
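
            The assertion above compares two size snapshots that are taken in separate steps. A generic sketch (plain Python, with made-up names; not the InnoDB code) of why holding one mutex across both the resize and the consistency check gives an atomic snapshot, whereas taking it only around each step could observe a half-done resize:

            ```python
            # Generic check-then-act sketch: `size` is the cached tablespace size
            # and `file_pages` stands in for what a stat() of the file reports.
            # Both the truncation and the consistency check take the same mutex,
            # so the check can never see one value updated but not the other.
            import threading

            class Space:
                def __init__(self, size):
                    self.mutex = threading.Lock()
                    self.size = size        # cached size in pages
                    self.file_pages = size  # on-disk size in pages

                def truncate(self, new_size):
                    with self.mutex:        # update both views atomically
                        self.size = new_size
                        self.file_pages = new_size

                def check(self):
                    with self.mutex:        # snapshot both views atomically
                        return self.size == self.file_pages

            space = Space(1344)
            t = threading.Thread(target=space.truncate, args=(320,))
            t.start()
            ok = all(space.check() for _ in range(10000))
            t.join()
            print(ok)  # True: the check never observes a torn resize
            ```

            With the mutex taken only around each individual read, a concurrent truncation could land between reading the two values and make the equality check fail, which is the race pattern the earlier mutex acquisition would close.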

            marko Marko Mäkelä added a comment -

            To better understand the maximum number of pages accessed and log records written by the shrinking operation, I came up with the following test:

            rm -fr /dev/shm/data /dev/shm/rr && mkdir /dev/shm/data &&
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4m --innodb-buffer-pool-size=4m --datadir /dev/shm/data --innodb-data-file-path=ibdata1:56G --bootstrap < /dev/null &&
            _RR_TRACE_DIR=/dev/shm/rr rr record \
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4m --innodb-buffer-pool-size=4m --datadir /dev/shm/data --innodb-data-file-path=ibdata1:12M:autoextend --bootstrap < /dev/null
            rr replay /dev/shm/rr/latest-trace
            

            There will be extent descriptor pages every innodb_page_size pages. There can be at most 2³² pages, and with the minimum innodb_page_size=4k we have the maximum number of extent descriptor pages: 2^(32-12)=2²⁰=1,048,576 pages. If these are the only pages that need to be accessed during the shrinking, then for any innodb_page_size, the worst case buffer pool consumption of the mini-transaction should be innodb_buffer_pool_size=4G, when shrinking from the maximum size (16 to 256 TiB depending on the page size) to the minimum size.
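
            The arithmetic above can be checked with a few lines (assuming, as stated, one extent descriptor page every innodb_page_size pages and a 32-bit page number limit):

            ```python
            # Back-of-the-envelope check of the XDES page math from the comment
            # above: at most 2^32 pages per tablespace, and one extent descriptor
            # page every innodb_page_size pages.

            def max_xdes_pages(page_size: int) -> int:
                """Extent descriptor pages in a maximally sized tablespace."""
                max_pages = 2 ** 32            # pages use a 32-bit page number
                return max_pages // page_size  # one XDES page per page_size pages

            n = max_xdes_pages(4096)           # minimum innodb_page_size=4k
            print(n)                           # 2^(32-12) = 1048576 XDES pages
            print(n * 4096)                    # worst case: 4 GiB of buffer pool
            ```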

            Based on the output of the above test, it looks like no log writing whatsoever is currently taking place, so the shrinking may not be crash-safe at the moment?

            2023-07-04 10:25:30 0 [Note] InnoDB: Initializing buffer pool, total size = 4.000MiB, chunk size = 1.000MiB
            2023-07-04 10:25:30 0 [Note] InnoDB: Completed initialization of buffer pool
            2023-07-04 10:25:30 0 [Note] InnoDB: End of log at LSN=47046
            2023-07-04 10:25:30 0 [Note] InnoDB: 128 rollback segments are active.
            2023-07-04 10:25:30 0 [Note] Truncating system tablespaces from 14680064 to 3072 pages
             
            2023-07-04 10:25:36 0 [Note] System tablespace truncated successful
            2023-07-04 10:25:36 0 [Note] InnoDB: Setting O_DIRECT on file ./ibtmp1 failed
            2023-07-04 10:25:36 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
            2023-07-04 10:25:36 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
            2023-07-04 10:25:36 0 [Note] InnoDB: log sequence number 47046 (memory-mapped); transaction id 4
            

            I think that we need some additional testing along these lines. For the record, the above truncation only accessed one page of the system tablespace:

            Thread 1 hit Breakpoint 2, sql_print_information (format=format@entry=0x560400f53238 "Truncating system tablespaces from %d to %d pages\n") at /mariadb/11/sql/log.cc:9447
            9447	{
            (rr) continue
            Continuing.
            2023-07-04 10:25:30 0 [Note] Truncating system tablespaces from 14680064 to 3072 pages
             
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517b74) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517ae0) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=4, guess=0x0, mode=16, mtr=0x7ffdae517b80, err=0x7ffdae517ae0) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 1, buf_page_get_low (page_id={m_id = 0}, zip_size=0, rw_latch=2, guess=0x0, mode=10, mtr=0x7ffdae517b80, err=0x7ffdae517b78) at /mariadb/11/storage/innobase/buf/buf0buf.cc:2270
            2270	{
             
            Thread 1 hit Breakpoint 2, sql_print_information (format=format@entry=0x560400f60720 "System tablespace truncated successful") at /mariadb/11/sql/log.cc:9447
            9447	{
            

            marko Marko Mäkelä added a comment -

            Actually, the above test did write 12 bytes of log record payload (growing the LSN by 12+5 bytes), consisting of 2 log records:

            bytes | explanation
            0x24, 0x00, 0x8b, 0x80, 0x0a | EXTENDED, space 0, page 0xc00, TRIM_PAGES (shrinking the system tablespace to 0xc00*0x1000 bytes = 12MiB)
            0x36, 0x00, 0x00, 0x2f, 0x00, 0x0c, 0x00 | WRITE, space 0, page 0, offset 0x2f, bytes 0x000c00, updating the 24 least significant bits of FSP_SIZE

            In fact, I think that the log record size optimization for updating FSP_SIZE must be disabled, so that the file size change may be crash-safe. That is, the second log record should have been one byte longer:

            0x37, 0x00, 0x00, 0x2e, 0x00, 0x00, 0x0c, 0x00 | WRITE, space 0, page 0, offset 0x2e (FSP_SIZE), 32 bits
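
            A toy decoder, reconstructed only from the records quoted above (high nibble 0x3 = WRITE, low nibble = number of bytes that follow, then space id, page number, offset, payload), can illustrate the 24-bit vs 32-bit difference. This is not the full mlog encoding — real records use variable-length compressed integers — just enough for these small examples:

            ```python
            # Toy decoder for the standalone WRITE records shown above.
            # Assumed layout (inferred from the quoted bytes only): type byte,
            # then 1-byte space id, 1-byte page number, 1-byte offset, payload.

            def decode_write(record: bytes):
                assert record[0] >> 4 == 0x3, "not a WRITE record"
                length = record[0] & 0xF       # bytes following the type byte
                assert len(record) == 1 + length
                space, page, offset = record[1], record[2], record[3]
                return space, page, offset, record[4:]

            # 24-bit optimized record: only the 3 low bytes, starting at 0x2f.
            print(decode_write(bytes([0x36, 0x00, 0x00, 0x2F, 0x00, 0x0C, 0x00])))
            # Full 32-bit write starting at FSP_SIZE itself (offset 0x2e).
            print(decode_write(bytes([0x37, 0x00, 0x00, 0x2E, 0x00, 0x00, 0x0C, 0x00])))
            ```

            Both payloads decode to 0xc00 = 3072 pages = 12MiB at a 4KiB page size; the difference is only whether the most significant byte of the 4-byte FSP_SIZE field is covered by the record.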
            marko Marko Mäkelä added a comment -

            Here is a script that extends the system tablespace with actual data, so that fil_system.sys_space->free_limit will grow from its minimum value. It works on a debug build of the server:

            #!/bin/sh
            set -eux
            rm -fr /dev/shm/data /dev/shm/rr
            mkdir /dev/shm/data
            echo "create database test;
            use test;
            create table t(a int,b char(255) not null default'')
            engine=innodb stats_persistent=0;
            set unique_checks=0,foreign_key_checks=0;
            set global innodb_limit_optimistic_insert_debug=2;
            insert into t(a) select * from seq_1_to_1000000;
            drop table t;"|
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4g --datadir /dev/shm/data --innodb-fast-shutdown=0 --innodb-file-per-table=0 --bootstrap
            _RR_TRACE_DIR=/dev/shm/rr rr record \
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4m --datadir /dev/shm/data --bootstrap < /dev/null
            

            I tested today’s revised version with that:

            23aadffa0ed8ffb037ebbb5f44ce8ea76c38fe6c

            2023-07-07 16:02:49 0 [Note] Truncating system tablespaces from 101376 to 3072 pages
            2023-07-07 16:02:49 0 [Note] System tablespace truncated successful
            

            Now the buffer pool usage has been optimized, and the mini-transaction only comprises the tablespace latch and a latch on page 0 of the system tablespace. It will write 125 bytes of redo log for the shrinking, as follows:

            record | explanation | notes
            0x36, 0x0, 0x0, 0x2f, 0x0, 0xc, 0x0 | WRITE 0:0, offset 0x2f: 0x00c00 | still wrong: all 4 bytes of FSP_SIZE must be written for the recv_size logic to work!
            0xb4, 0x1, 0x0, 0xc, 0x0 | WRITE, same page, 1 byte after FSP_SIZE, 3 bytes 0x00c00 | We could use MEMMOVE for this, supposedly copying free_limit from FSP_SIZE
            0xb4, 0x83, 0xba, 0x5, 0x16 | WRITE, same page, a bit further away, 2 bytes 0x0516 |
            0x34, 0x0, 0x0, 0x41, 0x58 | WRITE 0:0, offset 0x41, byte 0x58 | The last 2 records could be reordered, to save 2 bytes.
            0xb3, 0x83, 0xaf, 0x6e | WRITE, same page, 1 byte 0x6e |
            0x34, 0x0, 0x0, 0x41, 0x57 | WRITE 0:0 | rewriting byte 0x41 again! (Could this be the length of some list?)
            0xb3, 0x83, 0xaf, 0xc6 | WRITE, same page, 1 byte 0xc6 | Overwriting the 0x6e from the previous write!
            0x34, 0x0, 0x0, 0x41, 0x56 | | yet another WRITE of 0x41
            0xb6, 0x83, 0xac, 0x10, 0x0, 0x0, 0xf6 | | overwriting the 0xc6 from the previous write!
            0x34, 0x0, 0x0, 0x41, 0x55 | | yet another WRITE of 0x41
            0xb6, 0x7, 0x0, 0x0, 0x0, 0x4, 0x66 | WRITE, same page | not overwriting anything previously written
            0xb8, 0x83, 0x9e, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | WRITE, same page | resetting the 4 bytes to FIL_NULL; could be shorter with a MEMSET
            0x35, 0x0, 0x0, 0x40, 0x0, 0x8 | | yet another WRITE of 0x41 (and 0x40)
            0xb5, 0x12, 0x0, 0x0, 0x1, 0xa6 | WRITE, same page | looks OK
            0xb8, 0x80, 0xce, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 | WRITE, same page | OK, but could use MEMSET
            0x34, 0x0, 0x0, 0x51, 0x1 | WRITE of 0x51 |
            0xc4, 0x84, 0xec, 0x58, 0x0 | MEMSET, same page, 0x58 bytes to 0x00 |
            0x46, 0x0, 0x0, 0x84, 0xe6, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes!
            0x46, 0x0, 0x0, 0x84, 0x8e, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes!
            0x46, 0x0, 0x0, 0x84, 0x36, 0x58, 0x0 | MEMSET, page 0:0, 0x58 bytes to 0x00 | an ordered offset would save 2 bytes!
            0x24, 0x0, 0x8b, 0x80, 0xa | EXTENDED, page 0xc00, TRIM_PAGES | OK

            If we are ever going to modify at most 2 extent pages, then we could keep copies of both those pages and emit the difference, like we do in btr_page_reorganize_low() for B-tree pages.
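
            The "keep copies and emit the difference" idea can be sketched generically: compare the old and new page images and emit (offset, bytes) runs for the changed regions only, instead of one record per field write. This is illustrative Python, not the InnoDB implementation:

            ```python
            # Generic page-image diff: yield (offset, replacement_bytes) for
            # every maximal run of changed bytes between two equal-length images.
            # Each run would become a single WRITE-style record.

            def diff_runs(old: bytes, new: bytes):
                assert len(old) == len(new)
                i, n = 0, len(old)
                while i < n:
                    if old[i] == new[i]:
                        i += 1
                        continue
                    start = i
                    while i < n and old[i] != new[i]:
                        i += 1
                    yield start, new[start:i]

            old_page = bytes([0, 1, 2, 3, 4, 5, 6, 7])
            new_page = bytes([0, 9, 9, 3, 4, 5, 8, 7])
            print(list(diff_runs(old_page, new_page)))  # [(1, b'\t\t'), (6, b'\x08')]
            ```

            Applied to the table above, this would collapse the repeated rewrites of the same bytes (such as offset 0x41) into one final run per region.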

            I will conduct another test where the system tablespace will be shrunk to something above 0x1000 or 0x2000 pages, so that two extent descriptor pages 0 and 0x1000 or 0x2000 will be modified.

            marko Marko Mäkelä added a comment -

            Another test:

            #!/bin/sh
            set -eux
            rm -fr /dev/shm/data /dev/shm/rr
            mkdir /dev/shm/data
            echo "create database test;
            use test;
            create table t(a int,b char(255) not null default'')
            engine=innodb stats_persistent=0;
            set unique_checks=0,foreign_key_checks=0;
            set global innodb_limit_optimistic_insert_debug=2;
            insert into t(a) select * from seq_1_to_100000;
            create table u like t;
            insert into u(a) select * from seq_1_to_1000000;
            drop table u;"|
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4g --datadir /dev/shm/data --innodb-fast-shutdown=0 --innodb-file-per-table=0 --bootstrap
            _RR_TRACE_DIR=/dev/shm/rr rr record \
            ./mariadbd --innodb-page-size=4k --innodb-log-file-size=4g --innodb-buffer-pool-size=4m --datadir /dev/shm/data --bootstrap < /dev/null
            

            2023-07-07 16:51:57 0 [Note] Truncating system tablespaces from 117760 to 10240 pages
            2023-07-07 16:51:58 0 [Note] System tablespace truncated successful
            

            When debugging the rr replay, set watch fil_system.sys_space.free_limit to see the mtr contents. In this case, we expect at least pages 0 and 0x2000 to be modified. Somewhat unexpectedly, the m_memo also contains a latch for page 0x1000, which was not modified. Had the allocation patterns been different, this page might have been modified as well. There are now 209 bytes of mtr.m_log produced.

            I am interested in these, because I want to make sure that we will not run out of buffer pool when shrinking a very large system tablespace to a very small size, and that we do not exceed the maximum size of a mini-transaction, which I think is something like 4 megabytes. Let me decode the log:

            record explanation notes
            0x36, 0x0, 0x0, 0x2f, 0x0, 0x28, 0x0 updating FSP_SIZE to 0x2800 (10240) must start at 0x2e
            0xb4, 0x1, 0x0, 0x28, 0x0 probably updating free_limit this and the preceding could use a single MEMSET record, copying the same 4 bytes twice
            0xb4, 0xd, 0x0, 0x20, 0x0 WRITE, same page  
            0x3b, 0x0, 0x9f, 0x80, 0x82, 0xde, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0 write, page 0:0x2000, offset 0x35e could use MEMSET, and better logged after logging all changes to page 0:0
            0x34, 0x0, 0x0, 0x41, 0x59 WRITE, page 0:0  
            0xb2, 0x5, 0xb6 WRITE, same page  
            0x34, 0x0, 0x0, 0x41, 0x58 WRITE, page 0:0 overwriting byte 0x41
            0xb3, 0x4, 0x4, 0xe WRITE, same page overwriting 0x__b6 with 0x040e
            0x34, 0x0, 0x0, 0x41, 0x57 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0x66 WRITE, same page overwriting 0x040e with 0x0466
            0x34, 0x0, 0x0, 0x41, 0x56 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0xbe WRITE, same page overwriting 0x0466 with 0x04be
            0x34, 0x0, 0x0, 0x41, 0x55 WRITE, page 0:0 overwriting byte 0x41
            0xb3, 0x4, 0x5, 0x16 WRITE, same page overwriting 0x04be with 0x0516
            0x34, 0x0, 0x0, 0x41, 0x54 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0x6e WRITE, same page overwriting 0x0516 with 0x056e
            0x34, 0x0, 0x0, 0x41, 0x53 WRITE, page 0:0 overwriting byte 0x41
            0xb2, 0x5, 0xc6 WRITE, same page overwriting 0x056e with 0x05c6
            0x34, 0x0, 0x0, 0x41, 0x52 WRITE, page 0:0 overwriting byte 0x41
            0xb5, 0x2, 0x30, 0x0, 0x0, 0xf6 WRITE, same page overwriting 0x____05c6 with 0x3000f6
            0x34, 0x0, 0x0, 0x41, 0x51 WRITE, page 0:0 overwriting byte 0x41
            0xc3, 0x0, 0x4, 0xff MEMSET, same page, 4 bytes of 0xff this is how FIL_NULL should be written
            0xc3, 0x2, 0x4, 0xff MEMSET, 2 bytes after the previous one also writing FIL_NULL
            0x35, 0x0, 0x0, 0x40, 0x0, 0x0 write, page 0:0 overwriting bytes 0x40 and 0x41
            0x39, 0x0, 0x9f, 0x80, 0x80, 0x26, 0x0, 0x0, 0x1, 0xa6 WRITE, page 0:0x2000, offset 0xa6, 0x000001a6 also the writes to page 0:0x2000 should be sorted by offset
            0x37, 0x0, 0x0, 0x81, 0x27, 0x0, 0x20, 0x0 WRITE, page 0:0, offset 0x1a7
            0x34, 0x0, 0x0, 0x51, 0x3 WRITE, offset 0x51 2 bytes wasted again due to misordered offset
            0x47, 0x0, 0x9f, 0x80, 0x85, 0x3e, 0x58, 0x0 MEMSET, page 0:0x2000, offset 0x5be, 0x58 bytes 0x00  
            0x47, 0x0, 0x9f, 0x80, 0x84, 0xe6, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x84, 0x8e, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x84, 0x36, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x83, 0xde, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x83, 0x86, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x83, 0x2e, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x47, 0x0, 0x9f, 0x80, 0x82, 0xd6, 0x58, 0x0 MEMSET, page 0:0x2000 3 bytes wasted again
            0x24, 0x0, 0xa7, 0x80, 0xa EXTENDED, page 0x2800, TRIM_PAGES OK
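            As an aside, the first byte of each record above can be classified mechanically. The following shell sketch is only my illustration: the assumed field layout (bit 0x80 = "same page as the previous record", bits 0x70 = record type, low nibble = remaining record length) is inferred from the annotations in the tables above, not quoted from the server sources.

```shell
# Sketch: split the first byte of a log record into the fields implied
# by the annotations above. Assumed layout: bit 0x80 = same page as the
# previous record; bits 0x70 = type (0x20 EXTENDED, 0x30 WRITE,
# 0x40 MEMSET, 0x50 MEMMOVE); low 4 bits = remaining record length.
decode() {
  b=$(( $1 ))
  same=$(( (b >> 7) & 1 ))
  t=$(( b & 0x70 ))
  name=OTHER
  if [ $t -eq $(( 0x20 )) ]; then name=EXTENDED
  elif [ $t -eq $(( 0x30 )) ]; then name=WRITE
  elif [ $t -eq $(( 0x40 )) ]; then name=MEMSET
  elif [ $t -eq $(( 0x50 )) ]; then name=MEMMOVE
  fi
  echo "$name same_page=$same len=$(( b & 0xf ))"
}

decode 0x36   # WRITE same_page=0 len=6
decode 0xb4   # WRITE same_page=1 len=4
decode 0xc3   # MEMSET same_page=1 len=3
decode 0x24   # EXTENDED same_page=0 len=4
```

            The decoded values agree with the per-record annotations in the tables, e.g. 0x36 introducing a 7-byte WRITE record and 0xb4 a same-page WRITE.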

            This is not too bad, but I would like to improve this to have a known worst case. If I understood correctly, the worst case should be as follows:

            number of pages modified 2 (page 0 and the last surviving extent descriptor)
            number of additional pages accessed at most 1 at a time (one extent descriptor page)
            amount of redo log written 2*innodb_page_size (to rewrite both extent descriptor pages)

            I don’t think that there currently is any clear limit on the amount of redo log written. The easiest approach would be to keep copies of the 1 or 2 modified pages and optimize the log writes by comparing the pages. Why 2 pages? We assume that no extent descriptor page before the last surviving one can contain pointers to extent descriptor pages that would be discarded by shrinking the tablespace. I hope that thiru can add some debug assertions to prove this, or point to existing debug assertions that check it.

            One more thing that we discussed is how to avoid truncation and expansion on every startup. I would not want to add any start-up parameter for that. I think that a good rule would be: if the size of the system tablespace can be shrunk by more than 1/8 (12.5%) during startup, then it will be shrunk. Currently this check is missing: we always try to shrink, no matter what. It could be useful to add some DBUG_EXECUTE_IF to enable the "always shrink" logic for testing.
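            A minimal sketch of that 1/8 rule in shell arithmetic, using the page counts from the test above (the threshold itself is only a proposal at this point, not implemented behaviour):

```shell
# Proposed heuristic (sketch): shrink only when more than 1/8 of the
# current size would be reclaimed. Sizes are in pages, as in the message
# "Truncating system tablespaces from 117760 to 10240 pages".
current=117760
target=10240
if [ $(( (current - target) * 8 )) -gt "$current" ]; then
  echo "shrink: reclaiming $(( current - target )) pages"
else
  echo "keep current size of $current pages"
fi
```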

            marko Marko Mäkelä added a comment -

            Even though the extent descriptor page 0x1000 was not modified in my above test, thiru says that if the page allocation patterns had been different, this could be needed.

            The actual worst-case limit of the number of pages latched (and possibly modified) is truncated_size_in_pages/innodb_page_size.

            The worst case would be shrinking an innodb_page_size=4k tablespace so that the last possible descriptor page 0xfffff000 is removed. All 1,048,575 descriptor pages (one every 0x1000 pages) from 0x00000000 to 0xffffe000 would have to be accessed. If we held latches on all these pages at the same time, we would need 4GiB of buffer pool, which is more than 1,000 times the minimum innodb_buffer_pool_size. So, we should avoid doing that.
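            The arithmetic behind those figures can be checked quickly (all numbers are taken from the paragraph above):

```shell
# One descriptor page every 0x1000 pages of 4 KiB each; the last
# surviving one is at page 0xffffe000, the removed one at 0xfffff000.
pages=$(( 0xfffff000 / 0x1000 ))   # descriptor pages from 0 to 0xffffe000
bytes=$(( pages * 4096 ))          # buffer pool needed to latch them all
echo "$pages descriptor pages, $bytes bytes of buffer pool"
```

            That is 1,048,575 descriptor pages and just under 4 GiB of buffer pool, matching the estimate above.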

            The typical amount of redo log written for one page should be much smaller than the page size, hopefully something like 50 bytes. If the maximum mini-transaction size is 4MiB, that should be enough for modifying some 80,000 extent descriptor pages.

            I do not think that the worst case would involve modifying all 1,048,575 descriptor pages. There is only a limited number of list nodes in a descriptor page, let’s call it N. So, at most N pages can have pointers to the last descriptor page and may have to be modified. If we shrink the tablespace so that the M last descriptor pages ending at 0xfffff000 are modified, then we may have to modify N*M surviving descriptor pages. What is the maximum possible M*N?

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä added a comment - - edited

            Because there are 256 extent descriptor list elements (XDES_FLST_NODE) per extent descriptor page, shrinking the system tablespace by 1 extent will involve modifying at most 256 other extent descriptor pages. Shrinking by 2 extents may require modifying up to 2*256=512 extent descriptor pages.

            Close to the worst case should be to shrink a tablespace with innodb_page_size=4k from the maximum number of extents (1,048,576) by something like 3,584 extents to 1,044,992 extents. If everything is laid out in the pessimal way, this would involve modifying 3584*256=917504 extent descriptor pages, requiring 3.5 GiB of buffer pool. If the maximum mini-transaction log size is 2MiB (before MDEV-14425, it might be even less), we would only have some 2.28 bytes per page available for logging the modifications to each page. That is clearly not enough, since the minimum size of the record header is 3 bytes (4 bytes for logging changes to any other extent descriptor page than the first one).
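            Those worst-case numbers check out with shell arithmetic (all figures from the paragraph above):

```shell
pages=$(( 3584 * 256 ))        # surviving descriptor pages to modify
pool=$(( pages * 4096 ))       # buffer pool needed, in bytes (3.5 GiB)
budget=$(( 2 * 1024 * 1024 ))  # 2 MiB mini-transaction log budget
# Integer division; the exact quotient is about 2.28 bytes per page.
echo "$pages pages, $pool bytes of pool, $(( budget / pages )) byte(s)/page"
```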

            We clearly must implement some logic to determine whether the shrinking operation is possible at all (whether we have enough buffer pool and log space available). It can succeed only if the system tablespace is shrunk by a small enough amount, or to a small enough final size. We must validate everything first, before committing to shrink the tablespace.

            Furthermore, to avoid any bugs or extra overhead at startup related to this, it would be good to make this feature opt-in, that is, only attempt shrinking the system tablespace if innodb_data_file_path contains an extra attribute :autoshrink to enable the attempt.


            I think that implementing the following will address the biggest concerns. I verified that the first two items on my list were done today:

            • Write all 32 bits of FSP_SIZE at 0x2e.
            • Make this opt-in by requiring an :autoshrink attribute to be added to innodb_data_file_path.
            • Check that we will not run out of buffer pool.
            • Check that we will not write too much redo log: the maximum mini-transaction size in the MDEV-14425 format should be 2MiB, including the 5 bytes of terminator and checksum.
            • If either limit is exceeded, revert any changes to the buffer pool and discard the mini-transaction log.
            • For test purposes, introduce some debug parameter or DBUG_EXECUTE_IF to make the maximum mini-transaction size smaller, say, 100 to 1000 bytes.
            • Add debug code to invoke flst_validate() on all the extent descriptor lists in the system tablespace, before and after attempting truncation.

            Optimizing the redo log size is optional and might be done later, if there is real customer demand for it. I would anticipate other demands for follow-up improvements first, such as these:

            • Defragmenting or optimizing the system tables, to ensure that all data will be stored near the start of the system tablespace.
            • Repairing a system tablespace to minimal size by copying the actual needed contents, excluding things like the SYS_TABLESPACES or SYS_DATAFILES tables that existed between MariaDB Server 10.0 and 10.5 until MDEV-22343 removed them.
            marko Marko Mäkelä added a comment -
            mleich Matthias Leich added a comment - - edited

            origin/bb-11.2-MDEV-14795_1 1779cb1af979cad1abe9d5c3ed4d37b6503175aa 2023-07-11T20:25:49+05:30 + some patch (build with debug+asan+optimization Og)
            behaved relatively well in RQG testing. I observed many bad effects, but they occurred/occur in preview-11.2,
            11.2 b8c039fad16adf9cbc6dac2e34cdeb772b9065a0 2023-07-07T09:05:33+10:00,
            and sometimes in other trees too.
             
            The main testing campaign consisted of ~9900 tests which all used the server
            startup option "--mysqld=--innodb_data_file_path=ibdata1:1M:autoextend:autoshrink".
            Statistics from the remaining test results:
            smallest system tablespace ever shrunk: 1024 pages
            biggest system tablespace ever shrunk: 41152 pages
            biggest shrink operation: ~16000 pages truncated
            smallest shrink operation: ~250 pages truncated
            
            


            Complete patch is in bb-11.2-MDEV-14795_1

            thiru Thirunarayanan Balathandayuthapani added a comment -
            mleich Matthias Leich added a comment - - edited

            origin/bb-11.2-MDEV-14795_1 6575d969aef4d06fb529f467f6c9d3ca60b0bb53 2023-07-17T10:43:15+05:30 (build type RelWithDebInfo)
            performed well in RQG testing. No new problems.

            Statistics from the remaining test results:
            smallest system tablespace ever shrunk: 1024 pages
            biggest system tablespace ever shrunk: 16640 pages
            biggest shrink operation: ~16000 pages truncated
            smallest shrink operation: ~832 pages truncated
            I assume that the difference from the statistics of the previous run
            is caused by the different source tree.
            A build with debug+asan+optimization Og showed numbers similar to the
            RelWithDebInfo build.


            I posted some review comments, mostly about the formatting of messages, and code that is not covered by mtr tests. I think that most code should be covered if you create a debug-instrumented big_test that will use innodb_page_size=4k and extend the system tablespace to slightly more than 16 MiB (4096*4096 bytes). With innodb_limit_optimistic_insert_debug=2 (like my previously posted script does) it should be relatively fast. I think that the mandatory MSAN builder should run such a test.

            Tomorrow, I will review the logic to "roll back" a mini-transaction when the size constraints are exceeded.

            marko Marko Mäkelä added a comment -

            The general logic looks good to me. I think that this needs one more review round after the code coverage in the regression test suite has been improved.

            marko Marko Mäkelä added a comment -

            The code coverage looks acceptable to me now. I made some minor suggestions on improving the out-of-memory detection. This should be OK to push after addressing the last round of my comments.

            marko Marko Mäkelä added a comment -

            origin/bb-11.2-MDEV-14795 f9003c73a15d892423845a2df96c8bd73bbe96ca 2023-08-01T19:43:04+05:30
            performed well in RQG testing. The failures observed occur on other trees too, are in JIRA or are not caused by MDEV-14795.
            

            mleich Matthias Leich added a comment -

            Together with MDEV-19229, this allows the storage space consumed by the InnoDB system tablespace to be reclaimed.

            The defragmentation that is mentioned in the Description never shrunk any files; it only tried to move some B-tree pages within a data file in an attempt to improve performance on HDDs. That code was removed in MDEV-30545.

            marko Marko Mäkelä added a comment -
            psumner Phil Sumner added a comment -

            Does this change mean that there will be a delay in startup for particularly large ibdata1 files?

            i.e. I've got a Galera cluster where ibdata1 has over time grown to > 300GB, and it definitely doesn't need to be this big with `file_per_table` having been set since database creation.


            psumner, shrinking the system tablespace involves:

            1. traversing every allocation bitmap page (pages 0, innodb_page_size, 2*innodb_page_size…) within the old size,
            2. adjusting the bitmap pages within the shrunk size,
            3. durably writing log for the change, and
            4. shrinking the last system tablespace file by ftruncate()

            When using the default innodb_page_size=16k, there will be an allocation bitmap page every 16384*16384 bytes (256 MiB, or a quarter gigabyte). Shrinking a 300 GiB file would involve traversing 1,200 bitmap pages. In an extreme case, the file could be shrunk to 256 MiB or less, that is, we would write log covering just the first page of the file. In such an extreme case, the time should be dominated by the file system operation. Truncating or deleting a heavily fragmented file can take considerable time; how much depends on the type of storage, the file system, and the operating system kernel version.
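            A quick shell check of those figures (page size and file size as in the paragraph above; requires a shell with 64-bit arithmetic):

```shell
# One allocation bitmap page covers 16384 pages of 16384 bytes each.
region=$(( 16384 * 16384 ))            # bytes covered per bitmap page (256 MiB)
file=$(( 300 * 1024 * 1024 * 1024 ))   # a 300 GiB ibdata1
echo "one bitmap page per $region bytes; $(( file / region )) pages to traverse"
```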

            By popular demand, I filed MDEV-32452 so that this cleanup can be executed on shutdown rather than startup.

            marko Marko Mäkelä added a comment -

            People

              thiru Thirunarayanan Balathandayuthapani
              martina342 martina342
              Votes: 2
              Watchers: 15

