Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21351

Replace recv_sys.heap with list of buf_block_t*

Details

    Description

      The following suggestion was given to reduce the memory usage during recovery.
      1) Remove recv_sys.buf and replace it with log_sys.buf
      2) Remove recv_sys.heap with list of buf_block_t*

      But I would like to do (2) alone. It doesn't make sense to remove recv_sys.buf completely.
      Because log_sys.buf is being used to read the redo log pages during recovery.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            This was originally filed as MDEV-19176, which we ended up implementing in a reduced scope for GA releases.

            I think that we must do the following as part of this task:

            • Remove recv_sys_t::heap and MEM_HEAP_FOR_RECV_SYS.
              • Allocate the recv_sys.pages and its elements directly by the system allocator.
              • Allocate buffer pool blocks directly by buf_block_alloc(). Use block->page.list for keeping track of allocated blocks.
              • Remove mem_heap_t::total_size that was introduced for speeding up mem_heap_get_size() calls during recovery.
              • Append log records to the last block, until the block->frame is filled up. (Similar to the current recv_sys.heap.)
              • Repurpose buf_block_t::modify_clock to count the number of pointers from recv_sys.pages, so that we will know when a block is safe to be freed.
              • Feel free to repurpose other fields, such as buf_page_t::oldest_modification, for storing state information. These were never used on block->page.state == BUF_BLOCK_MEMORY blocks.
            • Ensure that our buf_block_alloc() invocation is at least as robust as with the MDEV-19176 fix. If we cannot find a free block, do one of the following:
              • Flush and evict modified pages. (This is already being done by the current code.)
              • Apply buffered records to pages. Because of the reference counting (see buf_block_t::modify_clock above), this should free up blocks that held redo log records.
              • When allocating a block for redo log records, ensure that some block(s) will remain available for buf_page_create() or buf_page_get_gen(), for applying log to pages, so that the memory for redo log records can eventually be freed.

            The following are not strictly necessary at this point:

            • Remove the fixed-size (2MiB) buffer recv_sys.buf, and instead of it, use log_sys.buf (whose minimum size is 16MiB) for reading from the redo log file.
            • Remove recv_t::len and recv_data_t. Directly store pointers to records.
              • Remove recv_data_copy_to_buf() and RECV_DATA_BLOCK_SIZE. We cannot do this due to MLOG_COMP_REC_UPDATE_IN_PLACE records not being guaranteed to fit in innodb_page_size.
            • If possible, avoid copying records from the ‘read buffer’, as currently done by recv_add_to_hash_table() or recv_sys.add().
              • Only copy or move records when the ‘read buffer’ fills up. Algorithm: Traverse recv_sys.pages. For any pointer that is within the read buffer, determine the length to copy, and then copy the log to another buffer. At the end of the traversal, the entire read buffer can be freed.
              • Note: New entries may be appended to log_sys.buf during recovery, by arbitrary operations (after MDEV-14481 is implemented). We must be prepared to free up space from log_sys.buf at any time.

            thiru, while I agree that it is not strictly necessary to remove recv_sys.buf as part of this task, I would appreciate more detailed reasoning for keeping it.

            marko Marko Mäkelä added a comment - - edited This was originally filed as MDEV-19176 , which we ended up implementing in a reduced scope for GA releases. I think that we must do the following as part of this task: Remove recv_sys_t::heap and MEM_HEAP_FOR_RECV_SYS . Allocate the recv_sys.pages and its elements directly by the system allocator. Allocate buffer pool blocks directly by buf_block_alloc() . Use block->page.list for keeping track of allocated blocks. Remove mem_heap_t::total_size that was introduced for speeding up mem_heap_get_size() calls during recovery. Append log records to the last block, until the block->frame is filled up. (Similar to the current recv_sys.heap .) Repurpose buf_block_t::modify_clock to count the number of pointers from recv_sys.pages , so that we will know when a block is safe to be freed. Feel free to repurpose other fields, such as buf_page_t::oldest_modification , for storing state information. These were never used on block->page.state == BUF_BLOCK_MEMORY blocks. Ensure that our buf_block_alloc() invocation is at least as robust as with the MDEV-19176 fix. If we cannot find a free block, do one of the following: Flush and evict modified pages. (This is already being done by the current code.) Apply buffered records to pages. Because of the reference counting (see buf_block_t::modify_clock above), this should free up blocks that held redo log records. When allocating a block for redo log records, ensure that some block(s) will remain available for buf_page_create() or buf_page_get_gen() , for applying log to pages, so that the memory for redo log records can eventually be freed. The following are not strictly necessary at this point: Remove the fixed-size (2MiB) buffer recv_sys.buf , and instead of it, use log_sys.buf (whose minimum size is 16MiB) for reading from the redo log file. Remove recv_t::len and recv_data_t . Directly store pointers to records. Remove recv_data_copy_to_buf() and RECV_DATA_BLOCK_SIZE . We cannot do this due to MLOG_COMP_REC_UPDATE_IN_PLACE records not being guaranteed to fit in innodb_page_size . If possible, avoid copying records from the ‘read buffer’, as currently done by recv_add_to_hash_table() or recv_sys.add() . Only copy or move records when the ‘read buffer’ fills up. Algorithm: Traverse recv_sys.pages . For any pointer that is within the read buffer, determine the length to copy, and then copy the log to another buffer. At the end of the traversal, the entire read buffer can be freed. Note: New entries may be appended to log_sys.buf during recovery, by arbitrary operations (after MDEV-14481 is implemented). We must be prepared to free up space from log_sys.buf at any time. thiru , while I agree that it is not strictly necessary to remove recv_sys.buf as part of this task, I would appreciate more detailed reasoning for keeping it.

            Regarding the removal of RECV_DATA_BLOCK_SIZE, we should guarantee that the size of an individual log record never exceeds innodb_page_size. In the pre-MDEV-12353 record format, the longest records could be MLOG_UNDO_INSERT (writing an undo log record), MLOG_WRITE_STRING (when used for writing BLOB pages) or MLOG_ZIP_PAGE_COMPRESS (for writing ROW_FORMAT=COMPRESSED index pages), and possibly MLOG_COMP_REC_UPDATE_IN_PLACE (for updating a record in ROW_FORMAT≠REDUNDANT).

            The maximum size of MLOG_UNDO_INSERT is innodb_page_size+11+13-56 bytes or less. No problem there.
            The maximum size of MLOG_ZIP_PAGE_COMPRESS is innodb_page_size+11+2+2+8-24 bytes, that is, 1 byte below the limit.
            The maximum size of MLOG_WRITE_STRING should be innodb_page_size+11-26 bytes for writing ROW_FORMAT=COMPRESSED BLOB, and 16 bytes less for other BLOBs.

            For MLOG_COMP_REC_UPDATE_IN_PLACE, the maximum-size record ought to be 11+2*n_fields+1+1+7+7+2+n_updated_fields*(2+2+len)+1+2 bytes. The worst case must necessarily be a clustered index record. Because the maximum number of columns in a key is 16, a secondary index record can have at most 32+1 fields (1 field for the child page number in a node pointer). The minimum size of a clustered index record in ROW_FORMAT≠REDUNDANT is 5+6+7=18 bytes (at least 5 bytes of header and 6+7 bytes for DB_TRX_ID,DB_ROLL_PTR). The 13 bytes of the system columns are logged separately in the redo log record. So, the maximum log record size can be simplified to 19+n_fields*2+n_updated_fields*4+len bytes. An index page must hold at least 2 records, so the maximum record size is (innodb_page_size-120-8-2*2)/2 bytes, or innodb_page_size/2-66 bytes, including the record header. The minimum record header size of 5 bytes occurs when all columns are fixed-length and NOT NULL. Assuming that, we will get the following limits:

            innodb_page_size maximum index record size maximum log record size
            4096 1982 8104
            8192 4030 10152
            16384 8126 14248
            32768 16318 22440
            65536 16383 22505

            With innodb_page_size<16k it is possible that the MLOG_COMP_REC_UPDATE_IN_PLACE record is longer than a single page, and hence we cannot remove RECV_DATA_BLOCK_SIZE as part of this task. The worst-case overhead factor ranges from 37% to 309%.

            marko Marko Mäkelä added a comment - Regarding the removal of RECV_DATA_BLOCK_SIZE , we should guarantee that the size of an individual log record never exceeds innodb_page_size . In the pre- MDEV-12353 record format, the longest records could be MLOG_UNDO_INSERT (writing an undo log record), MLOG_WRITE_STRING (when used for writing BLOB pages) or MLOG_ZIP_PAGE_COMPRESS (for writing ROW_FORMAT=COMPRESSED index pages), and possibly MLOG_COMP_REC_UPDATE_IN_PLACE (for updating a record in ROW_FORMAT≠REDUNDANT ). The maximum size of MLOG_UNDO_INSERT is innodb_page_size +11+13-56 bytes or less. No problem there. The maximum size of MLOG_ZIP_PAGE_COMPRESS is innodb_page_size +11+2+2+8-24 bytes, that is, 1 byte below the limit. The maximum size of MLOG_WRITE_STRING should be innodb_page_size +11-26 bytes for writing ROW_FORMAT=COMPRESSED BLOB, and 16 bytes less for other BLOBs. For MLOG_COMP_REC_UPDATE_IN_PLACE , the maximum-size record ought to be 11+2* n_fields +1+1+7+7+2+ n_updated_fields *(2+2+len)+1+2 bytes. The worst case must necessarily be a clustered index record. Because the maximum number of columns in a key is 16, a secondary index record can have at most 32+1 fields (1 field for the child page number in a node pointer). The minimum size of a clustered index record in ROW_FORMAT≠REDUNDANT is 5+6+7=18 bytes (at least 5 bytes of header and 6+7 bytes for DB_TRX_ID,DB_ROLL_PTR ). The 13 bytes of the system columns are logged separately in the redo log record. So, the maximum log record size can be simplified to 19+ n_fields *2+ n_updated_fields *4+len bytes. An index page must hold at least 2 records, so the maximum record size is ( innodb_page_size -120-8-2*2)/2 bytes, or innodb_page_size /2-66 bytes, including the record header. The minimum record header size of 5 bytes occurs when all columns are fixed-length and NOT NULL . Assuming that, we will get the following limits: innodb_page_size maximum index record size maximum log record size 4096 1982 8104 8192 4030 10152 16384 8126 14248 32768 16318 22440 65536 16383 22505 With innodb_page_size<16k it is possible that the MLOG_COMP_REC_UPDATE_IN_PLACE record is longer than a single page, and hence we cannot remove RECV_DATA_BLOCK_SIZE as part of this task. The worst-case overhead factor ranges from 37% to 309%.

            RQG test batterie with broad range coverage on bb-10.5-MDEV-21351 commit 90bd99909d2e42dac8af75b24d27fee411c0def1
            - no unknown issue hit
            - there is the usual share of failing tests like on actual 10.5
                  hitting open bugs   Deadlock (RQG's opinion), crash during shutdown, MDEV-20775 etc.
            

            mleich Matthias Leich added a comment - RQG test batterie with broad range coverage on bb-10.5-MDEV-21351 commit 90bd99909d2e42dac8af75b24d27fee411c0def1 - no unknown issue hit - there is the usual share of failing tests like on actual 10.5 hitting open bugs Deadlock (RQG's opinion), crash during shutdown, MDEV-20775 etc.
            mleich Matthias Leich added a comment - - edited

            bb-10.5-MDEV-21351 commit f67c9b67044fe0931bc96371718b02e58797ee51 2020-01-28
            Results of RQG testing with test battery for broad range functional coverage
            1051 finished RQG runs
            - 1020 * "pass"
            - certain asserts/crashes because of open bugs or fix has not yet reached the source tree  like MDEV-20775
            - 2 * [ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!
              Test analysis followed by simplification -> Replay test -> This replays in actual 10.5 too.
              So its not a problem introduced by MDEV-21351 code.
               to be inspected
            - 1 * mysqld: storage/innobase/include/rem0rec.ic:1393: ulint rec_get_converted_size(dict_index_t*, const dtuple_t*, ulint):
                     Assertion `n_ext == dtuple_get_n_ext(dtuple)' failed.      Maybe fixed by MDEV-21511
            

            mleich Matthias Leich added a comment - - edited bb-10.5-MDEV-21351 commit f67c9b67044fe0931bc96371718b02e58797ee51 2020-01-28 Results of RQG testing with test battery for broad range functional coverage 1051 finished RQG runs - 1020 * "pass" - certain asserts/crashes because of open bugs or fix has not yet reached the source tree like MDEV-20775 - 2 * [ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected! Test analysis followed by simplification -> Replay test -> This replays in actual 10.5 too. So its not a problem introduced by MDEV-21351 code. to be inspected - 1 * mysqld: storage/innobase/include/rem0rec.ic:1393: ulint rec_get_converted_size(dict_index_t*, const dtuple_t*, ulint): Assertion `n_ext == dtuple_get_n_ext(dtuple)' failed. Maybe fixed by MDEV-21511
            marko Marko Mäkelä added a comment - - edited

            I pushed a follow-up improvement that frees the blocks that store redo log records as soon as the records have been processed. It also implements ASAN and Valgrind instrumentation to catch out-of-bounds access.

            marko Marko Mäkelä added a comment - - edited I pushed a follow-up improvement that frees the blocks that store redo log records as soon as the records have been processed. It also implements ASAN and Valgrind instrumentation to catch out-of-bounds access.

            People

              thiru Thirunarayanan Balathandayuthapani
              thiru Thirunarayanan Balathandayuthapani
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.