Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-40033

HEAP: free-list contiguous block coalescing

    XMLWordPrintable

Details

    Description

      Summary

      The HEAP engine's delete free list is a singly-linked list of individual records. Each deleted record stores a uchar* pointer (8 bytes) to the next deleted record. All allocation and traversal is O(N) per record, even when records are physically contiguous in HP_BLOCK memory. The free list has no way to express contiguity, forcing record-by-record pointer chasing.

      This affects:

      • Single-record allocation (next_free_record_pos()): pops one record per call, one pointer dereference each
      • Blob continuation chain allocation (hp_write_one_blob() Step 1): walks the entire free list checking pos == run_start - recbuffer to detect contiguity – O(free_list_length) per blob write
      • Sequential scan (heap_scan()): steps through deleted records one by one, checking pos[visible] on each
      • Table validation (heap_check_heap()): same one-by-one walk

      For a table with 1M deleted contiguous records, this means up to 1M pointer dereferences for operations that could be answered with simple pointer arithmetic.

      Solution: Contiguous Block Entries on the Free List

      Transform the free list from a chain of individual records into a chain of contiguous blocks. Each block uses two records (first and last) to represent an arbitrarily large contiguous group. Middle records are "dark" – not on the free list, implicitly covered by the block's address range.

      Record Layout

      Deleted records already have a del_link pointer at bytes 0-7 and a flags byte at pos[visible] (zeroed for deleted records). Between them, bytes 8-10 are currently unused garbage (record data area of a deleted record). These are repurposed as block metadata:

      • Byte 8 (del_flag): flag field with two bits defined:
        • Bit 0 (HP_DEL_BLOCK_END): this record is the last (highest address) of a contiguous free block
        • Bit 1 (HP_DEL_BLOCK_START): this record is the first (lowest address) of a contiguous free block
      • Bytes 9-10 (del_count): uint16 record count, stored on block-start records only

      Current deleted record layout (single record, unchanged for N=1):

      +-----------------------+---------------+-------------+----- ... ------+---------+
      | del_link              | del_flag      | garbage     | garbage        | flags   |
      | (8 bytes, -> next)    | = 0           | (2 bytes)   |                | = 0     |
      +-----------------------+---------------+-------------+----- ... ------+---------+
        byte 0                7               8             10               visible
      

      Block-start record (first record of a contiguous block, rec[0]):

      +-----------------------+---------------+-------------+----- ... ------+---------+
      | del_link              | del_flag      | del_count   | garbage        | flags   |
      | (8 bytes, -> next)    | = 0x02        | (uint16, N) |                | = 0     |
      +-----------------------+---------------+-------------+----- ... ------+---------+
        byte 0                7 BLOCK_START   8             10               visible
      

      Block-end record (last record of a contiguous block, rec[N-1]):

      +-----------------------+---------------+-------------+----- ... ------+---------+
      | del_link              | del_flag      | garbage     | garbage        | flags   |
      | (8 bytes, -> rec[0])  | = 0x01        | (2 bytes)   |                | = 0     |
      +-----------------------+---------------+-------------+----- ... ------+---------+
        byte 0                7 BLOCK_END     8             10               visible
      

      Dark record (middle record, rec[1..N-2]):

      +-----------------------+---------------+-------------+----- ... ------+---------+
      | NULL                  | del_flag      | 0x0000      | garbage        | flags   |
      | (8 bytes, zeroed)     | = 0           | (zeroed)    |                | = 0     |
      +-----------------------+---------------+-------------+----- ... ------+---------+
        byte 0  poisoned      7               8             10               visible
      

      Block Structure

      Free-list chain for a contiguous block of N deleted records:

      share->del_link
            |
            v
        rec[N-1]  ----del_link----->  rec[0]  ----del_link----->  (next free-list entry)
        (BLOCK_END)                   (BLOCK_START, count=N)
       
        rec[1] .. rec[N-2] are "dark" -- not on the free list,
        bytes 0-10 zeroed, pos[visible] = 0
      

      For single records (N=1): del_flag = 0, current behavior unchanged.

      Two Traversal Directions

      Free-list direction (descending addresses – allocation and blob write):

      Hit rec[N-1] (block-end), read pointer to rec[0], compute block size via pointer arithmetic or read the count from rec[0] bytes 9-10. Follow rec[0]'s del_link to the next entry. Two pointer reads per block instead of N.

      Scan direction (ascending addresses – heap_scan(), heap_check_heap()):

      Hit rec[0] (block-start), read count from bytes 9-10, batch-skip the entire block. Same pattern as the existing blob continuation run batch-skip.

      Dark Records

      Records rec[1] through rec[N-2] are "dark" – not individually chained on the free list. All dark records have bytes 0-10 zeroed (del_link = NULL, del_flag = 0, count = 0) and pos[visible] = 0. This provides:

      • Immediate SIGSEGV on accidental pointer dereference (NULL)
      • Rapid corruption detection: any non-zero value in bytes 0-10 of a dark record indicates memory corruption
      • Safety net in case the scan encounters a dark record (e.g., due to a bug): it sees pos[visible] == 0 and skips

      visible Minimum Bumped to 11

      Block metadata uses 11 bytes: 8 (del_link) + 1 (del_flag) + 2 (del_count). The visible minimum is bumped from sizeof(char*) (8) to 11. This has zero memory impact: recbuffer = ALIGN(visible + 1, 8). For reclength 1 through 10, recbuffer is 16 both before and after the bump. The 8-byte alignment absorbs the shift.

      reclength  visible_old  recbuf_old  visible_new  recbuf_new
      1-8        8            16          11           16          (no change)
      9          9            16          11           16          (no change)
      10         10           16          11           16          (no change)
      11         11           16          11           16          (no change)
      12+        reclength     unchanged   reclength    unchanged   (no change)
      

      No conditional guards or fallback paths needed. Every table gets block coalescing unconditionally. The existing blob-specific bump (MY_MAX(visible_offset, HP_CONT_HEADER_SIZE + 1)) becomes redundant.

      API

      All encapsulated as static inline functions in heapdef.h.

      Push Side (freeing records)

      • hp_push_free_record(share, pos) – push a single record onto the free list. Sets del_link, del_flag = 0, pos[visible] = 0, adjusts counters. Replaces the 3-line pattern currently inlined in heap_delete(), heap_write() error path, and hp_free_run_chain().
      • hp_push_free_block(share, first, count) – push a contiguous block. Sets up block-start, block-end, zeroes dark records. Used by hp_free_run_chain() for multi-record blob continuation runs.

      Traverse Side (read-only)

      Free-list direction:

      • hp_is_free_block_end(pos) – check del_flag & HP_DEL_BLOCK_END
      • hp_free_block_first(pos) – read pointer to first record of block
      • hp_free_block_count(pos, recbuffer) – pointer arithmetic or read from first record
      • hp_free_block_next(pos) – next free-list entry (via first record's del_link)

      Scan direction:

      • hp_is_free_block_start(pos) – check del_flag & HP_DEL_BLOCK_START
      • hp_free_block_start_count(pos) – read uint16 count from bytes 9-10

      Consume Side (taking records from the free list)

      • hp_pop_free_record(share) – pop one record from the free-list head. If the head is a single record, pops it directly. If the head is a block-end, delegates to hp_take_free_block(share, 1).
      • hp_take_free_block(share, count) – take N records from the head block (partial or entire). Three cases:
        • Block consumed entirely (remaining == 0): del_link advances to next entry via rec[0]'s pointer
        • Block collapses to 1 (remaining == 1, new_last == first): del_link = first. Critical: do NOT write first[0..7] – it already holds the chain to the next entry
        • Block shrinks (remaining >= 2): promote dark record to new block-end. Write new_last[0..7] = first, new_last[8] = HP_DEL_BLOCK_END, del_link = new_last. Update rec[0] count.

      Affected Code Paths

      Push paths

      • heap_delete() (hp_delete.c) – single record delete: hp_push_free_record()
      • heap_write() error path (hp_write.c) – failed insert rollback: hp_push_free_record()
      • hp_free_run_chain() (hp_blob.c) – blob continuation chain free: hp_push_free_block() per multi-record run, hp_push_free_record() for single-record runs

      Consume paths

      • next_free_record_pos() (hp_write.c) – single record allocation: hp_pop_free_record()
      • hp_take_free_list_runs() (hp_blob.c) – shared helper for Steps 1 and 3 of hp_write_one_blob(), parameterized by min_avail. Dispatches to hp_take_free_block() for blocks, hp_pop_free_record() for singles. Eliminates the O(N) contiguity detection walk entirely.

      Scan paths (batch-skip deleted blocks)

      • heap_scan() (hp_scan.c) – on deleted record, check hp_is_free_block_start(), read count, batch-skip entire block. Same pattern as continuation run batch-skip.
      • heap_check_heap() (_check.c) – block-aware free-list count and scan; validates block-start/end flags, dark record zeroing (debug), bounds on block count vs total_records + deleted

      Other affected paths

      • hp_shrink_tail() (hp_blob.c) – reclaims entire blocks at the tail in one step when the free-list head is a block-end at the tail boundary

      Not Implemented: Coalescing on heap_delete()

      Neighbor coalescing (merging a newly deleted record with adjacent deleted neighbors) was not implemented. Currently blocks are only formed by hp_free_run_chain() when freeing blob continuation chains. Individual deletes always push singles. The singly-linked free list makes mid-list block merging complex; future options include doubly-linked block list, deferred coalescing, or restricting to head-adjacent cases.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              arcivanov Arcadiy Ivanov
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.