Details
-
New Feature
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
Summary
The HEAP engine's delete free list is a singly-linked list of individual records. Each deleted record stores a uchar* pointer (8 bytes) to the next deleted record. All allocation and traversal is O(N) per record, even when records are physically contiguous in HP_BLOCK memory. The free list has no way to express contiguity, forcing record-by-record pointer chasing.
This affects:
- Single-record allocation (next_free_record_pos()): pops one record per call, one pointer dereference each
- Blob continuation chain allocation (hp_write_one_blob() Step 1): walks the entire free list checking pos == run_start - recbuffer to detect contiguity – O(free_list_length) per blob write
- Sequential scan (heap_scan()): steps through deleted records one by one, checking pos[visible] on each
- Table validation (heap_check_heap()): same one-by-one walk
For a table with 1M deleted contiguous records, this means up to 1M pointer dereferences for operations that could be answered with simple pointer arithmetic.
Solution: Contiguous Block Entries on the Free List
Transform the free list from a chain of individual records into a chain of contiguous blocks. Each block uses two records (first and last) to represent an arbitrarily large contiguous group. Middle records are "dark" – not on the free list, implicitly covered by the block's address range.
Record Layout
Deleted records already have a del_link pointer at bytes 0-7 and a flags byte at pos[visible] (zeroed for deleted records). Between them, bytes 8-10 are currently unused garbage (record data area of a deleted record). These are repurposed as block metadata:
- Byte 8 (del_flag): flag field with two bits defined:
- Bit 0 (HP_DEL_BLOCK_END): this record is the last (highest address) of a contiguous free block
- Bit 1 (HP_DEL_BLOCK_START): this record is the first (lowest address) of a contiguous free block
- Bytes 9-10 (del_count): uint16 record count, stored on block-start records only
Current deleted record layout (single record, unchanged for N=1):
+-----------------------+---------------+-------------+----- ... ------+---------+
|
| del_link | del_flag | garbage | garbage | flags |
|
| (8 bytes, -> next) | = 0 | (2 bytes) | | = 0 |
|
+-----------------------+---------------+-------------+----- ... ------+---------+
|
byte 0 7 8 10 visible
|
Block-start record (first record of a contiguous block, rec[0]):
+-----------------------+---------------+-------------+----- ... ------+---------+
|
| del_link | del_flag | del_count | garbage | flags |
|
| (8 bytes, -> next) | = 0x02 | (uint16, N) | | = 0 |
|
+-----------------------+---------------+-------------+----- ... ------+---------+
|
byte 0 7 BLOCK_START 8 10 visible
|
Block-end record (last record of a contiguous block, rec[N-1]):
+-----------------------+---------------+-------------+----- ... ------+---------+
|
| del_link | del_flag | garbage | garbage | flags |
|
| (8 bytes, -> rec[0]) | = 0x01 | (2 bytes) | | = 0 |
|
+-----------------------+---------------+-------------+----- ... ------+---------+
|
byte 0 7 BLOCK_END 8 10 visible
|
Dark record (middle record, rec[1..N-2]):
+-----------------------+---------------+-------------+----- ... ------+---------+
|
| NULL | del_flag | 0x0000 | garbage | flags |
|
| (8 bytes, zeroed) | = 0 | (zeroed) | | = 0 |
|
+-----------------------+---------------+-------------+----- ... ------+---------+
|
byte 0 poisoned 7 8 10 visible
|
Block Structure
Free-list chain for a contiguous block of N deleted records:
share->del_link
|
|
|
v
|
rec[N-1] ----del_link-----> rec[0] ----del_link-----> (next free-list entry)
|
(BLOCK_END) (BLOCK_START, count=N)
|
|
|
rec[1] .. rec[N-2] are "dark" -- not on the free list,
|
bytes 0-10 zeroed, pos[visible] = 0
|
For single records (N=1): del_flag = 0, current behavior unchanged.
Two Traversal Directions
Free-list direction (descending addresses – allocation and blob write):
Hit rec[N-1] (block-end), read pointer to rec[0], compute block size via pointer arithmetic or read the count from rec[0] bytes 9-10. Follow rec[0]'s del_link to the next entry. Two pointer reads per block instead of N.
Scan direction (ascending addresses – heap_scan(), heap_check_heap()):
Hit rec[0] (block-start), read count from bytes 9-10, batch-skip the entire block. Same pattern as the existing blob continuation run batch-skip.
Dark Records
Records rec[1] through rec[N-2] are "dark" – not individually chained on the free list. All dark records have bytes 0-10 zeroed (del_link = NULL, del_flag = 0, count = 0) and pos[visible] = 0. This provides:
- Immediate SIGSEGV on accidental pointer dereference (NULL)
- Rapid corruption detection: any non-zero value in bytes 0-10 of a dark record indicates memory corruption
- Safety net in case the scan encounters a dark record (e.g., due to a bug): it sees pos[visible] == 0 and skips
visible Minimum Bumped to 11
Block metadata uses 11 bytes: 8 (del_link) + 1 (del_flag) + 2 (del_count). The visible minimum is bumped from sizeof(char*) (8) to 11. This has zero memory impact: recbuffer = ALIGN(visible + 1, 8). For reclength 1 through 10, recbuffer is 16 both before and after the bump. The 8-byte alignment absorbs the shift.
reclength visible_old recbuf_old visible_new recbuf_new
|
1-8 8 16 11 16 (no change)
|
9 9 16 11 16 (no change)
|
10 10 16 11 16 (no change)
|
11 11 16 11 16 (no change)
|
12+ reclength unchanged reclength unchanged (no change)
|
No conditional guards or fallback paths needed. Every table gets block coalescing unconditionally. The existing blob-specific bump (MY_MAX(visible_offset, HP_CONT_HEADER_SIZE + 1)) becomes redundant.
API
All encapsulated as static inline functions in heapdef.h.
Push Side (freeing records)
- hp_push_free_record(share, pos) – push a single record onto the free list. Sets del_link, del_flag = 0, pos[visible] = 0, adjusts counters. Replaces the 3-line pattern currently inlined in heap_delete(), heap_write() error path, and hp_free_run_chain().
- hp_push_free_block(share, first, count) – push a contiguous block. Sets up block-start, block-end, zeroes dark records. Used by hp_free_run_chain() for multi-record blob continuation runs.
Traverse Side (read-only)
Free-list direction:
- hp_is_free_block_end(pos) – check del_flag & HP_DEL_BLOCK_END
- hp_free_block_first(pos) – read pointer to first record of block
- hp_free_block_count(pos, recbuffer) – pointer arithmetic or read from first record
- hp_free_block_next(pos) – next free-list entry (via first record's del_link)
Scan direction:
- hp_is_free_block_start(pos) – check del_flag & HP_DEL_BLOCK_START
- hp_free_block_start_count(pos) – read uint16 count from bytes 9-10
Consume Side (taking records from the free list)
- hp_pop_free_record(share) – pop one record from the free-list head. If the head is a single record, pops it directly. If the head is a block-end, delegates to hp_take_free_block(share, 1).
- hp_take_free_block(share, count) – take N records from the head block (partial or entire). Three cases:
- Block consumed entirely (remaining == 0): del_link advances to next entry via rec[0]'s pointer
- Block collapses to 1 (remaining == 1, new_last == first): del_link = first. Critical: do NOT write first[0..7] – it already holds the chain to the next entry
- Block shrinks (remaining >= 2): promote dark record to new block-end. Write new_last[0..7] = first, new_last[8] = HP_DEL_BLOCK_END, del_link = new_last. Update rec[0] count.
Affected Code Paths
Push paths
- heap_delete() (hp_delete.c) – single record delete: hp_push_free_record()
- heap_write() error path (hp_write.c) – failed insert rollback: hp_push_free_record()
- hp_free_run_chain() (hp_blob.c) – blob continuation chain free: hp_push_free_block() per multi-record run, hp_push_free_record() for single-record runs
Consume paths
- next_free_record_pos() (hp_write.c) – single record allocation: hp_pop_free_record()
- hp_take_free_list_runs() (hp_blob.c) – shared helper for Steps 1 and 3 of hp_write_one_blob(), parameterized by min_avail. Dispatches to hp_take_free_block() for blocks, hp_pop_free_record() for singles. Eliminates the O(N) contiguity detection walk entirely.
Scan paths (batch-skip deleted blocks)
- heap_scan() (hp_scan.c) – on deleted record, check hp_is_free_block_start(), read count, batch-skip entire block. Same pattern as continuation run batch-skip.
- heap_check_heap() (_check.c) – block-aware free-list count and scan; validates block-start/end flags, dark record zeroing (debug), bounds on block count vs total_records + deleted
Other affected paths
- hp_shrink_tail() (hp_blob.c) – reclaims entire blocks at the tail in one step when the free-list head is a block-end at the tail boundary
Not Implemented: Coalescing on heap_delete()
Neighbor coalescing (merging a newly deleted record with adjacent deleted neighbors) was not implemented. Currently blocks are only formed by hp_free_run_chain() when freeing blob continuation chains. Individual deletes always push singles. The singly-linked free list makes mid-list block merging complex; future options include doubly-linked block list, deferred coalescing, or restricting to head-adjacent cases.
Attachments
Issue Links
- is blocked by
-
MDEV-38975 BLOBs in MEMORY (HEAP)
-
- In Testing
-
- relates to
-
MDEV-38975 BLOBs in MEMORY (HEAP)
-
- In Testing
-
-
MDEV-40032 Promote wide VARCHAR to BLOB for HEAP internal temp tables
-
- Open
-