XML

Word

Printable

Details

Type: New Feature
Status: In Testing (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: 13.1
Component/s: Data Definition - Temporary, Storage Engine - Memory
Labels:
- Preview_13.1
- contribution

PM Planning:
- FR_CONTRIBUTION
- THEME_PERFORMANCE

Description

HEAP Engine: BLOB/TEXT/JSON/GEOMETRY Column Support

Motivation

The growing adoption of JSON columns (MYSQL_TYPE_BLOB-based), Dynamic Columns (COLUMN_CREATE/COLUMN_GET), and GEOMETRY types means that an increasing number of SQL operations — GROUP BY, DISTINCT, UNION, subqueries, CTEs — materialize their intermediate results into temporary tables that contain BLOB/TEXT fields.

Today, the HEAP (MEMORY) engine unconditionally rejects BLOB/TEXT columns (HA_NO_BLOBS flag in ha_heap::table_flags()). This forces the optimizer to use Aria (or MyISAM) temporary tables whenever any blob-typed field appears in the result set — even when the actual data is small (a 50-byte JSON document, a short TEXT value, etc.). The overhead of Aria internal temporary tables includes: creating and deleting a temporary file on disk (.MAI + .MAD), and managing data through the Aria page cache. Note that Aria temp tables are non-transactional (ci->transactional= FALSE) — there is no transaction log overhead and no fsync(). Under memory pressure, dirty pages may be evicted to disk via pwrite() (a plain write to the OS page cache, not a physical sync).

This situation will only get worse as internally structured columns (JSON, Dynamic Columns, VECTOR, spatial types) gain popularity. Queries over tables with a single JSON column that would otherwise fit entirely in memory are forced to disk simply because the storage engine cannot represent the blob descriptor.

This feature removes the HA_NO_BLOBS restriction from the HEAP engine, allowing the optimizer to keep temporary tables with BLOB/TEXT columns in memory when they fit within max_heap_table_size / tmp_memory_table_size limits.

Additionally, the MEMORY engine now advertises HA_CAN_GEOMETRY, enabling explicit CREATE TABLE ... ENGINE=MEMORY with GEOMETRY/POINT/LINESTRING/POLYGON columns. Previously, GEOMETRY columns were rejected by the MEMORY engine because they are internally represented as Field_blob. With blob support in place, GEOMETRY columns work naturally — spatial data is stored in the same continuation record chains as any other BLOB/TEXT data. All spatial functions (ST_AsText(), MBRWithin(), etc.) work correctly with MEMORY-stored geometry values.

The patch also implements full HASH index key handling for BLOB columns. The HEAP hash functions (hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr()) are extended to handle HA_BLOB_PART key segments, reading actual blob data via pointer dereference or chain materialization. This enables GROUP BY, DISTINCT, and UNION deduplication to work correctly on BLOB/TEXT columns entirely within the MEMORY engine. A hash pre-check optimization skips expensive blob materialization when hash values differ, and PAD SPACE collation semantics are preserved for blob key comparisons. Monty's Field_blob_key produces HEAP-native key format directly, eliminating the need for key buffer translation between the SQL layer and HEAP engine.

Architecture

Existing HEAP Row Format

HEAP stores rows as fixed-length records in a hierarchical block tree (HP_BLOCK). Each row occupies exactly recbuffer bytes, computed as MY_ALIGN(MAX(reclength, sizeof(char*)) + 1, sizeof(char*)) — i.e. the record length (minimum sizeof(char*)) plus a flags byte, aligned up to pointer size (8 bytes on 64-bit). share->visible equals MAX(reclength, sizeof(char*)) — the byte offset of the flags byte. A single flags byte at offset share->visible marks the row as active or deleted. Deleted rows store a free-list pointer in their first sizeof(uchar*) bytes, forming a singly-linked free list through share->del_link.

BLOB Representation in the Record Buffer

In the MySQL/MariaDB record format, BLOB fields store only a descriptor in the fixed-width record buffer: a packlength-byte length prefix (1–4 bytes depending on BLOB type) followed by an 8-byte data pointer. The actual BLOB data lives in memory owned by the SQL layer (Field_blob::value). This descriptor is only 9–12 bytes regardless of BLOB data size, so it fits naturally in HEAP's fixed-width rows.

The challenge is: where does the actual BLOB data live once the row is written into the HEAP engine?

Continuation Record Chains

BLOB data is stored using continuation records — additional fixed-size records allocated from the same HP_BLOCK that holds regular rows. This approach was chosen because it:

Reuses existing HP_BLOCK allocation, free list, and size accounting with minimal structural change
Ensures data_length tracking and max_table_size enforcement work automatically (continuation records consume regular row records)
Avoids per-BLOB my_malloc() calls — piggybacks on the amortized block allocation (~1 malloc per records_in_block rows)

Flags Byte

The existing visibility byte is extended into a flags byte:

Bit	Name	Meaning
0	`HP_ROW_ACTIVE`	Record is active (not deleted)
1	`HP_ROW_HAS_CONT`	Primary record has blob continuation chain(s)
2	`HP_ROW_IS_CONT`	This record IS a continuation record
3	`HP_ROW_CONT_ZEROCOPY`	Zero-copy layout: data in rec 1..N-1 only (no data in rec 0 payload)
4	`HP_ROW_SINGLE_REC`	Single-record run, no header — data at offset 0
5	`HP_ROW_MULTIPLE_REC`	Multi-run chain — data needs reassembly into `blob_buff`

Variable-Length Continuation Runs

To minimize per-record overhead, continuation records are grouped into runs — contiguous sequences of recbuffer-sized records within the same leaf block. Only the first record of each run carries a header; subsequent records are pure payload with no flags byte. Runs are linked via next_cont pointers.

Case A single-record run (HP_ROW_SINGLE_REC flag, no header):

byte 0                                                         visible   recbuffer

  |                                                               |         |

  v                                                               v         v

  +---------------------------------------------------------------+---------+

  | BLOB data payload (visible bytes, full width)                 | flags   |

  |                                                               | byte    |

  +---------------------------------------------------------------+---------+

Case B/C first record of a multi-record run (rec 0):

byte 0                                                         visible   recbuffer

  |                                                               |         |

  v                                                               v         v

  +--------------+---------------+--------------------------------+---------+

  | next_cont    | run_rec_count | BLOB data payload              | flags   |

  | (8 bytes)    | (2 bytes)     | (visible - 10 bytes)           | byte    |

  +--------------+---------------+--------------------------------+---------+

Inner records (rec 1..N-1) — no flags byte:

byte 0                                                                    recbuffer

  |                                                                         |

  v                                                                         v

  +-------------------------------------------------------------------------+

  | BLOB data payload (recbuffer bytes, full width)                         |

  +-------------------------------------------------------------------------+

next_cont (8 bytes): pointer to the next run, or NULL if last. Shares position with del_link — when a record is active it's next_cont; when deleted it's del_link. No conflict. Not present in Case A.
run_rec_count (uint16, 2 bytes): number of consecutive records in this run. Maximum 65,535 records per run — blobs requiring more are split into multiple runs automatically. Not present in Case A.
Payload per Case A record: visible bytes.
Payload per Case B/C run of N records: (visible - 10) + (N - 1) × recbuffer bytes.

For multi-record runs, this design amortizes the 10-byte per-run header overhead across many records, achieving near-100% space efficiency. For single-record runs (Case A), the header is eliminated entirely:

recbuffer	visible	Case A (1 rec, no header)	Case B/C 10-record	Case B/C 100-record
16	15	15 bytes (100%)	5 + 9×16 = 149 bytes (93%)	5 + 99×16 = 1,589 bytes (99%)
504	500	500 bytes (100%)	490 + 9×504 = 5,026 bytes (99%)	490 + 99×504 = 49,886 bytes (99.9%)

Blob Storage Format

All write, read, free, scan, and integrity-check paths detect the blob storage format by testing flag bits on the continuation record's flags byte via inline predicates (hp_is_single_rec(), hp_is_zerocopy(), hp_is_multi_run()). Three formats:

Case	Predicate	Flag	Layout	Read
A	`hp_is_single_rec()`	`HP_ROW_SINGLE_REC`	Single record, no header. Data at offset 0, full `visible` bytes.	Zero-copy: `chain`.
B	`hp_is_zerocopy()`	`HP_ROW_CONT_ZEROCOPY`	Single run, multiple records. Header in rec 0, data contiguous in rec 1..N-1. Inner records have no flags byte.	Zero-copy: `chain + recbuffer`.
C	`hp_is_multi_run()`	`HP_ROW_MULTIPLE_REC`	One or more runs linked via `next_cont`. Header + data in each run's rec 0, plus inner records.	Reassembly into `blob_buff`.

The HP_ROW_CONT_ZEROCOPY flag on rec 0's flags byte marks Case B as eligible for zero-copy reads. When set, it signals that rec 0 carries no data payload — all data is contiguous in rec 1..N-1, so the read path returns a pointer directly into the HP_BLOCK (chain + recbuffer) with no reassembly. Case C is identified by its own explicit HP_ROW_MULTIPLE_REC flag. The HP_ROW_CONT_ZEROCOPY flag is set during write when: (1) the blob fits in a single run, (2) the run has multiple records, and (3) the run is allocated entirely from the block tail (not the free list).

For Case B, rec 0's data payload area is unused (wasted) — all data goes into rec 1..N-1. This trade-off wastes at most visible - 10 bytes (typically 5–490 bytes) but enables zero-copy reads for the entire blob without any memcpy into blob_buff.

Run Allocation Strategy

Runs are allocated in two phases:

Free list peek-then-unlink: Walk the free list read-only (following next pointers without unlinking), tracking contiguous groups of adjacent records. Only descending-address adjacency is checked — hp_free_run_chain() frees records in ascending order, so LIFO pushes create descending sequences on the free list. On discontinuity: if the accumulated group qualifies (>= min_run_records), unlink exactly those records and use them for a run, then continue scanning for more data. If the group does not qualify, the free list is fragmented — stop immediately and fall through to tail allocation. The free list is only modified when a qualifying contiguous group is actually found. Since run_rec_count is uint16, individual runs are capped at 65,535 records; larger contiguous groups are automatically split into multiple runs at this boundary.
The min_run_records threshold prevents excessive chain fragmentation for large blobs while still allowing small blobs to reuse free-list slots efficiently. It is computed as: ceil(MAX(blob_length / 10, 128) / recbuffer), with a floor of 2 records. However, both min_run_bytes and min_run_records are capped at the blob's actual size: min_run_bytes is capped at data_len, and min_run_records is capped at total_records_needed (the total records the blob requires). This prevents the 128-byte floor from making free-list reuse impossible for small blobs on narrow tables — e.g. with recbuffer=16, the floor would produce min_run_records=8, but a 32-byte blob only needs 3 records, so the cap lowers the threshold to 3.
Tail allocation: Remaining data is allocated from the block tail (hp_alloc_from_tail()), which is always contiguous within a leaf block. A new run starts at each block boundary. Tail-allocated single-run blobs use Case B (multiple records, zero-copy) when the data fits; otherwise they fall back to Case C. Case A (single record, no header) is only allocated from the block tail. Free-list allocations always use Case C format (with header) even for single-record runs, because the free-list path uses a uniform write routine (hp_unlink_and_write_run()) that always applies HP_ROW_MULTIPLE_REC.

This strategy exploits the fact that when a blob row is deleted, hp_free_blobs() pushes continuation records to the free list consecutively. When re-allocated, they naturally appear as contiguous sequences in the free list (LIFO order). Fragmentation only occurs when non-blob inserts consume some freed records in between.

Multiple BLOB Columns

Each BLOB column has its own independent continuation chain. The primary row stores the standard (packlength, data_pointer) descriptor for each blob field. On write, the data pointer is overwritten in the stored row to point to the first continuation run. On read, each chain is walked independently. Each blob column is independently classified as Case A, B, or C — some may use zero-copy while others require reassembly into blob_buff.

Record Counting

A new counter HP_SHARE::total_records tracks all physical records (primary + continuation). HP_SHARE::records remains the logical (primary-only) count. This separation is critical because HEAP's linear hash uses records for bucket mapping — inflating it with continuation records would corrupt the hash table.

Counter	Meaning	Used by
`share->records`	Logical (primary) record count	Hash bucket mapping, `blength`, statistics, optimizer
`share->total_records`	All physical records (primary + continuation)	Block position arithmetic, scan boundary, `heap_check_heap()`
`share->deleted`	Deleted records on free list (both types)	Free list, both counters

Invariant: total_records = records + continuation_count. For non-blob tables, total_records == records always.

Reassembly Buffer

On read, blob data that requires reassembly (Case C) is copied from continuation chains into a per-handle buffer (HP_INFO::blob_buff). This follows the same pattern as InnoDB's blob_heap and MyISAM's rec_buff: a single buffer holds all blob data for the current row, grown via my_realloc() as needed, and freed on heap_reset() / HA_EXTRA_NORMAL / heap_close().

For zero-copy cases (A and B), blob field pointers are set directly into the HP_BLOCK continuation records — no blob_buff allocation or copy is needed. The HP_INFO::has_zerocopy_blobs flag tracks whether the last hp_read_blobs() call produced any zero-copy pointers. This flag is used by heap_update() to refresh the caller's record buffer after an update frees old continuation chains — without this, zero-copy pointers in the caller's buffer would become dangling pointers into freed records.

BLOB field pointers in record[0] are rewritten to point into either blob_buff (Case C) or directly into HP_BLOCK records (Cases A/B) after reassembly. The buffer/pointers remain valid until the next row fetch — the SQL layer's Field_blob::val_str() dereferences these pointers with no additional copy.

Zero-copy and UPDATE safety: heap_update() uses a write-before-free strategy — new blob chains are written before old ones are freed, so old data survives until the new data is successfully in place. After freeing old chains, heap_update() checks has_zerocopy_blobs on the current handle and refreshes the caller's record buffer with new chain pointers via hp_read_blobs(), preventing dangling zero-copy pointers. Under HEAP's current table-level locking (thr_lock), other handles are blocked by the table lock during UPDATE, so they cannot hold active zero-copy pointers into the freed chains. For internal temporary tables (single handle, single thread), this is definitively safe. If HEAP ever gains row-level or concurrent access in the future, a create option to disable zero-copy (forcing all blobs through blob_buff reassembly) may be needed to prevent cross-handle dangling pointers.

Changes

HEAP Engine Core (`storage/heap/`)

New file: hp_blob.c (~990 lines)

hp_write_blobs() — writes all blob columns for a row into continuation chains. On failure, rolls back any partially written chains. Supports all three formats: Case A (single record, no header), Case B (single run, zero-copy), and Case C (multi-run, with headers).
hp_read_blobs() — two-pass approach: first pass sums data lengths for blobs needing copy (Case C always, Cases A/B when share->write_can_replace forces copy mode) to size blob_buff; second pass sets zero-copy pointers (Cases A/B, normal mode) or copies into blob_buff (Case C always, Cases A/B during REPLACE). Sets has_zerocopy_blobs flag (only when zero-copy pointers are actually used).
hp_free_blobs() — frees all continuation chains for a row. Skips zero-length blobs (avoids chasing NULL chain pointer). Calls hp_shrink_tail() after freeing to reclaim tail-positioned records.
hp_shrink_tail() — reclaims deleted records at the tail of the HP_BLOCK tree by popping them from the delete list and decrementing block.last_allocated. Exposed in heapdef.h (was static) for use by hp_free_blobs() and heap_update() immediate-free paths.
hp_flush_pending_blob_free_impl() — flushes deferred blob chain free from a previous heap_delete(). Iterates saved chain pointers, calls hp_free_run_chain() for each, then hp_shrink_tail() to reclaim tail records.
hp_free_run_chain() — frees a single continuation chain (individual records returned to free list).
hp_materialize_one_blob() — materializes a single blob column from a stored continuation chain. Returns zero-copy pointer for Cases A/B, or reassembles into key_blob_buff for Case C. Uses a separate buffer from blob_buff to avoid overwriting blob data that hp_read_blobs() placed there and that record pointers still reference. Used by hash comparison functions when comparing stored records.
hp_alloc_from_tail() — allocates a single record from the HP_BLOCK tail, bypassing the free list. Used for guaranteed-contiguous allocation.
hp_unlink_and_write_run() — unlinks a contiguous group of records from the free list and writes blob data into them.
Internal helpers for run data writing.
Blob pointer NULL writes use bzero() instead of pointer store to avoid unaligned access on strict-alignment platforms.

hp_create.c

Blob descriptor array (HP_BLOB_DESC) allocated alongside HP_SHARE in a single allocation.
visible_offset minimum enforced for blob tables to ensure at least 1 byte of payload beyond the continuation header (prevents underflow for pathological single-TINYBLOB tables with reclength as low as 9).
Key segment setup: HA_BLOB_PART flag validated against blob_descs[] array — spurious flags (from uninitialized key_part_flag in SJ weedout tables, or BLOB_FLAG on non-Field_blob INFORMATION_SCHEMA temp tables) are stripped. Real blob segments get bit_start set to actual packlength (1–4) instead of the generic VARCHAR bit_start=2.
Key length calculation accounts for blob segments: 4 + sizeof(uchar*) (length + data pointer) in the key buffer.
Key type normalization: VARBINARY4 segments are normalized to VARTEXT4 for consistent handling.

hp_write.c

heap_write() flushes deferred blob chains (hp_flush_pending_blob_free()), then calls hp_write_blobs() after memcpy. Zero-copy corruption during REPLACE is prevented by HP_SHARE::write_can_replace flag which forces copy mode in hp_read_blobs() (MDEV-39825). New err_common: label for correct key rollback when blob write fails (all keys already written — must iterate all keydef entries, not skip the current one). Neither err: nor err_common: calls hp_free_blobs(): err: is reached before memcpy/hp_write_blobs() (stale data in pos); err_common: is reached after hp_write_blobs() fails, but hp_write_blobs() is self-cleaning (frees its own partial chains, NULLs all pointers). Calling hp_free_blobs() from either error path would crash because the flags byte is uninitialized and blob columns after the failed one still have SQL layer data pointers (not chain pointers).
next_free_record_pos() uses block.last_allocated instead of share->records for position arithmetic (since records no longer equals total physical records). Made non-static for use by hp_blob.c.

hp_delete.c

Deferred blob chain free (~~MDEV-39732~~): instead of calling hp_free_blobs() immediately (which would overwrite blob data in continuation records with free-list pointers before binlog_log_row() reads the record buffer), heap_delete() splits by table type. Internal temporary tables (share->internal, never binlogged) free chains immediately via hp_free_blobs(). User-created MEMORY tables defer: chain head pointers are saved into HP_INFO::pending_blob_chains and has_pending_blob_free is set. The saved chains are freed on the next mutating operation (write/update/delete) or on heap_reset()/close via hp_flush_pending_blob_free(). Chain records stay allocated (counted in total_records) until the flush, maintaining the scan boundary invariant.
total_records decremented on delete (for the primary record only; chain records stay in total_records until flush).
NULL chain pointer slot for unchanged zero-length blobs (prevents stale SQL-layer pointers from being misread as chain heads).

hp_update.c

Flushes deferred blob chains at entry.
Per-column unchanged blob detection: before rewriting chains, each blob column's old and new values are compared using length (O(1)), then data pointer (O(1)), then memcmp (O(n) with early exit). Unchanged blobs keep their existing continuation chains with no allocation, copy, or free. Only changed blobs get new chains written. This avoids unnecessary chain churn for UPDATE of non-blob columns, INSERT ... ON DUPLICATE KEY UPDATE with unchanged blob values, and REPLACE with identical blob data.
Write-before-free strategy for changed blobs: new chains are written before old ones are freed. On failure, old record and chain pointers are restored. This ensures crash-safety on OOM/table-full during blob grow.
Old chain free split by table type (~~MDEV-39782~~): internal temporary tables free chains immediately via hp_free_run_chain() + hp_shrink_tail(); user-created MEMORY tables defer via pending_blob_chains, matching the DELETE path. This prevents dangling zero-copy pointers in the before-image record (record[1]) used by binlog_log_row() after update_row() returns.
hp_read_blobs() refresh after update is gated on share->internal: user-created tables skip the refresh since the SQL layer re-materializes on the next fetch, avoiding blob_buff reallocation that could invalidate Case C materialized pointers in record[1] (the before-image).
hp_shrink_tail() called after both the internal-table immediate-free path and the rollback path (free of partially-written new chains) to reclaim tail-positioned records.
After update, when has_zerocopy_blobs is set or any blob changed (internal tables only), refreshes the caller's record buffer by copying chain pointers from the stored row and calling hp_read_blobs(). This prevents dangling zero-copy pointers from pointing into freed old chains (e.g. in the INTERSECT ALL unfold path where the caller reuses the record buffer after update).

hp_scan.c

Continuation records (HP_ROW_IS_CONT) are skipped during table scans. Rec 0 of each run has the flags byte; inner records have no flags byte. The scan reads run_rec_count from the header and batch-skips the entire run (returned as HA_ERR_RECORD_DELETED to the caller, same as actual deleted records).
Scan boundary uses total_records + deleted instead of records + deleted.

hp_hash.c — hash function and comparison for blob key segments

hp_rec_hashnr() — for HA_BLOB_PART segments: reads blob length using actual packlength, dereferences data pointer, hashes actual blob data. Input records always have valid blob pointers.
hp_rec_key_cmp() — blob segment comparison with materialization support. Takes new HP_INFO *info parameter. When info is non-NULL, the second record is treated as stored (blob pointers are continuation chain heads) and materialized via hp_materialize_one_blob(). When info is NULL (e.g. heap_update() comparing two input records), both sides are dereferenced directly. Length-based early exits are gated by MY_CS_NOPAD to preserve PAD SPACE collation semantics where trailing spaces are insignificant (e.g. 'a' and 'a ' compare equal).
hp_key_cmp() — blob segment comparison between stored record and pre-built search key. Stored side is materialized from chain using seg->bit_start (actual packlength) to read the length prefix — not hardcoded packlength=4 (~~MDEV-39723~~, ~~MDEV-39724~~). Key side has 4-byte normalized length + data pointer from hp_make_key(). PAD SPACE semantics handled the same way as hp_rec_key_cmp().
hp_make_key() — for blob segments: stores 4-byte length + data pointer (not inline data) in the key buffer. VARCHAR segments are normalized to always use a 2-byte length prefix in the key buffer (regardless of the record's native 1-byte or 2-byte prefix), ensuring consistency with hp_hashnr() and hp_key_cmp() which always read 2 bytes. Zero-padded to seg->length for sanitizer cleanliness.
hp_hashnr() — pre-built key hashing: blob segments encoded as 4-byte length + data pointer.
NULL handling: key advancement accounts for blob segment size (4 + sizeof(uchar*)) vs varchar (2) when skipping NULL key parts.
Hash pre-check optimization: hp_search(), hp_search_next(), hp_delete_key(), and find_unique_row() now compare HASH_INFO::hash_of_key against the search key's hash before calling the full key comparison. When hashes differ the keys are guaranteed different, skipping expensive hp_materialize_one_blob() calls. This pattern already existed in hp_write_key() for duplicate detection but was missing from the read/delete paths. HP_INFO::last_hash_of_key is added so hp_search_next() can reuse the hash computed by hp_search() without recomputing it.

hp_rrnd.c, hp_rfirst.c, hp_rlast.c, hp_rnext.c, hp_rprev.c, hp_rkey.c, hp_rsame.c

hp_read_blobs() call added after memcpy(record, pos, reclength) in every read path, guarded by share->blob_count.

hp_clear.c — total_records and block.last_allocated reset on TRUNCATE. has_pending_blob_free invalidated without flushing (the HP_BLOCK tree is already destroyed by TRUNCATE, so the saved chain pointers are stale).

hp_close.c — Flushes deferred blob chains only when the share is being destroyed (last handle close with delete_on_close). For shared tables, the flush already happened in external_lock(F_UNLCK). Frees pending_blob_chains, blob_buff, and key_blob_buff on close.

hp_extra.c — heap_extra() handles HA_EXTRA_WRITE_CAN_REPLACE / HA_EXTRA_WRITE_CANNOT_REPLACE to set/clear HP_SHARE::write_can_replace, disabling zero-copy blob reads during REPLACE (MDEV-39825). heap_reset() flushes deferred blob chains, frees blob_buff and key_blob_buff. The write_can_replace flag is NOT cleared by heap_reset() because ha_reset() runs after the table lock is released, which would race with another connection's REPLACE; flag lifecycle is managed by the bracketed HA_EXTRA_WRITE_CAN_REPLACE / HA_EXTRA_WRITE_CANNOT_REPLACE calls under the write lock.

hp_open.c — heap_open_from_share() allocates pending_blob_chains array (blob_count * sizeof(uchar*)) for blob-capable tables.

hp_static.c — hp_key_memory_HP_BLOB PSI memory instrumentation key added.

_check.c — heap_check_heap() updated to iterate using total_records + deleted, count continuation records separately (batch-skipping inner records via run_rec_count), and validate records + cont_count == total_records.

Handler Interface (`ha_heap.h` / `ha_heap.cc`)

HA_NO_BLOBS removed and HA_CAN_GEOMETRY added to ha_heap::table_flags(). GEOMETRY columns (which use Field_blob internally) are now supported in MEMORY tables.
ha_heap::external_lock(F_UNLCK) calls hp_flush_pending_blob_free() at statement end, before thr_lock release. This prevents a race condition where T1 ends a statement (deferred blob chains still pending), T2 acquires the lock and does a write (allocating from del_link), then T1's close/reset flushes the stale pending chains — both threads modify the same delete list concurrently, corrupting it.
heap_prepare_hp_create_info() populates HP_BLOB_DESC array from TABLE_SHARE::blob_fields, filtering non-Field_blob types (e.g. INFORMATION_SCHEMA Field_string with BLOB_FLAG). Both MYSQL_TYPE_BLOB and MYSQL_TYPE_GEOMETRY are included — Field_geom extends Field_blob but returns a different field_type(). Key segment setup uses key_part->key_part_flag with a DBUG_ASSERT verifying consistency against field->key_part_flag(). Blob key segments get seg->length = 4 + sizeof(uchar*) and seg->bit_start set to the field's actual packlength.
remember_rnd_pos() / restart_rnd_next() implemented. Required by remove_dup_with_compare() for DISTINCT deduplication with blobs. Saves and restores both the row position and current_record counter (which heap_rrnd() does not update). restart_rnd_next() caps next_block at total_records + deleted to prevent heap_scan() from reading past the last allocated slot — without this cap, the scan boundary can exceed the allocated HP_BLOCK range when the saved position is in the last block segment, causing SIGSEGV.
find_unique_row() — materialize-then-compare optimization for blob tables. Saves input record, copies stored candidate, calls hp_read_blobs() once to materialize all blobs, then compares via hp_rec_key_cmp() with info=NULL (both records have direct data pointers). This avoids the double chain-walk that occurred when hp_rec_key_cmp() materialized blobs individually during comparison and then hp_read_blobs() re-walked the same chains. Non-blob tables keep the original fast path.
blob_descs memory freed in open() and create() after heap_create().

Data Structures (`include/heap.h` / `heapdef.h`)

HP_BLOB_DESC struct: offset (byte offset within record buffer) + packlength (1–4).
HP_SHARE: added blob_descs, blob_count, total_records, write_can_replace (disables zero-copy reads during REPLACE, MDEV-39825).
HP_KEYDEF: no new fields (blob segment detection is done inline via HA_BLOB_PART flag checks on key segments).
HP_INFO: added blob_buff, blob_buff_len, key_blob_buff, key_blob_buff_len (separate buffer for hp_materialize_one_blob() key comparison – avoids overwriting blob_buff data that record pointers still reference), has_zerocopy_blobs, last_hash_of_key (cached hash for hp_search_next() reuse), pending_blob_chains (saved chain pointers for deferred delete), has_pending_blob_free (flag for pending deferred free).
HP_CREATE_INFO: added blob_descs, blob_count.
Constants: HP_ROW_ACTIVE, HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY, HP_ROW_SINGLE_REC, HP_ROW_MULTIPLE_REC, HP_CONT_HEADER_SIZE, HP_CONT_REC_COUNT_SIZE, HP_CONT_MIN_RUN_BYTES, HP_CONT_RUN_FRACTION_NUM/DEN.
Predicates: hp_is_active(), hp_has_cont(), hp_is_cont(), hp_is_single_rec(), hp_is_zerocopy(), hp_is_multi_run().

`Field_blob_key` — HEAP-Native Blob Key Format (`sql/field.h` / `sql/field.cc`)

Monty's Field_blob_key (commit 30415846402) is a Field_blob subclass for GROUP BY and DISTINCT keys on internal temporary tables. It produces the same key format that HEAP's hash functions (hp_make_key(), hp_hashnr(), hp_key_cmp()) expect natively: a 4-byte length prefix followed by an 8-byte data pointer. This eliminates the key format mismatch that previously required translation between the SQL layer and HEAP engine.

Key properties:

key_pack_length() returns 4 + sizeof(char*) (12 on 64-bit) — the HEAP-native blob key size.
key_length() returns 4 + portable_sizeof_char_ptr — matching seg->length in HA_KEYSEG.
key_type() returns HA_KEYTYPE_VARBINARY4 / HA_KEYTYPE_VARTEXT4 — the 4-byte-prefix variants.
key_part_length_bytes() returns 0 — no separate length prefix in the key buffer (the full 4B+pointer IS the key).
new_key_field() returns another Field_blob_key (not Field_varstring as plain Field_blob does).
key_cmp() compares blob data via the 4B length + pointer encoding directly.
set_key_image() stores blob data via store_length() + pointer memcpy.

Creation: Field_blob::make_new_field() creates Field_blob_key when param->part_of_unique_key() is true — this is the case for GROUP BY keys, DISTINCT keys, and semi-join materialization keys. The ~~MDEV-16699~~ group_concat downgrade (Field_geom → plain Field_blob) takes priority and is checked first. For UNION type resolution, Item_type_holder::create_tmp_field_ex() detects blob types via dynamic_cast<Type_handler_blob_common*> and routes them through type_handler_for_tmp_table() to get blob_key_type_handler() dispatch; pack length is recovered via blob_type_handler(max_length, NULL)->length_bytes() (passing NULL to skip the part_of_unique_key dispatch and get the real blob size class).

Impact: With Field_blob_key, copy_funcs() in end_update() stores expression results directly into the blob field via Item_result_field::save_in_field(), and hp_make_key() reads the data from record[0] in the correct format. This eliminates the need for: (a) a has_blob_seg flag on HP_KEYDEF, (b) a group buffer restoration step after copy_funcs(), and (c) a key-format translation layer. The group buffer restore block in end_update() is retained but disabled (&& 0 guard) as dead code.

SQL Layer (`sql/sql_select.cc`)

choose_engine() / pick_engine() (temp table engine selection):

Removed share->blob_fields from the condition that forces disk-based engine.
(The earlier version also added share->reclength > HA_MAX_REC_LENGTH, but this was removed: HEAP has no internal reclength limit – hp_create.c stores uint reclength (32-bit) and allocates blocks of that size. The upstream choose_engine() never had a reclength check; max_supported_record_length() is only checked in unireg.cc during user-facing CREATE TABLE, never for internal temp tables.)

finalize() (temp table creation):

HEAP tables with blobs still use fixed-width rows (blob data lives in separate continuation records, not inline). The first-byte guard (ensuring byte 0 is non-zero for the delete-link mechanism) is applied to HEAP tables regardless of blob presence.
Packed row format (use_packed_rows) is skipped for HEAP tables — HEAP requires fixed-width rows.
GROUP BY key setup: m_key_part_info->key_part_flag is now set from field->key_part_flag() instead of hardcoded 0 — without HA_BLOB_PART, HEAP never enters blob-specific key handling code paths.
GROUP BY on blob columns: Field_blob::key_length() returns 0 (blobs use prefix indexing) and Field_geom::key_length() returns packlength (4), both too small to hold actual data. The GROUP BY key setup now uses the item's max_length (capped to MAX_BLOB_WIDTH) instead, which matches the buffer size allocated by calc_group_buffer(). A group buffer overflow check is added so that blobs that don't fit fall back to m_using_unique_constraint. key_part_flag is set AFTER the overflow check so that a break-out retains the original varchar flag, ensuring correct key format handling in the HEAP hash functions.
store_length initialized for all GROUP BY key parts — used by create_ref_for_key() for key buffer advancement and by HEAP-to-Aria conversion for key setup. Previously uninitialized because nothing read it before.
DISTINCT key setup: HEAP handles NULLs per-segment in its hash index, so the extra null-bits helper key part (used by MyISAM/Aria for unique blob constraints) is skipped for HEAP. share->uniques is still set so that HEAP-to-disk conversion can build MARIA_UNIQUEDEF. The null-bits helper field uses empty_clex_str instead of null_clex_str to prevent SIGSEGV in Explain_index_use::set() which calls strlen() on the field name.

end_update() (GROUP BY update path):

With Monty's Field_blob_key approach, copy_funcs() correctly stores expression results directly into the blob field via Item_result_field::save_in_field(), so the blob group buffer restore is not needed (disabled with && 0 guard). The code is retained for potential future use if Field_blob_key is ever removed.

calc_group_buffer():

Fixed blob subtype detection: only checked MYSQL_TYPE_BLOB, missing TINY_BLOB, MEDIUM_BLOB, LONG_BLOB, and GEOMETRY. Fixed with is_any_blob_field_type() which includes all five types.

start() / pick_engine() (early engine prediction):

Extracted pick_engine() from choose_engine() and calls it early in start() to set m_heap_expected. When HEAP is predicted, the too_big_for_varchar() and group_length >= MAX_BLOB_WIDTH guards that would force m_using_unique_constraint are skipped — HEAP natively supports blob keys via its hash index, so these bail-outs are unnecessary and would prevent correct GROUP BY/DISTINCT on blob columns.

remove_duplicates() (DISTINCT deduplication):

remove_dup_with_hash_index() cannot handle blobs (copies field data via make_sort_key() + memcmp, which compares pointer values for blob fields). The blob check is moved before the HEAP engine check so HEAP+blob tables fall through to remove_dup_with_compare().

LIMIT adjustment with fanout:

Fixed undefined behavior when select_limit/fanout overflows. With HEAP blob support, the optimizer's temp table engine choice changes, altering cost estimates and sometimes producing near-zero fanout values (from tiny cond_selectivity). Dividing select_limit by such a fanout produces infinity or a value exceeding HA_POS_ERROR; casting that to ha_rows (unsigned long long) is undefined behavior per the C++ standard. Added an explicit cap: if select_limit/fanout >= (double) HA_POS_ERROR, set select_limit = HA_POS_ERROR.

SQL Layer (`sql/item_sum.cc`)

Aggregator_distinct::setup():

The Unique tree path (which compares raw record bytes via simple_raw_key_cmp / composite_key_cmp) is skipped for HEAP tables with blobs, since raw comparison would compare pointer values rather than actual blob data. HEAP+blob tables use the ha_write_tmp_row() path instead.

Aggregator_distinct::add():

Added create_internal_tmp_table_from_heap() conversion when ha_write_tmp_row() fails with a non-duplicate error (e.g. HA_ERR_RECORD_FILE_FULL). Previously, only the Unique tree path handled HEAP overflow conversion. With blobs routed through ha_write_tmp_row(), the overflow-to-disk conversion must happen here too.

SQL Layer — Geometry GROUP_CONCAT Fix (MDEV-39761)

Field_geom::store() asserts !table->blob_storage, but GROUP_CONCAT with ORDER BY or DISTINCT sets blob_storage on its internal temp table. With HEAP blob support, geometry expressions (IF, CASE, COALESCE, DEFAULT, etc.) as GROUP_CONCAT arguments could create Field_geom directly, bypassing the existing downgrade in Field_blob::make_new_field() that converts Field_geom to plain Field_blob for GROUP_CONCAT temp tables.

Fix: downgrade geometry to Field_blob (via type_handler_long_blob) in both field creation paths that expressions can take:

create_tmp_field_ex_from_handler() (sql/sql_select.cc) for Item_result_field items
tmp_table_field_from_field_type() (sql/item.h) for Item_default_value and other callers that bypass the handler path

Additionally, Type_handler_geometry::type_handler_for_tmp_table() is added as a virtual method so that the geometry-to-blob downgrade is handled at the type handler level, removing the need for explicit MYSQL_TYPE_GEOMETRY testing in Item::tmp_table_field_from_field_type(). The param argument is ensured to always be set for type_handler_for_tmp_table() calls.

SQL Layer — Geometry GROUP BY Key Format Fix (MDEV-39871)

Field_geom inherits Field_blob::new_key_field() which returns Field_varstring (2-byte length + inline data). For GROUP BY, finalize() stores GROUP BY values via this key field into group_buff. HEAP's heap_prepare_hp_create_info() promotes blob key segments from HA_KEYTYPE_VARBINARY2 to HA_KEYTYPE_VARTEXT4 (4-byte length + pointer format). When end_update() passes the group buffer to hp_hashnr(), it misreads the inline WKB geometry bytes as a memory pointer, causing SIGSEGV.

This only affects Field_geom because its temp table creation path (Type_handler_geometry::make_table_field_ex()) creates Field_geom directly rather than going through Field_blob::make_new_field() which would create Field_blob_key when part_of_unique_key() is true. Plain Field_blob subclasses already produce Field_blob_key in make_new_field().

Fix: in Create_tmp_table::finalize(), after new_key_field() returns, detect when a blob field produced a non-blob Field_varstring and replace it with Field_blob_key that stores in the correct 4-byte length + pointer format. Update KEY_PART_INFO::length and type to match the replacement field so heap_prepare_hp_create_info() takes the direct VARTEXT4 path (no silent normalization, no HA_NO_KEY_READ). {{DBUG_ASSERT}}s are added in both the GROUP BY and DISTINCT key setup paths to verify blob key segment consistency.

SQL Layer — Expression Cache (`sql/sql_expression_cache.cc`)

Expression_cache_tmptable::init():

Disables the expression cache when the HEAP temp table has blob fields. HEAP hash indexes on blob columns use a pointer-based key format (4-byte length + data pointer from hp_make_key()) that is incompatible with the SQL layer's key format (2-byte length + inline data from Field_blob::new_key_field() which returns Field_varstring). This restores the pre-blob-support behavior: previously blob items forced conversion to Aria, which failed the db_type() == heap_hton check and disabled the cache anyway. No performance impact.

SQL Layer — FULLTEXT Early Detection (`sql/sql_derived.cc`)

mysql_derived_prepare():

Forces a disk-based tmp engine when the outer query uses FULLTEXT functions. Before the derived table's result table is created, checks derived->select_lex->ftfunc_list->elements and sets TMP_TABLE_FORCE_MYISAM in the create options if FULLTEXT is present. This ensures the table is created with Aria/MyISAM from the start, avoiding the need for a handler swap later. Follows the same pattern as st_select_lex_unit::prepare() in sql_union.cc for UNION tmp tables. Item_func_match::fix_fields() retains only a safety-net error (ER_TABLE_CANT_HANDLE_FT) for engines that lack FULLTEXT support.

Performance Impact

Non-Blob Tables: Zero Regression

The most important design goal was no measurable impact on existing workloads. For tables without BLOB/TEXT columns:

No new allocations: blob_buff, blob_descs, and continuation chains are never allocated. blob_count is 0, total_records tracks records identically, and blob_buff is never touched.
No new code in hot paths: Every blob-specific code path is guarded by if (share->blob_count). This is a single branch on a struct member that lives in the same cache line as other frequently accessed fields (reclength, visible, records). The branch predictor learns always-false within a few iterations — the branch itself is effectively free.
No changes to row format: The flags byte at offset visible continues to be written as 1 (active) or 0 (deleted) exactly as before. The new flag bits (HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY, HP_ROW_SINGLE_REC, HP_ROW_MULTIPLE_REC) are never set.
No changes to hash functions for non-blob keys: The new HA_BLOB_PART branches in hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), and hp_make_key() are only entered for key segments with the HA_BLOB_PART flag set — which never happens for non-blob tables.
No change to next_free_record_pos() semantics: The switch from share->records to block.last_allocated for position arithmetic is equivalent when there are no continuation records (both values track the same quantity).

In summary, for non-blob tables, the entire feature reduces to a handful of never-taken branches per row operation. There is no additional memory usage, no additional function calls, and no change to the data layout.

Blob Tables: Performance Characteristics

For tables with BLOB/TEXT columns, the primary benefit is avoiding disk I/O entirely. The tradeoff:

Write path: Each blob column requires allocating continuation records and copying blob data into them. The allocation is amortized (HP_BLOCK allocates leaf blocks of ~1000 records at a time), so the per-row overhead is dominated by the memcpy of blob data — which is unavoidable regardless of storage approach. For the common case of small blobs (JSON documents, short TEXT values under ~150 KB), this is a single contiguous run with one memcpy. Single-run blobs from tail allocation use the zero-copy layout (Cases A/B), enabling the read path to avoid copying entirely.
Read path: For single-run blobs (Cases A/B — the common case in temporary tables), the read path returns a pointer directly into HP_BLOCK with no memcpy at all. For multi-run blobs (Case C), data is reassembled from continuation chains into blob_buff via memcpy — the same pattern as InnoDB's blob_heap. The reassembly buffer is allocated once and reused for every row. The reassembly cost is proportional to blob data size, not to the number of runs (runs within a leaf block are contiguous, so the memcpy loop processes large chunks).
Hash operations: When DISTINCT or UNION deduplication requires hashing/comparing blob key segments, the stored blob must be materialized from its continuation chain before comparison. For zero-copy blobs (Cases A/B), materialization returns a direct pointer with no copy. For Case C, one reassembly is needed per hash probe for the stored side. The input side (the record being inserted) always has a valid direct pointer and requires no materialization. In practice, hash collisions are rare, so most insertions require only hashing (no comparison).
Memory: For Case C blobs, data exists in continuation records (persistent) plus the reassembly buffer (transient, per-handle, freed on heap_reset()). This is a 2× peak for the currently-read row's blob data — identical to InnoDB's approach. For Cases A/B (zero-copy), no reassembly buffer is needed — 1× memory only. The reassembly buffer is not allocated for non-blob tables, and is freed between statement executions to avoid holding large buffers across queries.
Free list reuse: When blob rows are deleted and reinserted, the peek-then-unlink algorithm walks the free list read-only, detecting contiguous groups of adjacent records (descending address order only — matching LIFO push order). On the first non-qualifying (too short) group, the scan stops entirely and falls through to tail allocation — no wasted time scanning a fragmented list. Only qualifying groups are actually unlinked. Since deleted records are pushed consecutively (LIFO), they naturally appear as contiguous sequences, so recovery has high probability. Individual runs are capped at 65,535 records (uint16 format limit); larger groups are split into multiple runs at this boundary.

Net Effect on Temporary Table Workloads

For the target use case — implicit/internal temporary tables created by GROUP BY, DISTINCT, UNION, and subqueries — the expected net effect is a significant performance improvement whenever blob-typed columns are present. Instead of:

SQL layer ==> Aria temp table ==> page cache management ==> (eviction to OS under memory pressure)

the data path becomes:

SQL layer ==> HEAP temp table ==> memcpy into HP_BLOCK ==> zero-copy pointer out (Cases A/B)

                                                       ==> memcpy out of HP_BLOCK (Case C)

This eliminates file creation/deletion overhead and page cache management. For the common case of small-to-medium blobs fitting in a single run, the read path is entirely zero-copy — no blob_buff allocation, no memcpy. The max_heap_table_size / tmp_memory_table_size limits continue to act as a safety valve: when a temporary table with blobs exceeds the memory limit, the existing overflow-to-disk conversion (create_internal_tmp_table_from_heap()) triggers transparently, falling back to Aria exactly as it does today for non-blob tables that outgrow memory.

Limitations

BTREE indexes on blob columns are not supported — BTREE key packing (hp_rb_pack_key / hp_rb_var_key_length) is not updated for blob segments. HASH indexes on blob columns work correctly. In practice this is not a limitation for internal temporary tables, which exclusively use HASH indexes for deduplication.
No partial-key prefix indexing for blobs — the entire blob value is hashed/compared. This matches the behavior of other engines for HASH indexes and is correct for DISTINCT/UNION deduplication.
2x memory for blob data during reads (Case C only) — for multi-run blobs, data exists both in continuation records and in the reassembly buffer simultaneously. The reassembly buffer is freed on heap_reset() / handler close. For single-run blobs (Cases A/B), zero-copy pointers are used and no reassembly buffer is needed — 1x memory only.
No compression — blob data is stored uncompressed in continuation records. For the target use case (small JSON documents, short TEXT values in temporary tables), this is acceptable.
65,535 records per run — run_rec_count is stored as uint16, limiting a single multi-record continuation run to 65,535 records (Case A single-record runs have no run_rec_count field and are always exactly 1 record). Blobs requiring more are automatically split into multiple runs at this boundary. For the smallest possible recbuffer (16 bytes, visible=15), one run holds up to 5 + 65534 × 16 ≈ 1,024 KB of payload. Larger blobs use multiple runs with negligible additional overhead (10 bytes per extra run header).
max_heap_table_size enforcement — continuation records consume the same space as regular rows and are counted toward data_length. The existing max_heap_table_size / tmp_memory_table_size limits apply naturally. When a blob insert would exceed the limit, the row is rolled back and the error propagated — the SQL layer's overflow-to-disk conversion (create_internal_tmp_table_from_heap()) handles this transparently.
Expression cache disabled for blob fields — HEAP hash indexes on blob columns use a pointer-based key format (4-byte length + data pointer) that is incompatible with the SQL layer's key format used by Expression_cache_tmptable (2-byte length + inline data from Field_blob::new_key_field()). The cache is disabled when blob fields are present, restoring the pre-blob-support behavior (previously blobs forced Aria, which also disabled the cache).
FULLTEXT forces disk engine — HEAP does not support FULLTEXT indexes. When a derived table is used in a query with MATCH...AGAINST, the tmp table is created with a disk-based engine (Aria/MyISAM) from the start via early detection in mysql_derived_prepare().

Test Changes

New Tests

heap/blob.test: Comprehensive test suite for direct HEAP blob operations — basic CRUD, multiple blob types (TINYBLOB through LONGBLOB), NULL/empty values, large blobs spanning multiple continuation runs, free list reuse after DELETE, free list fragmentation with NULL-blob/large-blob interleaving, mixed NULL/non-NULL multi-column blob scavenging, large blobs exceeding the uint16 run_rec_count cap (1MB blobs requiring ~70K records, forcing automatic splitting at the 65,535 boundary), TRUNCATE, full table scan correctness, HASH and BTREE indexes with blob columns present, REPLACE, INSERT...SELECT, table-full errors with partial row verification, UPDATE failure rollback, non-blob regression, zero-copy edge cases (Case A/B/C exact boundary sizes with cross-case UPDATE), INSERT ON DUPLICATE KEY UPDATE with blobs, JSON column CRUD, LONGBLOB at uint16 run split boundary, and BTREE+blob index rejection.
heap/blob_dedup.test: Tests DISTINCT and UNION deduplication with HEAP blob tables — verifies that hash-based deduplication correctly compares blob data (not pointer values).
main/blob_sj_test.test: Regression test for the SJ DuplicateWeedout HA_BLOB_PART bug — forces DuplicateWeedout strategy via semijoin=on,firstmatch=off,loosescan=off and verifies that spurious HA_BLOB_PART flags from uninitialized key_part_flag are correctly stripped in hp_create.c.
heap/blob_group_by.test: Tests GROUP BY on blob columns in HEAP tables — verifies that key_part_flag is set correctly and that blob columns produce correct grouping results.
heap/blob_ops.test: Tests query patterns from benchmark coverage gaps — COUNT(DISTINCT), IN-subquery, GROUP BY ROLLUP, window functions (ROW_NUMBER, RANK), CTE materialization, CTE self-join, and CTE referenced twice.
heap/heap_geometry.test: Tests GEOMETRY columns in MEMORY tables. Verifies the engine is MEMORY (not silently converted to Aria), inserts 7 distinct LineString geometries, then performs 11 rounds of INSERT INTO t1 SELECT * FROM t1 doublings (7→14,336 rows). After each doubling, validates zero corrupt rows (ST_AsText(c2) IS NOT NULL for all rows). Final verification: GROUP BY confirms all 7 geometry values present with correct counts, MBRWithin() spatial predicate validates geometry data integrity. This test exercises the Case A zero-copy detection fix in hp_read_blobs() / hp_materialize_one_blob() — without the fix, multi-run blobs whose first run has run_rec_count==1 (from free-list fragmentation during INSERT...SELECT) are incorrectly treated as single-record zero-copy, causing the read to overflow past the 7-byte rec 0 payload into adjacent records' flags bytes, corrupting the WKB geometry type field.
heap/blob_fallback.test: Tests Step 3 free-list scavenge fallback — fills a constrained table, creates heavy fragmentation via alternating deletions, then verifies blob inserts succeed by scavenging scattered free-list slots.
heap/blob_stress.test: 3-phase stress test — Phase 1: 200-cycle stored procedure with mixed INSERT/DELETE/UPDATE/NULL↔non-NULL transitions and varying blob sizes (Case A/B/C), shadow table row count verification. Phase 2: near-capacity fill/fragment/refill. Phase 3: 20 grow/shrink UPDATE cycles.
heap/blob_update_overflow.test: Tests overflow-to-Aria on the UPDATE path — MAX(TEXT) with monotonically growing values triggers HA_ERR_RECORD_FILE_FULL during ha_update_tmp_row(); 4 scenarios (single MAX, content verification, multiple blob aggregates, MIN+MAX).
heap/blob_find_unique.test: Tests the materialize-then-compare optimization for find_unique_row() — EXCEPT/INTERSECT/ALL with Case C blobs, multiple blob columns, mixed sizes, PAD SPACE, mismatch loop, content integrity.
heap/blob_truncate.test: TRUNCATE on HEAP blob tables — verifies that heap_clear() correctly invalidates has_pending_blob_free without flushing (HP_BLOCK tree already destroyed), and that subsequent operations work correctly after TRUNCATE.
heap/blob_group_concat_geom.test: GROUP_CONCAT with geometry columns — verifies ~~MDEV-39761~~ fix (Field_geom::store() assertion with blob_storage), geometry expressions (IF, CASE, COALESCE), and user variable geometry assignment in GROUP_CONCAT.
heap/blob_replace_overlap.test: REPLACE blob data integrity — single/multiple blob columns at various sizes (single-record, multi-record, bulk copy paths), INSERT ON DUPLICATE KEY UPDATE as control case (MDEV-39825).
heap/blob_replication.test: Tests row-based replication of DML on HEAP blob tables — PK (rnd_pos_by_record()), no-PK (table scan with record_compare()), and REPLACE with blobs (self-referencing, VALUES, multi-blob). Validates that deferred blob chain free (~~MDEV-39732~~), deferred UPDATE free (~~MDEV-39782~~), and write_can_replace forced copy mode (MDEV-39825) keep data valid for binlog_log_row(). INSERT, UPDATE, DELETE, REPLACE scenarios with slave consistency verification.

Unit Tests

hp_test_hash-t.c (67 assertions): Hash/comparison correctness for blob key segments — hash consistency across record and key formats (Cases A/B/C), record-to-record comparison, NULL/empty blob handling, multi-segment keys (int+blob), PAD SPACE collation, DISTINCT/GROUP BY key format round-trips, key truncation bug detection, hp_hashnr vs hp_rec_hashnr consistency for mixed varchar+blob keys, packlength variants (1/2/3/4), and blob+blob multi-segment keys.
hp_test_key_setup-t.cc (34 assertions): heap_prepare_hp_create_info() key setup — blob segment normalization (seg->length, bit_start, flag, type), VARCHAR/INT/ENUM/mixed non-blob key verification, geometry GROUP BY no-widening.
hp_test_freelist-t.c (79 assertions): Free-list contiguity detection, repeated delete-reinsert cycles, Step 3 scavenge fallback, capacity exhaustion, tail reclaim on failed blob allocation (single/cross/3-block boundary), orphaned block reuse.
hp_test_scan-t.c (19 assertions): Scan correctness with continuation records.
hp_test_concurrent-t.c: Concurrent blob del_link race condition test — two threads share one HP_SHARE via separate HP_INFO handles, do insert/delete cycles with blobs. Verifies that serializing hp_flush_pending_blob_free() under mutex (matching external_lock(F_UNLCK) under thr_lock) prevents del_link corruption.

Modified Tests

Several existing tests required adjustments because HEAP's row ordering differs from Aria's, and because I_S temp tables now use MEMORY instead of Aria:

main/select.test: Added ORDER BY fld3 to a GROUP BY ... LIMIT query. Without an explicit ORDER BY, LIMIT selects a non-deterministic window that depends on the temp table engine's physical row ordering.
main/cte_recursive.test: Added --sorted_result before a recursive CTE query where rows with identical (level, mid) values (due to integer overflow) have non-deterministic order.
main/distinct.test: Added --replace_regex to mask a ROLLUP row's b value — the value depends on which group is processed last, which varies by temp table engine.
main/group_by.test: Changed Created_tmp_disk_tables expected value from 1 to 0 — a query with (int, decimal, blob) columns now uses HEAP instead of Aria for the temp table.
main/intersect_all.test: Added --sorted_result before UNION ALL queries without ORDER BY whose row ordering depends on the temp table engine.
main/information_schema_part.test: Added subpartition_name to ORDER BY — within the same partition, subpartition ordering was non-deterministic.
main/temp_table_symlink.test: Removed the I_S query that tested disk temp file symlink conflicts — with HEAP blob support, I_S temp tables stay in memory and create no disk files, so the symlink attack vector does not apply.
main/information_schema.result, main/information_schema_parameters.result, main/information_schema_routines.result: Updated expected engine name from Aria to MEMORY for I_S system views that contain TEXT/BLOB columns.
main/derived_view.result: Query plan improvement — HEAP temp table enables ref access via hash index instead of block nested loop join.
main/select_pkeycache.result, main/select_jcl6.result: Re-recorded for row ordering differences.
main/create.test: Removed --error ER_TABLE_CANT_HANDLE_BLOB — HEAP now accepts blob columns. Kept as a positive test.
sys_vars/tmp_disk_table_size_basic.test: Removed -~~error ER_RECORD_FILE_FULL on I_S views query — with HEAP blob support, the I_S temp table stays in memory and is not affected by tmp_disk_table_size. Added~~ -sorted_result before SELECT table_name FROM INFORMATION_SCHEMA.views — row ordering is non-deterministic without ORDER BY and changed with HEAP temp tables.
funcs_1/is_events.result: Updated ENGINE=MEMORY in SHOW CREATE for I_S events table.
innodb_fts/innodb-fts-ddl.test: Added FTS_DOC_ID tiebreaker to ORDER BY for deterministic ordering of rows with identical FULLTEXT scores.
main/group_min_max_innodb.test: Added --replace_column 2 # to mask non-deterministic blob value in GROUP BY query — the serialized_c column is not aggregated, so which row is picked for voter_id=4 (two rows with different blob values) depends on internal access order.
funcs_1/is_tables_is.result, funcs_1/is_tables_is_embedded.result: Updated ENGINE from MYISAM_OR_MARIA to MEMORY and ROW_FORMAT from DYNAMIC_OR_PAGE to Fixed for I_S tables that now stay in MEMORY.
funcs_1/is_routines_embedded.result, funcs_1/is_triggers_embedded.result, funcs_1/is_views_embedded.result: Updated ENGINE=MEMORY in SHOW CREATE for embedded server variants of I_S tests. is_routines_embedded also updated for the mtr_check.sql stored procedure body change (ORDER BY columns_in_mysql, ordinal_position).
funcs_1/is_columns.result, funcs_1/is_routines.result, funcs_1/is_triggers.result, funcs_1/is_views.result: Updated ENGINE=MEMORY in SHOW CREATE for I_S system views.
funcs_1/processlist_priv_ps.result, funcs_1/processlist_val_ps.result: Updated ENGINE=MEMORY in SHOW CREATE for processlist I_S tables (ps-protocol variants).
sysschema/v_schema_redundant_indexes.result: Re-recorded for non-deterministic row ordering.
innodb_fts/misc.test: Added id column as tiebreaker to ORDER BY MATCH...AGAINST queries — rows with equal FTS relevance scores have non-deterministic ordering that changed when intermediate temp tables use HEAP instead of Aria.
funcs_1/processlist_priv_no_prot.result, funcs_1/processlist_val_no_prot.result: Updated ENGINE=MEMORY in SHOW CREATE for processlist I_S tables (no-protocol variants).
include/mtr_check.sql: Added ordinal_position tiebreaker to the INFORMATION_SCHEMA.COLUMNS query's ORDER BY clause — the query ordered by columns_in_mysql (table_schema.table_name) only, so columns within the same table had non-deterministic ordering that changed when I_S temp tables use HEAP instead of Aria. This change also required updating result files that embed the stored procedure body text: funcs_1/is_routines_embedded.result and plugins/sql_error_log_withdbinfo.result.
perfschema/include/transaction_nested_events_verifier.inc: Added r_end_event_id tiebreaker to the UNION query's ORDER BY clause — the combined transaction+statement history query ordered by thread_id, r_event_id, but rows with the same event_id and different end_event_id values (e.g. an in-progress transaction and its completed counterpart) had non-deterministic ordering that changed when intermediate temp tables use HEAP instead of Aria.
perfschema/transaction_nested_events.result: Re-recorded to match the stable ordering from the verifier fix above.
main/partition_explicit_prune.result: Updated for HANDLER_READ_RND_DELETED counter changes from HEAP blob continuation record scanning.
main/column_compression.test: Added a diagnostic SELECT before GROUP_CONCAT to provide deterministic ordering context for the following aggregate query.
main/count_distinct.test: Wrapped non-aggregated columns (t2.isbn, t1.libname) in MIN() — without aggregation, GROUP BY picks a non-deterministic row per group, which differs between HEAP and Aria temp table engines.
main/func_group.result, main/metadata.result, main/mysql_client_test.result: Updated wire protocol flags — UNION/DISTINCT/GROUP BY columns going through HEAP temp tables now have BLOB_FLAG (bit 2 = 4) set in metadata because HEAP stores the intermediate result in a blob-capable table.
main/type_bit.result, main/type_enum.result, main/type_set.result: Same wire protocol BLOB_FLAG change for DISTINCT/GROUP BY/UNION results on BIT, ENUM, and SET column types.
main/status.result: Handler counter changes — Handler_read_key 9→13 (HEAP hash lookups vs Aria index), Handler_read_rnd 7→6, Handler_tmp_write 7→6, Created_tmp_disk_tables 1→0 (temp table stays in HEAP).
main/tmp_table_error.test: Added SET tmp_memory_table_size=0 to force Aria temp tables — the test expects an Aria-specific "too many columns" error that requires the temp table to use Aria, not HEAP.
main/max_session_mem_used.test, perfschema/misc_session_status.test: Changed not_valgrind.inc to not_valgrind_build.inc — infrastructure adjustment for valgrind detection.
main/type_bit.test: Added --disable_cursor_protocol around metadata-sensitive DISTINCT/GROUP BY queries — cursor mode loses UNIQUE_KEY_FLAG on temp table columns, causing result mismatches.
heap/blob_big2.test: Reduced tmp_memory_table_size from 65536 to 32768 so UNION DISTINCT overflows to disk on 32-bit platforms (smaller recbuffer = more rows fit in memory before overflow).
heap/blob_big1.test, heap/blob_big3.test: Replaced exact Created_tmp_% counts with ON/OFF checks — view protocol creates additional temp tables whose overflow behavior depends on pointer size.
heap/blob_update_overflow.test: Added --disable_ps_protocol around FLUSH STATUS / SHOW STATUS sections — PS protocol creates extra internal temp tables that inflate Created_tmp_tables counters.
mroonga/storage/fulltext_order_natural_language_mode_different_match.test: Added , title tiebreaker to ORDER BY for equal MATCH scores; masked Created_tmp_tables counter with --replace_column. Removed unnecessary SHOW STATUS LIKE 'Created_tmp%' (~~MDEV-39703~~).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

gcov-missing-MDEV-38975-8641e4c.txt
2026-06-02 10:08
8 kB
Elena Stepanova
my_dynmem.c
2026-03-05 13:36
25 kB
Michael Widenius
Screenshot_20260304_213449.png
2026-03-05 02:36
85 kB
Arcadiy Ivanov
Screenshot_20260318_003255.png
2026-03-18 13:50
84 kB
Arcadiy Ivanov

Issue Links

blocks

MDEV-40032 Promote wide VARCHAR to BLOB for HEAP internal temp tables

In Review

MDEV-40033 HEAP: free-list contiguous block coalescing

Open

causes

MDEV-39702 main.type_bit fails with a wrong result with cursor-protocol, no UNIQUE_KEY flag for DISTINCT

Closed

MDEV-39703 mroonga/storage.fulltext_order_natural_language_mode_different_match fails with a wrong result with ps-protocol

Closed

MDEV-39704 mroonga tests fail sporadically with a wrong result

Closed

MDEV-39723 Unexpected ER_DUP_ENTRY upon GROUP BY query from a table with blob

Closed

MDEV-39724 Server crashes in hp_is_single_rec upon query with group by and variable assignment

Closed

MDEV-39732 Slave crashes in hp_free_run_chain upon applying row event on HEAP table with blobs

Closed

MDEV-39761 Assertion `!table->blob_storage' failed in Field_geom::store

Closed

MDEV-39782 RBR fails with ER_KEY_NOT_FOUND upon UPDATE on HEAP table with blob

Closed

MDEV-39825 ASAN memcpy-param-overlap in hp_write_run_data with blob in HEAP table

Open

MDEV-39871 Server crashes in my_hash_sort_bin upon GROUP BY with geometry field

In Testing

MDEV-40376 Protocol assertion fails or server hangs upon query using temporary table

In Review

MDEV-40378 HEAP table with blob gets corrupted after hitting ER_RECORD_FILE_FULL, assertion failure in diagnostics area

In Review

MDEV-40447 HEAP initial block allocation is sized from max_records, causing per-statement giant mmap/munmap and mmap_lock contention with large tmp_table_size

In Review

MDEV-40448 HEAP unique hash duplicate rejection inserts the key then deletes it, re-hashing full BLOB values (~1.8x collation hashing vs Aria)

In Review

relates to

MDEV-19 Memory tables: VARCHAR and BLOB support

Stalled

MDEV-40033 HEAP: free-list contiguous block coalescing

Open

split to

MDEV-40032 Promote wide VARCHAR to BLOB for HEAP internal temp tables

In Review

has action item: DOCS-6177 Loading...

(11 causes, 2 relates to, 1 split to, 1 has action item)

Activity

People

Assignee:: Elena Stepanova

Reporter:: Arcadiy Ivanov

Assigned for Implementation:: Georgi Kodinov

Votes:: 0 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 2026-03-05 02:47

Updated:: 3 days ago 18:26

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

13.25d

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

Details

Description

HEAP Engine: BLOB/TEXT/JSON/GEOMETRY Column Support

Motivation

Architecture

Existing HEAP Row Format

BLOB Representation in the Record Buffer

Continuation Record Chains

Flags Byte

Variable-Length Continuation Runs

Blob Storage Format

Run Allocation Strategy

Multiple BLOB Columns

Record Counting

Reassembly Buffer

Changes

HEAP Engine Core (storage/heap/)

Handler Interface (ha_heap.h / ha_heap.cc)

Data Structures (include/heap.h / heapdef.h)

Field_blob_key — HEAP-Native Blob Key Format (sql/field.h / sql/field.cc)

SQL Layer (sql/sql_select.cc)

SQL Layer (sql/item_sum.cc)

SQL Layer — Geometry GROUP_CONCAT Fix (MDEV-39761)

SQL Layer — Geometry GROUP BY Key Format Fix (MDEV-39871)

SQL Layer — Expression Cache (sql/sql_expression_cache.cc)

SQL Layer — FULLTEXT Early Detection (sql/sql_derived.cc)

Performance Impact

Non-Blob Tables: Zero Regression

Blob Tables: Performance Characteristics

Net Effect on Temporary Table Workloads

Limitations

Test Changes

New Tests

Unit Tests

Modified Tests

Attachments

Attachments

Issue Links

Activity

People

Dates

Time Tracking

Git Integration

HEAP Engine Core (`storage/heap/`)

Handler Interface (`ha_heap.h` / `ha_heap.cc`)

Data Structures (`include/heap.h` / `heapdef.h`)

`Field_blob_key` — HEAP-Native Blob Key Format (`sql/field.h` / `sql/field.cc`)

SQL Layer (`sql/sql_select.cc`)

SQL Layer (`sql/item_sum.cc`)

SQL Layer — Expression Cache (`sql/sql_expression_cache.cc`)

SQL Layer — FULLTEXT Early Detection (`sql/sql_derived.cc`)