  MariaDB Server / MDEV-38975

BLOBs in MEMORY (HEAP)



    Description

      HEAP Engine: BLOB/TEXT/JSON/GEOMETRY Column Support

      Motivation

      The growing adoption of JSON columns (MYSQL_TYPE_BLOB-based), Dynamic Columns (COLUMN_CREATE/COLUMN_GET), and GEOMETRY types means that an increasing number of SQL operations — GROUP BY, DISTINCT, UNION, subqueries, CTEs — materialize their intermediate results into temporary tables that contain BLOB/TEXT fields.

      Today, the HEAP (MEMORY) engine unconditionally rejects BLOB/TEXT columns (HA_NO_BLOBS flag in ha_heap::table_flags()). This forces the optimizer to use Aria (or MyISAM) temporary tables whenever any blob-typed field appears in the result set — even when the actual data is small (a 50-byte JSON document, a short TEXT value, etc.). The overhead of Aria internal temporary tables includes: creating and deleting a temporary file on disk (.MAI + .MAD), and managing data through the Aria page cache. Note that Aria temp tables are non-transactional (ci->transactional= FALSE) — there is no transaction log overhead and no fsync(). Under memory pressure, dirty pages may be evicted to disk via pwrite() (a plain write to the OS page cache, not a physical sync).

      This situation will only get worse as internally structured columns (JSON, Dynamic Columns, VECTOR, spatial types) gain popularity. Queries over tables with a single JSON column that would otherwise fit entirely in memory are forced to disk simply because the storage engine cannot represent the blob descriptor.

      This feature removes the HA_NO_BLOBS restriction from the HEAP engine, allowing the optimizer to keep temporary tables with BLOB/TEXT columns in memory when they fit within max_heap_table_size / tmp_memory_table_size limits.

      Additionally, the MEMORY engine now advertises HA_CAN_GEOMETRY, enabling explicit CREATE TABLE ... ENGINE=MEMORY with GEOMETRY/POINT/LINESTRING/POLYGON columns. Previously, GEOMETRY columns were rejected by the MEMORY engine because they are internally represented as Field_blob. With blob support in place, GEOMETRY columns work naturally — spatial data is stored in the same continuation record chains as any other BLOB/TEXT data. All spatial functions (ST_AsText(), MBRWithin(), etc.) work correctly with MEMORY-stored geometry values.

      The patch also implements full HASH index key handling for BLOB columns. The HEAP hash functions (hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(), hp_hashnr()) are extended to handle HA_BLOB_PART key segments, reading actual blob data via pointer dereference or chain materialization. This enables GROUP BY, DISTINCT, and UNION deduplication to work correctly on BLOB/TEXT columns entirely within the MEMORY engine. A hash pre-check optimization skips expensive blob materialization when hash values differ, and PAD SPACE collation semantics are preserved for blob key comparisons. The SQL layer bridges between its own key format (2-byte length + inline data from Field_blob::new_key_field()) and HEAP's native format (4-byte length + data pointer from hp_make_key()) via precomputed HP_KEYDEF::has_blob_seg flags and key rebuild logic in ha_heap::index_read_map() and related methods.


      Architecture

      Existing HEAP Row Format

      HEAP stores rows as fixed-length records in a hierarchical block tree (HP_BLOCK). Each row occupies exactly recbuffer bytes, computed as MY_ALIGN(MAX(reclength, sizeof(char*)) + 1, sizeof(char*)) — i.e. the record length (minimum sizeof(char*)) plus a flags byte, aligned up to pointer size (8 bytes on 64-bit). share->visible equals MAX(reclength, sizeof(char*)) — the byte offset of the flags byte. A single flags byte at offset share->visible marks the row as active or deleted. Deleted rows store a free-list pointer in their first sizeof(uchar*) bytes, forming a singly-linked free list through share->del_link.
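The sizing arithmetic above can be sketched as follows (a minimal illustration with hypothetical sk_-prefixed names, not the engine's own macros):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the HEAP record sizing described above.
 * SK_ALIGN rounds len up to a multiple of align (a power of two),
 * mirroring what MY_ALIGN does in the server. */
#define SK_ALIGN(len, align) (((len) + (align) - 1) & ~((size_t)(align) - 1))
#define SK_MAX(a, b) ((a) > (b) ? (a) : (b))

/* visible = MAX(reclength, sizeof(char*)): byte offset of the flags byte. */
static size_t sk_visible(size_t reclength)
{
  return SK_MAX(reclength, sizeof(char *));
}

/* recbuffer = MY_ALIGN(visible + 1, sizeof(char*)): full record slot size
 * (payload plus flags byte, aligned up to pointer size). */
static size_t sk_recbuffer(size_t reclength)
{
  return SK_ALIGN(sk_visible(reclength) + 1, sizeof(char *));
}
```

With reclength = 15 this gives visible = 15 and recbuffer = 16, the narrow-table example used later in the space-efficiency table.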

      BLOB Representation in the Record Buffer

      In the MySQL/MariaDB record format, BLOB fields store only a descriptor in the fixed-width record buffer: a packlength-byte length prefix (1–4 bytes depending on BLOB type) followed by an 8-byte data pointer. The actual BLOB data lives in memory owned by the SQL layer (Field_blob::value). This descriptor is only 9–12 bytes regardless of BLOB data size, so it fits naturally in HEAP's fixed-width rows.
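A sketch of decoding that descriptor (hypothetical helpers; the server uses Field_blob methods and the uintN_korr() macros, and this sketch assumes a little-endian host):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read the packlength-byte length prefix (little-endian host assumed). */
static uint32_t sk_blob_length(const unsigned char *field_pos, unsigned packlength)
{
  uint32_t len = 0;
  memcpy(&len, field_pos, packlength);
  return len;
}

/* Read the data pointer that follows the length prefix. */
static unsigned char *sk_blob_data(const unsigned char *field_pos, unsigned packlength)
{
  unsigned char *ptr;
  memcpy(&ptr, field_pos + packlength, sizeof(ptr));
  return ptr;
}

/* Round-trip demo: build a 2-byte-packlength descriptor and decode it. */
static int sk_descriptor_demo(void)
{
  static unsigned char payload[4];
  unsigned char desc[2 + sizeof(unsigned char *)];
  uint32_t len = 4;
  unsigned char *p = payload;
  memcpy(desc, &len, 2);
  memcpy(desc + 2, &p, sizeof(p));
  return sk_blob_length(desc, 2) == 4 && sk_blob_data(desc, 2) == payload;
}
```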

      The challenge is: where does the actual BLOB data live once the row is written into the HEAP engine?

      Continuation Record Chains

      BLOB data is stored using continuation records — additional fixed-size records allocated from the same HP_BLOCK that holds regular rows. This approach was chosen because it:

      • Reuses existing HP_BLOCK allocation, free list, and size accounting with minimal structural change
      • Ensures data_length tracking and max_table_size enforcement work automatically (continuation records consume regular row records)
      • Avoids per-BLOB my_malloc() calls — piggybacks on the amortized block allocation (~1 malloc per records_in_block rows)

      Flags Byte

      The existing visibility byte is extended into a flags byte:

      Bit  Name                  Meaning
      0    HP_ROW_ACTIVE         Record is active (not deleted)
      1    HP_ROW_HAS_CONT       Primary record has blob continuation chain(s)
      2    HP_ROW_IS_CONT        This record IS a continuation record
      3    HP_ROW_CONT_ZEROCOPY  Zero-copy layout: data in rec 1..N-1 only (no data in rec 0 payload)

      Variable-Length Continuation Runs

      To minimize per-record overhead, continuation records are grouped into runs — contiguous sequences of recbuffer-sized records within the same leaf block. Only the first record of each run carries a header; subsequent records are pure payload with no flags byte. Runs are linked via next_cont pointers.

      Case A (single-record run; HP_ROW_SINGLE_REC flag, no header):

      byte 0                                                         visible   recbuffer
        |                                                               |         |
        v                                                               v         v
        +---------------------------------------------------------------+---------+
        | BLOB data payload (visible bytes, full width)                 | flags   |
        |                                                               | byte    |
        +---------------------------------------------------------------+---------+
      

      Case B/C (first record of a multi-record run, rec 0):

      byte 0                                                         visible   recbuffer
        |                                                               |         |
        v                                                               v         v
        +--------------+---------------+--------------------------------+---------+
        | next_cont    | run_rec_count | BLOB data payload              | flags   |
        | (8 bytes)    | (2 bytes)     | (visible - 10 bytes)           | byte    |
        +--------------+---------------+--------------------------------+---------+
      

      Inner records (rec 1..N-1) — no flags byte:

      byte 0                                                                    recbuffer
        |                                                                         |
        v                                                                         v
        +-------------------------------------------------------------------------+
        | BLOB data payload (recbuffer bytes, full width)                         |
        +-------------------------------------------------------------------------+
      

      • next_cont (8 bytes): pointer to the next run, or NULL if last. Shares position with del_link — when a record is active it's next_cont; when deleted it's del_link. No conflict. Not present in Case A.
      • run_rec_count (uint16, 2 bytes): number of consecutive records in this run. Maximum 65,535 records per run — blobs requiring more are split into multiple runs automatically. Not present in Case A.
      • Payload per Case A record: visible bytes.
      • Payload per Case B/C run of N records: (visible - 10) + (N - 1) × recbuffer bytes.

      For multi-record runs, this design amortizes the 10-byte per-run header overhead across many records, achieving near-100% space efficiency. For single-record runs (Case A), the header is eliminated entirely:

      recbuffer  visible  Case A (1 rec, no header)  Case B/C 10-record run             Case B/C 100-record run
      16         15       15 bytes (100%)            5 + 9×16 = 149 bytes (93%)         5 + 99×16 = 1,589 bytes (99%)
      504        500      500 bytes (100%)           490 + 9×504 = 5,026 bytes (99%)    490 + 99×504 = 50,386 bytes (99.9%)
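The run capacity arithmetic above can be sketched as a records-needed function (illustrative names; this packs data into rec 0's payload as in Case C, whereas Case B leaves rec 0's payload unused):

```c
#include <assert.h>

/* 8-byte next_cont + 2-byte run_rec_count, per the layout above. */
#define SK_CONT_HEADER 10

/* Records needed to store data_len bytes in a single run (Case A or C). */
static unsigned sk_run_records_needed(unsigned data_len, unsigned visible,
                                      unsigned recbuffer)
{
  unsigned rest;
  if (data_len <= visible)              /* Case A: one record, no header */
    return 1;
  /* Case C: rec 0 holds (visible - SK_CONT_HEADER) payload bytes, every
     further record holds a full recbuffer bytes. */
  rest = data_len - (visible - SK_CONT_HEADER);
  return 1 + (rest + recbuffer - 1) / recbuffer;
}
```

For the narrow-table row above (recbuffer = 16, visible = 15), a 149-byte blob needs exactly the 10-record run shown, and a 32-byte blob needs 3 records.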

      Blob Storage Format (enum hp_blob_format)

      All write, read, free, scan, and integrity-check paths use a single detection mechanism: hp_blob_run_format(chain, visible) decodes the flags byte into enum hp_blob_format. Three formats:

      Case A (HP_BLOB_CASE_A_SINGLE_REC, flag HP_ROW_SINGLE_REC): single record, no header. Data at offset 0, full visible bytes. Read: zero-copy at chain.
      Case B (HP_BLOB_CASE_B_ZEROCOPY, flag HP_ROW_CONT_ZEROCOPY): single run, multiple records. Header in rec 0, data contiguous in rec 1..N-1; inner records have no flags byte. Read: zero-copy at chain + recbuffer.
      Case C (HP_BLOB_CASE_C_MULTI_RUN, neither flag): one or more runs linked via next_cont. Header + data in each run's rec 0, plus inner records. Read: reassembly into blob_buff.

      The HP_ROW_CONT_ZEROCOPY flag on rec 0's flags byte distinguishes Case B from Case C. Without it, a free-list single-run blob (which uses Case C layout because it was allocated before we knew it would be the only run) would be indistinguishable from a tail-allocated single-run blob (Case B). The flag is set during write when: (1) the blob fits in a single run, (2) the run has multiple records, and (3) the run is allocated entirely from the block tail (not the free list).

      For Case B, rec 0's data payload area is unused (wasted) — all data goes into rec 1..N-1. This trade-off wastes at most visible - 10 bytes (typically 5–490 bytes) but enables zero-copy reads for the entire blob without any memcpy into blob_buff.
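The dispatch can be sketched as follows (the real decode is hp_blob_run_format(); bit values follow the flags table above, except HP_ROW_SINGLE_REC, whose bit position is not stated in this document and is assumed here):

```c
#include <assert.h>

enum sk_blob_format { SK_CASE_A, SK_CASE_B, SK_CASE_C };

#define SK_ROW_ACTIVE        (1u << 0)
#define SK_ROW_HAS_CONT      (1u << 1)
#define SK_ROW_IS_CONT       (1u << 2)
#define SK_ROW_CONT_ZEROCOPY (1u << 3)
#define SK_ROW_SINGLE_REC    (1u << 4)  /* bit position assumed, not documented above */

/* Decode rec 0's flags byte into the storage format, as described above. */
static enum sk_blob_format sk_blob_run_format(unsigned char flags)
{
  if (flags & SK_ROW_SINGLE_REC)
    return SK_CASE_A;                   /* single record, no header */
  if (flags & SK_ROW_CONT_ZEROCOPY)
    return SK_CASE_B;                   /* single tail-allocated run, zero-copy */
  return SK_CASE_C;                     /* neither flag: multi-run, reassemble */
}
```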

      Run Allocation Strategy

      Runs are allocated in two phases:

      1. Free list peek-then-unlink: Walk the free list read-only (following next pointers without unlinking), tracking contiguous groups of adjacent records. Only descending-address adjacency is checked — hp_free_run_chain() frees records in ascending order, so LIFO pushes create descending sequences on the free list. On discontinuity: if the accumulated group qualifies (>= min_run_records), unlink exactly those records and use them for a run, then continue scanning for more data. If the group does not qualify, the free list is fragmented — stop immediately and fall through to tail allocation. The free list is only modified when a qualifying contiguous group is actually found. Since run_rec_count is uint16, individual runs are capped at 65,535 records; larger contiguous groups are automatically split into multiple runs at this boundary.
        The min_run_records threshold prevents excessive chain fragmentation for large blobs while still allowing small blobs to reuse free-list slots efficiently. It is computed as: MAX(blob_length / 10, 128) / recbuffer, with a floor of 2 records. However, both min_run_bytes and min_run_records are capped at the blob's actual size: min_run_bytes is capped at data_len, and min_run_records is capped at total_records_needed (the total records the blob requires). This prevents the 128-byte floor from making free-list reuse impossible for small blobs on narrow tables — e.g. with recbuffer=16, the floor would produce min_run_records=8, but a 32-byte blob only needs 3 records, so the cap lowers the threshold to 3.
      2. Tail allocation: Remaining data is allocated from the block tail (hp_alloc_from_tail()), which is always contiguous within a leaf block. A new run starts at each block boundary. Tail-allocated single-run blobs use Case B (multiple records, zero-copy) when the data fits; otherwise they fall back to Case C. Case A (single record, no header) can be allocated from either the free list or the tail — it has no contiguity requirements since it is a single standalone record.

      This strategy exploits the fact that when a blob row is deleted, hp_free_blobs() pushes continuation records to the free list consecutively. When re-allocated, they naturally appear as contiguous sequences in the free list (LIFO order). Fragmentation only occurs when non-blob inserts consume some freed records in between.
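The threshold computation in step 1 can be sketched as follows (illustrative; this follows the records-side formula and its total_records_needed cap, leaving out the parallel min_run_bytes cap):

```c
#include <assert.h>

#define SK_MAX(a, b) ((a) > (b) ? (a) : (b))
#define SK_MIN(a, b) ((a) < (b) ? (a) : (b))

/* min_run_records = MAX(MAX(blob_length / 10, 128) / recbuffer, 2),
 * capped at the total records the blob actually needs. */
static unsigned sk_min_run_records(unsigned blob_length, unsigned recbuffer,
                                   unsigned total_records_needed)
{
  unsigned min_run_bytes = SK_MAX(blob_length / 10, 128u);
  unsigned min_run_records = SK_MAX(min_run_bytes / recbuffer, 2u);
  return SK_MIN(min_run_records, total_records_needed);
}
```

With recbuffer = 16, the 128-byte floor yields 8 records, but a 32-byte blob needing only 3 records is capped to 3, as in the example above.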

      Multiple BLOB Columns

      Each BLOB column has its own independent continuation chain. The primary row stores the standard (packlength, data_pointer) descriptor for each blob field. On write, the data pointer is overwritten in the stored row to point to the first continuation run. On read, each chain is walked independently. Each blob column is independently classified as Case A, B, or C — some may use zero-copy while others require reassembly into blob_buff.

      Record Counting

      A new counter HP_SHARE::total_records tracks all physical records (primary + continuation). HP_SHARE::records remains the logical (primary-only) count. This separation is critical because HEAP's linear hash uses records for bucket mapping — inflating it with continuation records would corrupt the hash table.

      Counter               Meaning                                         Used by
      share->records        Logical (primary) record count                  Hash bucket mapping, blength, statistics, optimizer
      share->total_records  All physical records (primary + continuation)   Block position arithmetic, scan boundary, heap_check_heap()
      share->deleted        Deleted records on the free list (both types)   Free-list management

      Invariant: total_records = records + continuation_count. For non-blob tables, total_records == records always.

      Reassembly Buffer

      On read, blob data that requires reassembly (Case C) is copied from continuation chains into a per-handle buffer (HP_INFO::blob_buff). This follows the same pattern as InnoDB's blob_heap and MyISAM's rec_buff: a single buffer holds all blob data for the current row, grown via my_realloc() as needed, and freed on heap_reset() / HA_EXTRA_NORMAL / heap_close().

      For zero-copy cases (A and B), blob field pointers are set directly into the HP_BLOCK continuation records — no blob_buff allocation or copy is needed. The HP_INFO::has_zerocopy_blobs flag tracks whether the last hp_read_blobs() call produced any zero-copy pointers. This flag is used by heap_update() to refresh the caller's record buffer after an update frees old continuation chains — without this, zero-copy pointers in the caller's buffer would become dangling pointers into freed records.

      BLOB field pointers in record[0] are rewritten to point into either blob_buff (Case C) or directly into HP_BLOCK records (Cases A/B) after reassembly. The buffer/pointers remain valid until the next row fetch — the SQL layer's Field_blob::val_str() dereferences these pointers with no additional copy.
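That buffer pattern can be sketched as follows (hypothetical names; the engine uses my_realloc() with PSI instrumentation rather than plain realloc()):

```c
#include <assert.h>
#include <stdlib.h>

/* One reassembly buffer per handle, grown on demand and reused across rows. */
struct sk_blob_buff { unsigned char *buf; size_t len; };

/* Ensure the buffer can hold `needed` bytes; it never shrinks between rows. */
static unsigned char *sk_blob_buff_reserve(struct sk_blob_buff *b, size_t needed)
{
  if (needed > b->len)
  {
    unsigned char *p = realloc(b->buf, needed);
    if (!p)
      return NULL;                      /* caller maps this to an OOM error */
    b->buf = p;
    b->len = needed;
  }
  return b->buf;                        /* valid until the next row fetch / reset */
}

/* Demo: grow once, then a smaller request reuses the same allocation. */
static int sk_blob_buff_demo(void)
{
  struct sk_blob_buff b = { NULL, 0 };
  unsigned char *first = sk_blob_buff_reserve(&b, 64);
  int ok = first != NULL && b.len == 64 &&
           sk_blob_buff_reserve(&b, 16) == first && b.len == 64;
  free(b.buf);
  return ok;
}
```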

      Zero-copy and UPDATE safety: heap_update() uses a write-before-free strategy — new blob chains are written before old ones are freed, so old data survives until the new data is successfully in place. After freeing old chains, heap_update() checks has_zerocopy_blobs on the current handle and refreshes the caller's record buffer with new chain pointers via hp_read_blobs(), preventing dangling zero-copy pointers. Under HEAP's current table-level locking (thr_lock), other handles are blocked by the table lock during UPDATE, so they cannot hold active zero-copy pointers into the freed chains. For internal temporary tables (single handle, single thread), this is definitively safe. If HEAP ever gains row-level or concurrent access in the future, a create option to disable zero-copy (forcing all blobs through blob_buff reassembly) may be needed to prevent cross-handle dangling pointers.


      Changes

      HEAP Engine Core (storage/heap/)

      New file: hp_blob.c (~850 lines)

      • hp_write_blobs() — writes all blob columns for a row into continuation chains. On failure, rolls back any partially written chains. Supports all three formats: Case A (single record, no header), Case B (single run, zero-copy), and Case C (multi-run, with headers).
      • hp_read_blobs() — two-pass approach: first pass sums data lengths for Case C blobs to size blob_buff; second pass sets zero-copy pointers (Cases A/B) or reassembles into blob_buff (Case C). Sets has_zerocopy_blobs flag.
      • hp_free_blobs() — frees all continuation chains for a row (called on DELETE before free-list insertion).
      • hp_free_run_chain() — frees a single continuation chain (individual records returned to free list).
      • hp_materialize_one_blob() — materializes a single blob column from a stored continuation chain. Returns zero-copy pointer for Cases A/B, or reassembles into blob_buff for Case C. Used by hash comparison functions when comparing stored records.
      • hp_alloc_from_tail() — allocates a single record from the HP_BLOCK tail, bypassing the free list. Used for guaranteed-contiguous allocation.
      • hp_unlink_and_write_run() — unlinks a contiguous group of records from the free list and writes blob data into them.
      • Internal helpers for run data writing.

      hp_create.c

      • Blob descriptor array (HP_BLOB_DESC) allocated alongside HP_SHARE in a single allocation.
      • visible_offset minimum enforced for blob tables to ensure at least 1 byte of payload beyond the continuation header (prevents underflow for pathological single-TINYBLOB tables with reclength as low as 9).
      • Key segment setup: HA_BLOB_PART flag validated against blob_descs[] array — spurious flags (from uninitialized key_part_flag in SJ weedout tables, or BLOB_FLAG on non-Field_blob INFORMATION_SCHEMA temp tables) are stripped. Real blob segments get bit_start set to actual packlength (1–4) instead of the generic VARCHAR bit_start=2.
      • Key length calculation accounts for blob segments: 4 + sizeof(uchar*) (length + data pointer) in the key buffer.
      • Precomputed HP_KEYDEF::has_blob_seg flag set during table creation when any key segment has HA_BLOB_PART, avoiding per-call loops through key segments.

      hp_write.c

      • heap_write() calls hp_write_blobs() after memcpy. New err_blob: label for correct key rollback when blob write fails (all keys already written — must iterate all keydef entries, not skip the current one). Neither err: nor err_blob: calls hp_free_blobs(): err: is reached before memcpy/hp_write_blobs() (stale data in pos); err_blob: is reached after hp_write_blobs() fails, but hp_write_blobs() is self-cleaning (frees its own partial chains, NULLs all pointers). Calling hp_free_blobs() from err_blob: would crash because the flags byte is uninitialized and blob columns after the failed one still have SQL layer data pointers (not chain pointers).
      • next_free_record_pos() uses block.last_allocated instead of share->records for position arithmetic (since records no longer equals total physical records). Made non-static for use by hp_blob.c.

      hp_delete.c

      • hp_free_blobs() called before free-list insertion.
      • total_records decremented on delete.

      hp_update.c

      • Per-column unchanged blob detection: before rewriting chains, each blob column's old and new values are compared using length (O(1)), then data pointer (O(1)), then memcmp (O(n) with early exit). Unchanged blobs keep their existing continuation chains with no allocation, copy, or free. Only changed blobs get new chains written. This avoids unnecessary chain churn for UPDATE of non-blob columns, INSERT ... ON DUPLICATE KEY UPDATE with unchanged blob values, and REPLACE with identical blob data.
      • Write-before-free strategy for changed blobs: new chains are written before old ones are freed. On failure, old record and chain pointers are restored. This ensures crash-safety on OOM/table-full during blob grow.
      • After update, when has_zerocopy_blobs is set or any blob changed, refreshes the caller's record buffer by copying chain pointers from the stored row and calling hp_read_blobs(). This prevents dangling zero-copy pointers from pointing into freed old chains (e.g. in the INTERSECT ALL unfold path where the caller reuses the record buffer after update).
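The per-column unchanged-blob check in the first bullet can be sketched as (illustrative helper, not the engine's function):

```c
#include <assert.h>
#include <string.h>

/* Compare old and new blob values: length first (O(1)), then the data
 * pointer (O(1)), and only then the bytes (memcmp with early exit). */
static int sk_blob_unchanged(const unsigned char *old_data, size_t old_len,
                             const unsigned char *new_data, size_t new_len)
{
  if (old_len != new_len)
    return 0;                           /* lengths differ: definitely changed */
  if (old_data == new_data)
    return 1;                           /* same pointer: unchanged, skip memcmp */
  return memcmp(old_data, new_data, old_len) == 0;
}
```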

      hp_scan.c

      • Continuation records (HP_ROW_IS_CONT) are skipped during table scans. Rec 0 of each run has the flags byte; inner records have no flags byte. The scan reads run_rec_count from the header and batch-skips the entire run (returned as HA_ERR_RECORD_DELETED to the caller, same as actual deleted records).
      • Scan boundary uses total_records + deleted instead of records + deleted.

      hp_hash.c — hash function and comparison for blob key segments

      • hp_rec_hashnr() — for HA_BLOB_PART segments: reads blob length using actual packlength, dereferences data pointer, hashes actual blob data. Input records always have valid blob pointers.
      • hp_rec_key_cmp() — blob segment comparison with materialization support. Takes new HP_INFO *info parameter. When info is non-NULL, the second record is treated as stored (blob pointers are continuation chain heads) and materialized via hp_materialize_one_blob(). When info is NULL (e.g. heap_update() comparing two input records), both sides are dereferenced directly. Length-based early exits are gated by MY_CS_NOPAD to preserve PAD SPACE collation semantics where trailing spaces are insignificant (e.g. 'a' and 'a ' compare equal).
      • hp_key_cmp() — blob segment comparison between stored record and pre-built search key. Stored side is materialized from chain; key side has 4-byte length + data pointer from hp_make_key(). PAD SPACE semantics handled the same way as hp_rec_key_cmp().
      • hp_make_key() — for blob segments: stores 4-byte length + data pointer (not inline data) in the key buffer. VARCHAR segments are normalized to always use a 2-byte length prefix in the key buffer (regardless of the record's native 1-byte or 2-byte prefix), ensuring consistency with hp_hashnr() and hp_key_cmp() which always read 2 bytes. Zero-padded to seg->length for sanitizer cleanliness.
      • hp_hashnr() — pre-built key hashing: blob segments encoded as 4-byte length + data pointer.
      • hp_vartext_key_pack_size() — new helper for key advancement past VARCHAR/BLOB segments in the key buffer (2 bytes for VARCHAR, 4+sizeof(ptr) for BLOB). Used by both hp_hashnr() and hp_key_cmp() for null-key advancement.
      • NULL handling: key advancement accounts for blob segment size (4 + sizeof(uchar*)) vs varchar (2) via hp_vartext_key_pack_size().
      • Hash pre-check optimization: hp_search(), hp_search_next(), hp_delete_key(), and find_unique_row() now compare HASH_INFO::hash_of_key against the search key's hash before calling the full key comparison. When hashes differ the keys are guaranteed different, skipping expensive hp_materialize_one_blob() calls. This pattern already existed in hp_write_key() for duplicate detection but was missing from the read/delete paths. HP_INFO::last_hash_of_key is added so hp_search_next() can reuse the hash computed by hp_search() without recomputing it.
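The pre-check pattern can be sketched as follows (hypothetical structures; in the engine the cached hash lives in HASH_INFO::hash_of_key and the expensive step is hp_materialize_one_blob() plus key comparison):

```c
#include <assert.h>
#include <string.h>

/* Bucket entry caching the hash of its stored key. */
struct sk_hash_entry { unsigned long hash_of_key; const char *key; };

/* Look up `key` among candidate entries. Unequal hashes prove inequality,
 * so the expensive comparison runs only when the cached hash matches;
 * full_compares counts how often it actually ran. */
static const char *sk_find(const struct sk_hash_entry *entries, size_t n,
                           unsigned long search_hash, const char *key,
                           unsigned *full_compares)
{
  size_t i;
  for (i = 0; i < n; i++)
  {
    if (entries[i].hash_of_key != search_hash)
      continue;                         /* pre-check: skip the expensive compare */
    ++*full_compares;
    if (strcmp(entries[i].key, key) == 0)
      return entries[i].key;
  }
  return NULL;
}

/* Demo: three candidates, two sharing a hash; only those two are compared. */
static unsigned sk_precheck_demo(void)
{
  struct sk_hash_entry e[3] = { {1, "a"}, {2, "b"}, {2, "bb"} };
  unsigned full_compares = 0;
  const char *hit = sk_find(e, 3, 2, "bb", &full_compares);
  return (hit && strcmp(hit, "bb") == 0) ? full_compares : 0;
}
```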

      hp_rrnd.c, hp_rfirst.c, hp_rlast.c, hp_rnext.c, hp_rprev.c, hp_rkey.c, hp_rsame.c

      • hp_read_blobs() call added after memcpy(record, pos, reclength) in every read path, guarded by share->blob_count.

      hp_clear.c

      • total_records and block.last_allocated reset on TRUNCATE.

      hp_close.c

      • blob_buff freed on close.

      hp_extra.c

      • blob_buff freed on heap_reset() (prevents a 4GB LONGBLOB from permanently consuming 4GB of memory per handle).

      hp_static.c

      • hp_key_memory_HP_BLOB PSI memory instrumentation key added.

      _check.c

      • heap_check_heap() updated to iterate using total_records + deleted, count continuation records separately (batch-skipping inner records via run_rec_count), and validate records + cont_count == total_records.

      Handler Interface (ha_heap.h / ha_heap.cc)

      • HA_NO_BLOBS removed and HA_CAN_GEOMETRY added to ha_heap::table_flags(). GEOMETRY columns (which use Field_blob internally) are now supported in MEMORY tables.
      • heap_prepare_hp_create_info() populates HP_BLOB_DESC array from TABLE_SHARE::blob_fields, filtering non-Field_blob types (e.g. INFORMATION_SCHEMA Field_string with BLOB_FLAG). Both MYSQL_TYPE_BLOB and MYSQL_TYPE_GEOMETRY are included — Field_geom extends Field_blob but returns a different field_type(). Key segment setup uses field->key_part_flag() instead of key_part->key_part_flag (which may be uninitialized in SJ weedout paths). Blob key segments get seg->length=0 and key_part->length widened to max_data_length() for DISTINCT keys (guarded by pack_length_no_ptr() to avoid overflowing geometry GROUP BY keys). Precomputes has_blob_seg and needs_key_rebuild_from_group_buff flags.
      • remember_rnd_pos() / restart_rnd_next() implemented. Required by remove_dup_with_compare() for DISTINCT deduplication with blobs. Saves and restores both the row position and current_record counter (which heap_rrnd() does not update). restart_rnd_next() caps next_block at total_records + deleted to prevent heap_scan() from reading past the last allocated slot — without this cap, the scan boundary can exceed the allocated HP_BLOCK range when the saved position is in the last block segment, causing SIGSEGV.
      • find_unique_row() — argument order swapped for hp_rec_key_cmp() (input record first, stored record second) and hp_read_blobs() called after record copy.
      • index_read_map() and related index methods: when the index has blob key segments (HP_KEYDEF::has_blob_seg), the SQL layer's key buffer is rebuilt into HEAP's native format. This is needed because Field_blob::new_key_field() returns Field_varstring which produces 2-byte length + inline data format, while HEAP's hp_hashnr/hp_key_cmp expect hp_make_key format (4-byte length + data pointer). The GROUP BY path uses rebuild_key_from_group_buff() which parses the group buffer via SQL-layer KEY_PART_INFO; other paths (DISTINCT, materialization) use rebuild_blob_key_from_segments() which parses via HA_KEYSEG.
      • blob_descs memory freed in open() and create() after heap_create().

      Data Structures (include/heap.h / heapdef.h)

      • HP_BLOB_DESC struct: offset (byte offset within record buffer) + packlength (1–4).
      • HP_SHARE: added blob_descs, blob_count, total_records.
      • HP_KEYDEF: added has_blob_seg (precomputed flag for blob key segment detection), needs_key_rebuild_from_group_buff (GROUP BY key 0 with blob segments requires parsing the group buffer via rebuild_key_from_group_buff() instead of reading directly from record[0]).
      • HP_INFO: added blob_buff, blob_buff_len, has_zerocopy_blobs, last_hash_of_key (cached hash for hp_search_next() reuse).
      • HP_CREATE_INFO: added blob_descs, blob_count.
      • Constants: HP_ROW_ACTIVE, HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY, HP_CONT_HEADER_SIZE, HP_CONT_REC_COUNT_SIZE, HP_CONT_MIN_RUN_BYTES, HP_CONT_RUN_FRACTION_NUM/DEN.

      SQL Layer (sql/sql_select.cc)

      choose_engine() (temp table engine selection):

      • Removed share->blob_fields from the condition that forces disk-based engine.
      • Added share->reclength > HA_MAX_REC_LENGTH check instead — HEAP's fixed-width rows would waste too much memory for very wide records.

      finalize() (temp table creation):

      • HEAP tables with blobs still use fixed-width rows (blob data lives in separate continuation records, not inline). The first-byte guard (ensuring byte 0 is non-zero for the delete-link mechanism) is applied to HEAP tables regardless of blob presence.
      • Packed row format (use_packed_rows) is skipped for HEAP tables — HEAP requires fixed-width rows.
      • GROUP BY key setup: m_key_part_info->key_part_flag is now set from field->key_part_flag() instead of hardcoded 0 — without HA_BLOB_PART, HEAP never enters blob-specific key handling code paths.
      • GROUP BY on blob columns: Field_blob::key_length() returns 0 (blobs use prefix indexing) and Field_geom::key_length() returns packlength (4), both too small to hold actual data. The GROUP BY key setup now uses the item's max_length (capped to MAX_BLOB_WIDTH) instead, which matches the buffer size allocated by calc_group_buffer(). A group buffer overflow check is added so that blobs that don't fit fall back to m_using_unique_constraint. key_part_flag is set AFTER the overflow check so that a break-out retains the original varchar flag, preventing rebuild_key_from_group_buff() from being called on a varchar-format buffer.
      • store_length initialized for all GROUP BY key parts so rebuild_key_from_group_buff() can advance through the key buffer. Previously uninitialized because nothing read it before.
      • DISTINCT key setup: HEAP handles NULLs per-segment in its hash index, so the extra null-bits helper key part (used by MyISAM/Aria for unique blob constraints) is skipped for HEAP. share->uniques is still set so that HEAP-to-disk conversion can build MARIA_UNIQUEDEF. The null-bits helper field uses empty_clex_str instead of null_clex_str to prevent SIGSEGV in Explain_index_use::set() which calls strlen() on the field name.

      end_update() (GROUP BY update path):

      • After copy_funcs() overwrites record[0]'s blob pointers with new expression results, the GROUP BY group buffer values are restored into record[0] so hp_make_key() in ha_write_tmp_row() builds the correct key. Only needed when blob_fields > 0: non-blob fields write directly into record[0] slots.

      calc_group_buffer():

      • Fixed blob subtype detection: only checked MYSQL_TYPE_BLOB, missing TINY_BLOB, MEDIUM_BLOB, LONG_BLOB, and GEOMETRY. Fixed with is_any_blob_field_type() which includes all five types.
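The fixed predicate can be sketched as follows (local stand-in enum values chosen for illustration; the real check uses the server's enum_field_types):

```c
#include <assert.h>

/* Local stand-ins for the relevant server field types (values illustrative). */
enum sk_field_type {
  SK_TYPE_LONG        = 3,
  SK_TYPE_VARCHAR     = 15,
  SK_TYPE_TINY_BLOB   = 249,
  SK_TYPE_MEDIUM_BLOB = 250,
  SK_TYPE_LONG_BLOB   = 251,
  SK_TYPE_BLOB        = 252,
  SK_TYPE_GEOMETRY    = 255
};

/* All five blob-family types, not just SK_TYPE_BLOB. */
static int sk_is_any_blob_field_type(enum sk_field_type t)
{
  switch (t)
  {
  case SK_TYPE_TINY_BLOB:
  case SK_TYPE_MEDIUM_BLOB:
  case SK_TYPE_LONG_BLOB:
  case SK_TYPE_BLOB:
  case SK_TYPE_GEOMETRY:
    return 1;
  default:
    return 0;
  }
}
```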

      start() / pick_engine() (early engine prediction):

      • Extracted pick_engine() from choose_engine() and calls it early in start() to set m_heap_expected. When HEAP is predicted, the too_big_for_varchar() and group_length >= MAX_BLOB_WIDTH guards that would force m_using_unique_constraint are skipped — HEAP natively supports blob keys via its hash index, so these bail-outs are unnecessary and would prevent correct GROUP BY/DISTINCT on blob columns.

      remove_duplicates() (DISTINCT deduplication):

      • remove_dup_with_hash_index() cannot handle blobs (copies field data via make_sort_key() + memcmp, which compares pointer values for blob fields). The blob check is moved before the HEAP engine check so HEAP+blob tables fall through to remove_dup_with_compare().

      LIMIT adjustment with fanout:

      • Fixed undefined behavior when select_limit/fanout overflows. With HEAP blob support, the optimizer's temp table engine choice changes, altering cost estimates and sometimes producing near-zero fanout values (from tiny cond_selectivity). Dividing select_limit by such a fanout produces infinity or a value exceeding HA_POS_ERROR; casting that to ha_rows (unsigned long long) is undefined behavior per the C++ standard. Added an explicit cap: if select_limit/fanout >= (double) HA_POS_ERROR, set select_limit = HA_POS_ERROR.
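The cap can be sketched as follows (sk_ha_rows and SK_HA_POS_ERROR are local stand-ins for the server's ha_rows typedef and HA_POS_ERROR constant):

```c
#include <assert.h>

typedef unsigned long long sk_ha_rows;
#define SK_HA_POS_ERROR (~(sk_ha_rows) 0)

static sk_ha_rows sk_adjust_limit(sk_ha_rows select_limit, double fanout)
{
  /* Casting a double >= 2^64 (or infinity) to unsigned long long is UB,
     so compare in double space first and clamp to HA_POS_ERROR. */
  if ((double) select_limit / fanout >= (double) SK_HA_POS_ERROR)
    return SK_HA_POS_ERROR;
  return (sk_ha_rows) ((double) select_limit / fanout);
}
```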

      SQL Layer (sql/item_sum.cc)

      Aggregator_distinct::setup():

      • The Unique tree path (which compares raw record bytes via simple_raw_key_cmp / composite_key_cmp) is skipped for HEAP tables with blobs, since raw comparison would compare pointer values rather than actual blob data. HEAP+blob tables use the ha_write_tmp_row() path instead.

      Aggregator_distinct::add():

      • Added create_internal_tmp_table_from_heap() conversion when ha_write_tmp_row() fails with a non-duplicate error (e.g. HA_ERR_RECORD_FILE_FULL). Previously, only the Unique tree path handled HEAP overflow conversion. With blobs routed through ha_write_tmp_row(), the overflow-to-disk conversion must happen here too.

      SQL Layer — Expression Cache (sql/sql_expression_cache.cc)

      Expression_cache_tmptable::init():

      • Disables the expression cache when the HEAP temp table has blob fields. HEAP hash indexes on blob columns use a pointer-based key format (4-byte length + data pointer from hp_make_key()) that is incompatible with the SQL layer's key format (2-byte length + inline data from Field_blob::new_key_field() which returns Field_varstring). This restores the pre-blob-support behavior: previously blob items forced conversion to Aria, which failed the db_type() == heap_hton check and disabled the cache anyway. No performance impact.
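The format mismatch can be illustrated with hypothetical packers (illustrative only, not the server's actual key-image code) for the two layouts named above:

```c
#include <stdint.h>
#include <string.h>

/* HEAP hash-index style (hp_make_key()): a 4-byte length followed by
   the data *pointer* embedded in the key image. */
static size_t pack_heap_blob_segment(uint8_t *out, const void *data,
                                     uint32_t length)
{
  memcpy(out, &length, 4);                 /* 4-byte length           */
  memcpy(out + 4, &data, sizeof(void *));  /* pointer, not the data   */
  return 4 + sizeof(void *);
}

/* SQL-layer style (Field_blob::new_key_field() -> Field_varstring):
   a 2-byte length followed by the data itself, inline. */
static size_t pack_varstring_key(uint8_t *out, const void *data,
                                 uint16_t length)
{
  memcpy(out, &length, 2);                 /* 2-byte length           */
  memcpy(out + 2, data, length);           /* inline data             */
  return 2 + (size_t) length;
}
```

The two images differ in both length and content for the same blob value, so neither side can probe the other's keys; disabling the cache is the only safe option.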

      SQL Layer — FULLTEXT Early Detection (sql/sql_derived.cc)

      mysql_derived_prepare():

      • Forces a disk-based tmp engine when the outer query uses FULLTEXT functions. Before the derived table's result table is created, checks derived->select_lex->ftfunc_list->elements and sets TMP_TABLE_FORCE_MYISAM in the create options if FULLTEXT is present. This ensures the table is created with Aria/MyISAM from the start, avoiding the need for a handler swap later. Follows the same pattern as st_select_lex_unit::prepare() in sql_union.cc for UNION tmp tables. Item_func_match::fix_fields() retains only a safety-net error (ER_TABLE_CANT_HANDLE_FT) for engines that lack FULLTEXT support.

      Performance Impact

      Non-Blob Tables: Zero Regression

      The most important design goal was no measurable impact on existing workloads. For tables without BLOB/TEXT columns:

      • No new allocations: blob_buff, blob_descs, and continuation chains are never allocated. blob_count is 0, total_records tracks records identically, and blob_buff is never touched.
      • No new code in hot paths: Every blob-specific code path is guarded by if (share->blob_count). This is a single branch on a struct member that lives in the same cache line as other frequently accessed fields (reclength, visible, records). The branch predictor learns always-false within a few iterations — the branch itself is effectively free.
      • No changes to row format: The flags byte at offset visible continues to be written as 1 (active) or 0 (deleted) exactly as before. The new flag bits (HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY) are never set.
      • No changes to hash functions for non-blob keys: The new HA_BLOB_PART branches in hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), and hp_make_key() are only entered for key segments with the HA_BLOB_PART flag set — which never happens for non-blob tables.
      • No change to next_free_record_pos() semantics: The switch from share->records to block.last_allocated for position arithmetic is equivalent when there are no continuation records (both values track the same quantity).

      In summary, for non-blob tables, the entire feature reduces to a handful of never-taken branches per row operation. There is no additional memory usage, no additional function calls, and no change to the data layout.

      Blob Tables: Performance Characteristics

      For tables with BLOB/TEXT columns, the primary benefit is avoiding disk I/O entirely. The tradeoff:

      • Write path: Each blob column requires allocating continuation records and copying blob data into them. The allocation is amortized (HP_BLOCK allocates leaf blocks of ~1000 records at a time), so the per-row overhead is dominated by the memcpy of blob data — which is unavoidable regardless of storage approach. For the common case of small blobs (JSON documents, short TEXT values under ~150 KB), this is a single contiguous run with one memcpy. Single-run blobs from tail allocation use the zero-copy layout (Cases A/B), enabling the read path to avoid copying entirely.
      • Read path: For single-run blobs (Cases A/B — the common case in temporary tables), the read path returns a pointer directly into HP_BLOCK with no memcpy at all. For multi-run blobs (Case C), data is reassembled from continuation chains into blob_buff via memcpy — the same pattern as InnoDB's blob_heap. The reassembly buffer is allocated once and reused for every row. The reassembly cost is proportional to blob data size, not to the number of runs (runs within a leaf block are contiguous, so the memcpy loop processes large chunks).
      • Hash operations: When DISTINCT or UNION deduplication requires hashing/comparing blob key segments, the stored blob must be materialized from its continuation chain before comparison. For zero-copy blobs (Cases A/B), materialization returns a direct pointer with no copy. For Case C, one reassembly is needed per hash probe for the stored side. The input side (the record being inserted) always has a valid direct pointer and requires no materialization. In practice, hash collisions are rare, so most insertions require only hashing (no comparison).
      • Memory: For Case C blobs, data exists in continuation records (persistent) plus the reassembly buffer (transient, per-handle, freed on heap_reset()). This is a 2× peak for the currently-read row's blob data — identical to InnoDB's approach. For Cases A/B (zero-copy), no reassembly buffer is needed — 1× memory only. The reassembly buffer is not allocated for non-blob tables, and is freed between statement executions to avoid holding large buffers across queries.
      • Free list reuse: When blob rows are deleted and reinserted, the peek-then-unlink algorithm walks the free list read-only, detecting contiguous groups of adjacent records (descending address order only — matching LIFO push order). On the first non-qualifying (too short) group, the scan stops entirely and falls through to tail allocation — no wasted time scanning a fragmented list. Only qualifying groups are actually unlinked. Since deleted records are pushed consecutively (LIFO), they naturally appear as contiguous sequences, so recovery has high probability. Individual runs are capped at 65,535 records (uint16 format limit); larger groups are split into multiple runs at this boundary.
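The read-only peek phase described in the last point can be sketched as follows (hypothetical names; the real free list is an intrusive chain of record positions, not an array):

```c
#include <stddef.h>
#include <stdint.h>

/* Deleted records are pushed onto the free list LIFO, so a contiguous
   group of adjacent records appears as a run of descending addresses,
   each entry exactly one record length below the previous one.  The
   peek counts the leading group without unlinking anything; the caller
   unlinks the group only if it is long enough, otherwise it stops and
   falls through to tail allocation. */
static size_t peek_leading_group(const uintptr_t *free_list, size_t n_free,
                                 size_t chunk_length)
{
  size_t group = n_free ? 1 : 0;
  for (size_t i = 1; i < n_free; i++)
  {
    if (free_list[i - 1] - free_list[i] != chunk_length)
      break;                        /* not adjacent: group ends here */
    group++;
  }
  return group;
}
```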

      Net Effect on Temporary Table Workloads

      For the target use case — implicit/internal temporary tables created by GROUP BY, DISTINCT, UNION, and subqueries — the expected net effect is a significant performance improvement whenever blob-typed columns are present. Instead of:

      SQL layer -> Aria temp table -> page cache management -> (eviction to OS under memory pressure)
      

      the data path becomes:

      SQL layer -> HEAP temp table -> memcpy into HP_BLOCK -> zero-copy pointer out (Cases A/B)
                                                           -> memcpy out of HP_BLOCK (Case C)
      

      This eliminates file creation/deletion overhead and page cache management. For the common case of small-to-medium blobs fitting in a single run, the read path is entirely zero-copy — no blob_buff allocation, no memcpy. The max_heap_table_size / tmp_memory_table_size limits continue to act as a safety valve: when a temporary table with blobs exceeds the memory limit, the existing overflow-to-disk conversion (create_internal_tmp_table_from_heap()) triggers transparently, falling back to Aria exactly as it does today for non-blob tables that outgrow memory.


      Limitations

      1. BTREE indexes on blob columns are not supported — BTREE key packing (hp_rb_pack_key / hp_rb_var_key_length) is not updated for blob segments. HASH indexes on blob columns work correctly. In practice this is not a limitation for internal temporary tables, which exclusively use HASH indexes for deduplication.
      2. No partial-key prefix indexing for blobs — the entire blob value is hashed/compared. This matches the behavior of other engines for HASH indexes and is correct for DISTINCT/UNION deduplication.
      3. 2x memory for blob data during reads (Case C only) — for multi-run blobs, data exists both in continuation records and in the reassembly buffer simultaneously. The reassembly buffer is freed on heap_reset() / handler close. For single-run blobs (Cases A/B), zero-copy pointers are used and no reassembly buffer is needed — 1x memory only.
      4. No compression — blob data is stored uncompressed in continuation records. For the target use case (small JSON documents, short TEXT values in temporary tables), this is acceptable.
      5. 65,535 records per run — run_rec_count is stored as uint16, limiting a single multi-record continuation run to 65,535 records (Case A single-record runs have no run_rec_count field and are always exactly 1 record). Blobs requiring more are automatically split into multiple runs at this boundary. For the smallest possible recbuffer (16 bytes, visible=15), one run holds up to 5 + 65534 × 16 ≈ 1,024 KB of payload. Larger blobs use multiple runs with negligible additional overhead (10 bytes per extra run header).
      6. max_heap_table_size enforcement — continuation records consume the same space as regular rows and are counted toward data_length. The existing max_heap_table_size / tmp_memory_table_size limits apply naturally. When a blob insert would exceed the limit, the row is rolled back and the error propagated — the SQL layer's overflow-to-disk conversion (create_internal_tmp_table_from_heap()) handles this transparently.
      7. Expression cache disabled for blob fields — HEAP hash indexes on blob columns use a pointer-based key format (4-byte length + data pointer) that is incompatible with the SQL layer's key format used by Expression_cache_tmptable (2-byte length + inline data from Field_blob::new_key_field()). The cache is disabled when blob fields are present, restoring the pre-blob-support behavior (previously blobs forced Aria, which also disabled the cache).
      8. FULLTEXT forces disk engine — HEAP does not support FULLTEXT indexes. When a derived table is used in a query with MATCH...AGAINST, the tmp table is created with a disk-based engine (Aria/MyISAM) from the start via early detection in mysql_derived_prepare().
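The run-capacity arithmetic from limitation 5 can be sketched as follows; the 11-byte first-record overhead is inferred from the "5 + 65534 × 16" figure quoted above and is an assumption, not the documented record layout:

```c
#include <stdint.h>

#define MAX_RUN_RECORDS 65535u   /* run_rec_count is a uint16 */

/* Assumed header overhead in the first record of a multi-record run,
   chosen so that a 16-byte recbuffer leaves 5 payload bytes in that
   record, matching the figure quoted in limitation 5. */
#define RUN_HEADER_BYTES 11u

/* Maximum payload a single continuation run can carry. */
static uint64_t max_run_payload(uint32_t recbuffer)
{
  return (uint64_t)(recbuffer - RUN_HEADER_BYTES)
       + (uint64_t)(MAX_RUN_RECORDS - 1) * recbuffer;
}

/* Number of runs a blob of blob_len bytes needs: blobs larger than
   one run split at the 65,535-record boundary (ceiling division). */
static uint64_t runs_needed(uint64_t blob_len, uint32_t recbuffer)
{
  uint64_t per_run = max_run_payload(recbuffer);
  return (blob_len + per_run - 1) / per_run;
}
```

For the minimum 16-byte recbuffer this gives 5 + 65534 × 16 = 1,048,549 bytes per run, so even a 1 MB blob fits in a single run and larger blobs add only one run per additional megabyte.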

      Test Changes

      New Tests

      • heap/heap_blob.test: Comprehensive test suite for direct HEAP blob operations — basic CRUD, multiple blob types (TINYBLOB through LONGBLOB), NULL/empty values, large blobs spanning multiple continuation runs, free list reuse after DELETE, free list fragmentation with NULL-blob/large-blob interleaving, mixed NULL/non-NULL multi-column blob scavenging, large blobs exceeding the uint16 run_rec_count cap (1MB blobs requiring ~70K records, forcing automatic splitting at the 65,535 boundary), TRUNCATE, full table scan correctness, HASH and BTREE indexes with blob columns present, REPLACE, INSERT...SELECT, table-full errors, UPDATE failure rollback, non-blob regression, and zero-copy edge cases (Case A: wide table with small blobs, Case B: medium blobs in narrow/wide tables with delete+reinsert, Case B->C boundary: blobs near leaf block size).
      • heap/blob_dedup.test: Tests DISTINCT and UNION deduplication with HEAP blob tables — verifies that hash-based deduplication correctly compares blob data (not pointer values).
      • main/blob_sj_test.test: Regression test for the SJ weedout HA_BLOB_PART crash — exercises subquery materialization with blob columns to verify that spurious HA_BLOB_PART flags from uninitialized key_part_flag are correctly stripped in hp_create.c.
      • heap/heap_blob_groupby.test: Tests GROUP BY on blob columns in HEAP tables — verifies that key_part_flag is set correctly and that blob columns produce correct grouping results.
      • heap/heap_blob_ops.test: Tests query patterns from benchmark coverage gaps — COUNT(DISTINCT), IN-subquery, GROUP BY ROLLUP, window functions (ROW_NUMBER, RANK), CTE materialization, CTE self-join, and CTE referenced twice.
      • heap/heap_geometry.test: Tests GEOMETRY columns in MEMORY tables. Verifies the engine is MEMORY (not silently converted to Aria), inserts 7 distinct LineString geometries, then performs 11 rounds of INSERT INTO t1 SELECT * FROM t1 doublings (7→14,336 rows). After each doubling, validates zero corrupt rows (ST_AsText(c2) IS NOT NULL for all rows). Final verification: GROUP BY confirms all 7 geometry values present with correct counts, MBRWithin() spatial predicate validates geometry data integrity. This test exercises the Case A zero-copy detection fix in hp_read_blobs() / hp_materialize_one_blob() — without the fix, multi-run blobs whose first run has run_rec_count==1 (from free-list fragmentation during INSERT...SELECT) are incorrectly treated as single-record zero-copy, causing the read to overflow past the 7-byte rec 0 payload into adjacent records' flags bytes, corrupting the WKB geometry type field.

      Modified Tests

      Several existing tests required adjustments because HEAP's row ordering differs from Aria's, and because I_S temp tables now use MEMORY instead of Aria:

      • main/select.test: Added ORDER BY fld3 to a GROUP BY ... LIMIT query. Without an explicit ORDER BY, LIMIT selects a non-deterministic window that depends on the temp table engine's physical row ordering.
      • main/cte_recursive.test: Added --sorted_result before a recursive CTE query where rows with identical (level, mid) values (due to integer overflow) have non-deterministic order.
      • main/distinct.test: Added --replace_regex to mask a ROLLUP row's b value — the value depends on which group is processed last, which varies by temp table engine.
      • main/group_by.test: Changed Created_tmp_disk_tables expected value from 1 to 0 — a query with (int, decimal, blob) columns now uses HEAP instead of Aria for the temp table.
      • main/intersect_all.test: Added --sorted_result before UNION ALL queries without ORDER BY whose row ordering depends on the temp table engine.
      • main/information_schema_part.test: Added subpartition_name to ORDER BY — within the same partition, subpartition ordering was non-deterministic.
      • main/temp_table_symlink.test: Removed the I_S query that tested disk temp file symlink conflicts — with HEAP blob support, I_S temp tables stay in memory and create no disk files, so the symlink attack vector does not apply.
      • main/information_schema.result, main/information_schema_parameters.result, main/information_schema_routines.result: Updated expected engine name from Aria to MEMORY for I_S system views that contain TEXT/BLOB columns.
      • main/derived_view.result: Query plan improvement — HEAP temp table enables ref access via hash index instead of block nested loop join.
      • main/select_pkeycache.result, main/select_jcl6.result: Re-recorded for row ordering differences.
      • main/create.test: Removed --error ER_TABLE_CANT_HANDLE_BLOB — HEAP now accepts blob columns. Kept as a positive test.
      • sys_vars/tmp_disk_table_size_basic.test: Removed --error ER_RECORD_FILE_FULL on I_S views query — with HEAP blob support, the I_S temp table stays in memory and is not affected by tmp_disk_table_size. Added --sorted_result before SELECT table_name FROM INFORMATION_SCHEMA.views — row ordering is non-deterministic without ORDER BY and changed with HEAP temp tables.
      • funcs_1/is_events.result: Updated ENGINE=MEMORY in SHOW CREATE for I_S events table.
      • innodb_fts/innodb-fts-ddl.test: Added FTS_DOC_ID tiebreaker to ORDER BY for deterministic ordering of rows with identical FULLTEXT scores.
      • main/group_min_max_innodb.test: Added --replace_column 2 # to mask non-deterministic blob value in GROUP BY query — the serialized_c column is not aggregated, so which row is picked for voter_id=4 (two rows with different blob values) depends on internal access order.
      • funcs_1/is_tables_is.result, funcs_1/is_tables_is_embedded.result: Updated ENGINE from MYISAM_OR_MARIA to MEMORY and ROW_FORMAT from DYNAMIC_OR_PAGE to Fixed for I_S tables that now stay in MEMORY.
      • funcs_1/is_routines_embedded.result, funcs_1/is_triggers_embedded.result, funcs_1/is_views_embedded.result: Updated ENGINE=MEMORY in SHOW CREATE for embedded server variants of I_S tests. is_routines_embedded also updated for the mtr_check.sql stored procedure body change (ORDER BY columns_in_mysql, ordinal_position).
      • funcs_1/is_columns.result, funcs_1/is_routines.result, funcs_1/is_triggers.result, funcs_1/is_views.result: Updated ENGINE=MEMORY in SHOW CREATE for I_S system views.
      • funcs_1/processlist_priv_ps.result, funcs_1/processlist_val_ps.result: Updated ENGINE=MEMORY in SHOW CREATE for processlist I_S tables (ps-protocol variants).
      • sysschema/v_schema_redundant_indexes.result: Re-recorded for non-deterministic row ordering.
      • innodb_fts/misc.test: Added id column as tiebreaker to ORDER BY MATCH...AGAINST queries — rows with equal FTS relevance scores have non-deterministic ordering that changed when intermediate temp tables use HEAP instead of Aria.
      • funcs_1/processlist_priv_no_prot.result, funcs_1/processlist_val_no_prot.result: Updated ENGINE=MEMORY in SHOW CREATE for processlist I_S tables (no-protocol variants).
      • include/mtr_check.sql: Added ordinal_position tiebreaker to the INFORMATION_SCHEMA.COLUMNS query's ORDER BY clause — the query ordered by columns_in_mysql (table_schema.table_name) only, so columns within the same table had non-deterministic ordering that changed when I_S temp tables use HEAP instead of Aria. This change also required updating result files that embed the stored procedure body text: funcs_1/is_routines_embedded.result and plugins/sql_error_log_withdbinfo.result.
      • perfschema/include/transaction_nested_events_verifier.inc: Added r_end_event_id tiebreaker to the UNION query's ORDER BY clause — the combined transaction+statement history query ordered by thread_id, r_event_id, but rows with the same event_id and different end_event_id values (e.g. an in-progress transaction and its completed counterpart) had non-deterministic ordering that changed when intermediate temp tables use HEAP instead of Aria.
      • perfschema/transaction_nested_events.result: Re-recorded to match the stable ordering from the verifier fix above.
      • main/partition_explicit_prune.result: Updated for HANDLER_READ_RND_DELETED counter changes from HEAP blob continuation record scanning.

      Attachments

        1. my_dynmem.c
          25 kB
          Michael Widenius
        2. Screenshot_20260304_213449.png
          85 kB
          Arcadiy Ivanov
        3. Screenshot_20260318_003255.png
          84 kB
          Arcadiy Ivanov

            People

              gkodinov Georgi Kodinov
              arcivanov Arcadiy Ivanov
