MariaDB Server / MDEV-38975

BLOBs in MEMORY (HEAP)


Details

    Description

      HEAP Engine: BLOB/TEXT/JSON/GEOMETRY Column Support

      Motivation

      The growing adoption of JSON columns (MYSQL_TYPE_BLOB-based), Dynamic Columns (COLUMN_CREATE/COLUMN_GET), and GEOMETRY types means that an increasing number of SQL operations — GROUP BY, DISTINCT, UNION, subqueries, CTEs — materialize their intermediate results into temporary tables that contain BLOB/TEXT fields.

      Today, the HEAP (MEMORY) engine unconditionally rejects BLOB/TEXT columns (the HA_NO_BLOBS flag in ha_heap::table_flags()). This forces the optimizer to use Aria (or MyISAM) temporary tables whenever any blob-typed field appears in the result set — even when the actual data is small (a 50-byte JSON document, a short TEXT value, etc.). The overhead of Aria internal temporary tables includes creating and deleting a temporary file on disk (.MAI + .MAD) and managing data through the Aria page cache. Note that Aria temp tables are non-transactional (ci->transactional= FALSE) — there is no transaction log overhead and no fsync(). Under memory pressure, dirty pages may be evicted to disk via pwrite() (a plain write to the OS page cache, not a physical sync).

      This situation will only get worse as internally structured columns (JSON, Dynamic Columns, VECTOR, spatial types) gain popularity. Queries over tables with a single JSON column that would otherwise fit entirely in memory are forced to disk simply because the storage engine cannot represent the blob descriptor.

      This feature removes the HA_NO_BLOBS restriction from the HEAP engine, allowing the optimizer to keep temporary tables with BLOB/TEXT columns in memory when they fit within max_heap_table_size / tmp_memory_table_size limits.

      Additionally, the MEMORY engine now advertises HA_CAN_GEOMETRY, enabling explicit CREATE TABLE ... ENGINE=MEMORY with GEOMETRY/POINT/LINESTRING/POLYGON columns. Previously, GEOMETRY columns were rejected by the MEMORY engine because they are internally represented as Field_blob. With blob support in place, GEOMETRY columns work naturally — spatial data is stored in the same continuation record chains as any other BLOB/TEXT data. All spatial functions (ST_AsText(), MBRWithin(), etc.) work correctly with MEMORY-stored geometry values.


      Architecture

      Existing HEAP Row Format

      HEAP stores rows as fixed-length records in a hierarchical block tree (HP_BLOCK). Each row occupies exactly recbuffer bytes, computed as MY_ALIGN(MAX(reclength, sizeof(char*)) + 1, sizeof(char*)) — i.e. the record length (minimum sizeof(char*)) plus a flags byte, aligned up to pointer size (8 bytes on 64-bit). share->visible equals MAX(reclength, sizeof(char*)) — the byte offset of the flags byte. A single flags byte at offset share->visible marks the row as active or deleted. Deleted rows store a free-list pointer in their first sizeof(uchar*) bytes, forming a singly-linked free list through share->del_link.
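
      The sizing arithmetic above can be sketched in C (a minimal model of the description, not the engine source; MY_ALIGN mirrors the server macro, and the figures assume a 64-bit build where sizeof(char*) == 8):

```c
#include <stddef.h>

/* Round a up to the next multiple of n (mirrors the MY_ALIGN macro). */
#define MY_ALIGN(a, n) ((((a) + (n) - 1) / (n)) * (n))

/* Byte offset of the flags byte: the record payload, widened to at
   least pointer size so a deleted row can hold a free-list pointer. */
static size_t hp_visible_offset(size_t reclength)
{
    return reclength > sizeof(char *) ? reclength : sizeof(char *);
}

/* Full slot size: payload + 1 flags byte, aligned up to pointer size. */
static size_t hp_recbuffer(size_t reclength)
{
    return MY_ALIGN(hp_visible_offset(reclength) + 1, sizeof(char *));
}
```

      With reclength = 15 this gives visible = 15 and recbuffer = 16; with reclength = 500, visible = 500 and recbuffer = 504.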

      BLOB Representation in the Record Buffer

      In the MySQL/MariaDB record format, BLOB fields store only a descriptor in the fixed-width record buffer: a packlength-byte length prefix (1–4 bytes depending on BLOB type) followed by an 8-byte data pointer. The actual BLOB data lives in memory owned by the SQL layer (Field_blob::value). This descriptor is only 9–12 bytes regardless of BLOB data size, so it fits naturally in HEAP's fixed-width rows.
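
      As a hedged sketch of reading such a descriptor (the helper names here are illustrative, not the server's; the length prefix is read little-endian, as the uint2korr-family macros and Field_blob::get_length() do):

```c
#include <stdint.h>
#include <string.h>

/* Read the packlength-byte little-endian length prefix of a blob
   descriptor (illustrative helper, not a server function). */
static uint32_t blob_desc_length(const unsigned char *desc, unsigned packlength)
{
    uint32_t len = 0;
    for (unsigned i = 0; i < packlength; i++)
        len |= (uint32_t) desc[i] << (8 * i);
    return len;
}

/* The 8-byte data pointer immediately follows the length prefix. */
static unsigned char *blob_desc_data(const unsigned char *desc, unsigned packlength)
{
    unsigned char *ptr;
    memcpy(&ptr, desc + packlength, sizeof(ptr));
    return ptr;
}
```

      A TINYBLOB descriptor (packlength 1) is thus 9 bytes on a 64-bit build and a LONGBLOB descriptor (packlength 4) is 12, matching the 9–12 byte range above.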

      The challenge is: where does the actual BLOB data live once the row is written into the HEAP engine?

      Continuation Record Chains

      BLOB data is stored using continuation records — additional fixed-size records allocated from the same HP_BLOCK that holds regular rows. This approach was chosen because it:

      • Reuses existing HP_BLOCK allocation, free list, and size accounting with minimal structural change
      • Ensures data_length tracking and max_table_size enforcement work automatically (continuation records consume regular row records)
      • Avoids per-BLOB my_malloc() calls — piggybacks on the amortized block allocation (~1 malloc per records_in_block rows)

      Flags Byte

      The existing visibility byte is extended into a flags byte:

      Bit  Name                  Meaning
      0    HP_ROW_ACTIVE         Record is active (not deleted)
      1    HP_ROW_HAS_CONT       Primary record has blob continuation chain(s)
      2    HP_ROW_IS_CONT        This record IS a continuation record
      3    HP_ROW_CONT_ZEROCOPY  Zero-copy layout: data in rec 1..N-1 only (no data in rec 0 payload)

      Variable-Length Continuation Runs

      To minimize per-record overhead, continuation records are grouped into runs — contiguous sequences of recbuffer-sized records within the same leaf block. Only the first record of each run carries a header; subsequent records are pure payload with no flags byte. Runs are linked via next_cont pointers.

      First record of a run (rec 0):

      byte 0                                                         visible   recbuffer
        |                                                               |         |
        v                                                               v         v
        +--------------+---------------+--------------------------------+---------+
        | next_cont    | run_rec_count | BLOB data payload              | flags   |
        | (8 bytes)    | (2 bytes)     | (visible - 10 bytes)           | byte    |
        +--------------+---------------+--------------------------------+---------+
      

      Inner records (rec 1..N-1) — no flags byte:

      byte 0                                                                    recbuffer
        |                                                                         |
        v                                                                         v
        +-------------------------------------------------------------------------+
        | BLOB data payload (recbuffer bytes, full width)                         |
        +-------------------------------------------------------------------------+
      

      • next_cont (8 bytes): pointer to the next run, or NULL if last. Shares position with del_link — when a record is active it's next_cont; when deleted it's del_link. No conflict.
      • run_rec_count (uint16, 2 bytes): number of consecutive records in this run. Maximum 65,535 records per run — blobs requiring more are split into multiple runs automatically.
      • Payload per run of N records: (visible - 10) + (N - 1) × recbuffer bytes.

      This design amortizes the 10-byte per-run overhead across many records, achieving near-100% space efficiency:

      recbuffer  visible  1-record payload   10-record payload                100-record payload
      16         15       5 bytes (31%)      5 + 9×16 = 149 bytes (93%)       5 + 99×16 = 1,589 bytes (99%)
      504        500      490 bytes (97%)    490 + 9×504 = 5,026 bytes (99%)  490 + 99×504 = 50,386 bytes (99.9%)
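
      These payload figures all follow from one formula; a minimal sketch (the 10-byte constant corresponds to HP_CONT_HEADER_SIZE: the 8-byte next_cont plus the 2-byte run_rec_count):

```c
#include <stddef.h>

#define HP_CONT_HEADER_SIZE 10  /* next_cont (8) + run_rec_count (2) */

/* Usable blob payload of a run of n continuation records: rec 0 keeps
   (visible - header) bytes, inner records each carry a full recbuffer. */
static size_t run_payload(size_t visible, size_t recbuffer, size_t n)
{
    return (visible - HP_CONT_HEADER_SIZE) + (n - 1) * recbuffer;
}
```

      A run of 10 narrow records (visible = 15, recbuffer = 16) thus carries 149 payload bytes out of the 160 bytes allocated.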

      Zero-Copy Read Optimization

      For single-run blobs (the common case for temporary table workloads), the read path avoids copying blob data into the reassembly buffer entirely by returning a pointer directly into the HP_BLOCK continuation records. Three cases are distinguished:

      • Case A (run_rec_count == 1): Data fits in rec 0's payload (after the 10-byte header). Return chain + HP_CONT_HEADER_SIZE — zero-copy.
      • Case B (HP_ROW_CONT_ZEROCOPY flag set): Single run, multiple records. Data is written to rec 1..N-1 only (rec 0 carries only the header). Inner records have no flags byte, so data is contiguous across recbuffer boundaries. Return chain + recbuffer — zero-copy.
      • Case C (neither A nor B): Multiple runs or non-zerocopy layout. Walk the chain, copy into blob_buff — reassembly required.

      The HP_ROW_CONT_ZEROCOPY flag on rec 0's flags byte distinguishes Case B from Case C. Without it, a free-list single-run blob (which uses Case C layout because it was allocated before we knew it would be the only run) would be indistinguishable from a tail-allocated single-run blob (Case B). The flag is set during write when: (1) the blob fits in a single run, and (2) the run is allocated entirely from the block tail (not the free list).

      For Case B, rec 0's data payload area is unused (wasted) — all data goes into rec 1..N-1. This trade-off wastes at most visible - 10 bytes (typically 5–490 bytes) but enables zero-copy reads for the entire blob without any memcpy into blob_buff.
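
      Assuming the flag names from this description (the bit values and the check order are assumptions of the sketch, not taken from the source), the read-path dispatch looks roughly like:

```c
/* Flags byte bits (values assumed for the sketch). */
#define HP_ROW_ACTIVE        1
#define HP_ROW_HAS_CONT      2
#define HP_ROW_IS_CONT       4
#define HP_ROW_CONT_ZEROCOPY 8

enum blob_read_case { CASE_A, CASE_B, CASE_C };

/* Classify a chain head: A and B yield a pointer straight into
   HP_BLOCK; C requires reassembly into blob_buff. */
static enum blob_read_case classify_chain(unsigned flags,
                                          unsigned run_rec_count)
{
    if (run_rec_count == 1)
        return CASE_A;            /* data fits after the 10-byte header */
    if (flags & HP_ROW_CONT_ZEROCOPY)
        return CASE_B;            /* single tail run, data in rec 1..N-1 */
    return CASE_C;                /* multi-run or free-list layout: copy */
}
```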

      Run Allocation Strategy

      Runs are allocated in two phases:

      1. Free list peek-then-unlink: Walk the free list read-only (following next pointers without unlinking), tracking contiguous groups of adjacent records. Only descending-address adjacency is checked — hp_free_run_chain() frees records in ascending order, so LIFO pushes create descending sequences on the free list. On discontinuity: if the accumulated group qualifies (>= min_run_records, derived from MAX(blob_length / 10, 128) / recbuffer), unlink exactly those records and use them for a run, then continue scanning for more data. If the group does not qualify, the free list is fragmented — stop immediately and fall through to tail allocation. The free list is only modified when a qualifying contiguous group is actually found. Since run_rec_count is uint16, individual runs are capped at 65,535 records; larger contiguous groups are automatically split into multiple runs at this boundary.
      2. Tail allocation: Remaining data is allocated from the block tail (hp_alloc_from_tail()), which is always contiguous within a leaf block. A new run starts at each block boundary. Tail-allocated single-run blobs use the zero-copy layout (Case B) when the data fits; otherwise they fall back to Case C.

      This strategy exploits the fact that when a blob row is deleted, hp_free_blobs() pushes continuation records to the free list consecutively. When re-allocated, they naturally appear as contiguous sequences in the free list (LIFO order). Fragmentation only occurs when non-blob inserts consume some freed records in between.
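
      The qualifying threshold for a free-list group can be sketched as follows (a literal reading of the MAX(blob_length / 10, 128) / recbuffer formula above; the exact rounding, and how it maps onto HP_CONT_MIN_RUN_BYTES and HP_CONT_RUN_FRACTION_NUM/DEN, is an assumption):

```c
#include <stddef.h>

/* A free-list group must cover at least a tenth of the blob, with a
   128-byte floor, to be worth unlinking as a run; shorter groups mean
   the list is fragmented and tail allocation is used instead. */
static size_t min_run_records(size_t blob_length, size_t recbuffer)
{
    size_t min_bytes = blob_length / 10;
    if (min_bytes < 128)
        min_bytes = 128;
    return min_bytes / recbuffer;
}
```

      For example, a 1 MB blob at recbuffer = 16 would skip any contiguous group shorter than 6,250 records and fall through to tail allocation.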

      Multiple BLOB Columns

      Each BLOB column has its own independent continuation chain. The primary row stores the standard (packlength, data_pointer) descriptor for each blob field. On write, the data pointer is overwritten in the stored row to point to the first continuation run. On read, each chain is walked independently. Each blob column is independently classified as Case A, B, or C — some may use zero-copy while others require reassembly into blob_buff.

      Record Counting

      A new counter HP_SHARE::total_records tracks all physical records (primary + continuation). HP_SHARE::records remains the logical (primary-only) count. This separation is critical because HEAP's linear hash uses records for bucket mapping — inflating it with continuation records would corrupt the hash table.

      Counter               Meaning                                        Used by
      share->records        Logical (primary) record count                 Hash bucket mapping, blength, statistics, optimizer
      share->total_records  All physical records (primary + continuation)  Block position arithmetic, scan boundary, heap_check_heap()
      share->deleted        Deleted records on free list (both types)      Free list, both counters

      Invariant: total_records = records + continuation_count. For non-blob tables, total_records == records always.

      Reassembly Buffer

      On read, blob data that requires reassembly (Case C) is copied from continuation chains into a per-handle buffer (HP_INFO::blob_buff). This follows the same pattern as InnoDB's blob_heap and MyISAM's rec_buff: a single buffer holds all blob data for the current row, grown via my_realloc() as needed, and freed on heap_reset() / HA_EXTRA_NORMAL / heap_close().

      For zero-copy cases (A and B), blob field pointers are set directly into the HP_BLOCK continuation records — no blob_buff allocation or copy is needed. The HP_INFO::has_zerocopy_blobs flag tracks whether the last hp_read_blobs() call produced any zero-copy pointers. This flag is used by heap_update() to refresh the caller's record buffer after an update frees old continuation chains — without this, zero-copy pointers in the caller's buffer would become dangling pointers into freed records.

      BLOB field pointers in record[0] are rewritten to point into either blob_buff (Case C) or directly into HP_BLOCK records (Cases A/B) after reassembly. The buffer/pointers remain valid until the next row fetch — the SQL layer's Field_blob::val_str() dereferences these pointers with no additional copy.
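
      The grow-and-reuse pattern behind blob_buff can be sketched in plain C (illustrative names; the engine uses my_realloc() with the PSI key from hp_static.c and maps failure to its own error codes):

```c
#include <stdlib.h>

/* Per-handle reassembly buffer, grown on demand and reused per row. */
struct blob_buf
{
    unsigned char *ptr;
    size_t len;
};

/* Ensure capacity for `needed` bytes; returns 0 on success, 1 on OOM. */
static int blob_buf_reserve(struct blob_buf *b, size_t needed)
{
    if (needed <= b->len)
        return 0;                      /* already big enough: reuse */
    unsigned char *p = realloc(b->ptr, needed);
    if (!p)
        return 1;                      /* caller reports out-of-memory */
    b->ptr = p;
    b->len = needed;
    return 0;
}

/* Freed on reset/close so a huge blob does not pin memory per handle. */
static void blob_buf_free(struct blob_buf *b)
{
    free(b->ptr);
    b->ptr = NULL;
    b->len = 0;
}
```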


      Changes

      HEAP Engine Core (storage/heap/)

      New file: hp_blob.c (~850 lines)

      • hp_write_blobs() — writes all blob columns for a row into continuation chains. On failure, rolls back any partially written chains. Supports both zero-copy layout (Case A/B for single-run tail allocations) and standard layout (Case C for free-list or multi-run allocations).
      • hp_read_blobs() — two-pass approach: first pass sums data lengths for Case C blobs to size blob_buff; second pass sets zero-copy pointers (Cases A/B) or reassembles into blob_buff (Case C). Sets has_zerocopy_blobs flag.
      • hp_free_blobs() — frees all continuation chains for a row (called on DELETE before free-list insertion).
      • hp_free_run_chain() — frees a single continuation chain (individual records returned to free list).
      • hp_materialize_one_blob() — materializes a single blob column from a stored continuation chain. Returns zero-copy pointer for Cases A/B, or reassembles into blob_buff for Case C. Used by hash comparison functions when comparing stored records.
      • hp_alloc_from_tail() — allocates a single record from the HP_BLOCK tail, bypassing the free list. Used for guaranteed-contiguous allocation.
      • hp_unlink_and_write_run() — unlinks a contiguous group of records from the free list and writes blob data into them.
      • Internal helpers for run data writing.

      hp_create.c

      • Blob descriptor array (HP_BLOB_DESC) allocated alongside HP_SHARE in a single allocation.
      • visible_offset minimum enforced for blob tables to ensure at least 1 byte of payload beyond the continuation header (prevents underflow for pathological single-TINYBLOB tables with reclength as low as 9).
      • Key segment setup: HA_BLOB_PART flag validated against blob_descs[] array — spurious flags (from uninitialized key_part_flag in SJ weedout tables, or BLOB_FLAG on non-Field_blob INFORMATION_SCHEMA temp tables) are stripped. Real blob segments get bit_start set to actual packlength (1–4) instead of the generic VARCHAR bit_start=2.
      • Key length calculation accounts for blob segments: 4 + sizeof(uchar*) (length + data pointer) in the key buffer.

      hp_write.c

      • heap_write() calls hp_write_blobs() after memcpy. New err_blob: label for correct key rollback when blob write fails (all keys already written — must iterate all keydef entries, not skip the current one). Neither err: nor err_blob: calls hp_free_blobs(): err: is reached before memcpy/hp_write_blobs() (stale data in pos); err_blob: is reached after hp_write_blobs() fails, but hp_write_blobs() is self-cleaning (frees its own partial chains, NULLs all pointers). Calling hp_free_blobs() from err_blob: would crash because the flags byte is uninitialized and blob columns after the failed one still have SQL layer data pointers (not chain pointers).
      • next_free_record_pos() uses block.last_allocated instead of share->records for position arithmetic (since records no longer equals total physical records). Made non-static for use by hp_blob.c.

      hp_delete.c

      • hp_free_blobs() called before free-list insertion.
      • total_records decremented on delete.

      hp_update.c

      • Write-before-free strategy: new blob chains are written before old ones are freed. On failure, old record and chain pointers are restored. This ensures crash-safety on OOM/table-full during blob grow.
      • After update, when has_zerocopy_blobs is set, refreshes the caller's record buffer by copying new chain pointers from the stored row and calling hp_read_blobs(). This prevents dangling zero-copy pointers from pointing into freed old chains (e.g. in the INTERSECT ALL unfold path where the caller reuses the record buffer after update).

      hp_scan.c

      • Continuation records (HP_ROW_IS_CONT) are skipped during table scans. Rec 0 of each run has the flags byte; inner records have no flags byte. The scan reads run_rec_count from the header and batch-skips the entire run (returned as HA_ERR_RECORD_DELETED to the caller, same as actual deleted records).
      • Scan boundary uses total_records + deleted instead of records + deleted.

      hp_hash.c — hash function fixes for blob key segments

      • hp_rec_hashnr() — for HA_BLOB_PART segments: reads blob length using actual packlength, dereferences data pointer, hashes actual blob data. Input records always have valid blob pointers.
      • hp_rec_key_cmp() — blob segment comparison with materialization support. Takes new HP_INFO *info parameter. When info is non-NULL, the second record is treated as stored (blob pointers are continuation chain heads) and materialized via hp_materialize_one_blob(). When info is NULL (e.g. heap_update() comparing two input records), both sides are dereferenced directly.
      • hp_key_cmp() — blob segment comparison between stored record and pre-built search key. Stored side is materialized from chain; key side has 4-byte length + data pointer from hp_make_key().
      • hp_make_key() — for blob segments: stores 4-byte length + data pointer (not inline data) in the key buffer.
      • hp_hashnr() — pre-built key hashing: blob segments encoded as 4-byte length + data pointer.
      • NULL handling: key advancement accounts for blob segment size (4 + sizeof(uchar*)) vs varchar (2).

      hp_rrnd.c, hp_rfirst.c, hp_rlast.c, hp_rnext.c, hp_rprev.c, hp_rkey.c, hp_rsame.c

      • hp_read_blobs() call added after memcpy(record, pos, reclength) in every read path, guarded by share->blob_count.

      hp_clear.c

      • total_records and block.last_allocated reset on TRUNCATE.

      hp_close.c

      • blob_buff freed on close.

      hp_extra.c

      • blob_buff freed on heap_reset() (prevents a 4GB LONGBLOB from permanently consuming 4GB of memory per handle).

      hp_static.c

      • hp_key_memory_HP_BLOB PSI memory instrumentation key added.

      _check.c

      • heap_check_heap() updated to iterate using total_records + deleted, count continuation records separately (batch-skipping inner records via run_rec_count), and validate records + cont_count == total_records.

      Handler Interface (ha_heap.h / ha_heap.cc)

      • HA_NO_BLOBS removed and HA_CAN_GEOMETRY added to ha_heap::table_flags(). GEOMETRY columns (which use Field_blob internally) are now supported in MEMORY tables.
      • heap_prepare_hp_create_info() populates HP_BLOB_DESC array from TABLE_SHARE::blob_fields, filtering non-Field_blob types (e.g. INFORMATION_SCHEMA Field_string with BLOB_FLAG). Both MYSQL_TYPE_BLOB and MYSQL_TYPE_GEOMETRY are included — Field_geom extends Field_blob but returns a different field_type().
      • remember_rnd_pos() / restart_rnd_next() implemented. Required by remove_dup_with_compare() for DISTINCT deduplication with blobs. Saves and restores both the row position and current_record counter (which heap_rrnd() does not update). restart_rnd_next() caps next_block at total_records + deleted to prevent heap_scan() from reading past the last allocated slot — without this cap, the scan boundary can exceed the allocated HP_BLOCK range when the saved position is in the last block segment, causing SIGSEGV.
      • find_unique_row() — argument order swapped for hp_rec_key_cmp() (input record first, stored record second) and hp_read_blobs() called after record copy.
      • blob_descs memory freed in open() and create() after heap_create().

      Data Structures (include/heap.h / heapdef.h)

      • HP_BLOB_DESC struct: offset (byte offset within record buffer) + packlength (1–4).
      • HP_SHARE: added blob_descs, blob_count, total_records.
      • HP_INFO: added blob_buff, blob_buff_len, has_zerocopy_blobs.
      • HP_CREATE_INFO: added blob_descs, blob_count.
      • Constants: HP_ROW_ACTIVE, HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY, HP_CONT_HEADER_SIZE, HP_CONT_REC_COUNT_SIZE, HP_CONT_MIN_RUN_BYTES, HP_CONT_RUN_FRACTION_NUM/DEN.

      SQL Layer (sql/sql_select.cc)

      choose_engine() (temp table engine selection):

      • Removed share->blob_fields from the condition that forces disk-based engine.
      • Added share->reclength > HA_MAX_REC_LENGTH check instead — HEAP's fixed-width rows would waste too much memory for very wide records.

      finalize() (temp table creation):

      • HEAP tables with blobs still use fixed-width rows (blob data lives in separate continuation records, not inline). The first-byte guard (ensuring byte 0 is non-zero for the delete-link mechanism) is applied to HEAP tables regardless of blob presence.
      • Packed row format (use_packed_rows) is skipped for HEAP tables — HEAP requires fixed-width rows.

      remove_duplicates() (DISTINCT deduplication):

      • remove_dup_with_hash_index() cannot handle blobs (copies field data via make_sort_key() + memcmp, which compares pointer values for blob fields). The blob check is moved before the HEAP engine check so HEAP+blob tables fall through to remove_dup_with_compare().

      LIMIT adjustment with fanout:

      • Fixed undefined behavior when select_limit/fanout overflows. With HEAP blob support, the optimizer's temp table engine choice changes, altering cost estimates and sometimes producing near-zero fanout values (from tiny cond_selectivity). Dividing select_limit by such a fanout produces infinity or a value exceeding HA_POS_ERROR; casting that to ha_rows (unsigned long long) is undefined behavior per the C++ standard. Added an explicit cap: if select_limit/fanout >= (double) HA_POS_ERROR, set select_limit = HA_POS_ERROR.
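
      The guard can be sketched as follows (HA_POS_ERROR stands in for the server's all-ones ha_rows sentinel; the point is that the range check happens in double, where it also catches +infinity, before any narrowing cast):

```c
/* Simplified stand-ins for the server types (assumption: 64-bit ha_rows). */
typedef unsigned long long ha_rows;
#define HA_POS_ERROR (~(ha_rows) 0)

/* Divide a row limit by a fanout estimate without the undefined
   behavior of casting an out-of-range double to an unsigned integer. */
static ha_rows adjust_limit(ha_rows select_limit, double fanout)
{
    double scaled = (double) select_limit / fanout;
    if (scaled >= (double) HA_POS_ERROR)   /* also true for +inf */
        return HA_POS_ERROR;
    return (ha_rows) scaled;
}
```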

      SQL Layer (sql/item_sum.cc)

      Aggregator_distinct::setup():

      • The Unique tree path (which compares raw record bytes via simple_raw_key_cmp / composite_key_cmp) is skipped for HEAP tables with blobs, since raw comparison would compare pointer values rather than actual blob data. HEAP+blob tables use the ha_write_tmp_row() path instead.

      Aggregator_distinct::add():

      • Added create_internal_tmp_table_from_heap() conversion when ha_write_tmp_row() fails with a non-duplicate error (e.g. HA_ERR_RECORD_FILE_FULL). Previously, only the Unique tree path handled HEAP overflow conversion. With blobs routed through ha_write_tmp_row(), the overflow-to-disk conversion must happen here too.

      SQL Layer — Expression Cache (sql/sql_expression_cache.cc)

      Expression_cache_tmptable::init():

      • Disables the expression cache when the HEAP temp table has blob fields. HEAP hash indexes on blob columns use a pointer-based key format (4-byte length + data pointer from hp_make_key()) that is incompatible with the SQL layer's key format (2-byte length + inline data from Field_blob::new_key_field() which returns Field_varstring). This restores the pre-blob-support behavior: previously blob items forced conversion to Aria, which failed the db_type() == heap_hton check and disabled the cache anyway. No performance impact.

      SQL Layer — FULLTEXT Engine Swap (sql/item_func.cc)

      Item_func_match::fix_fields():

      • Converts HEAP tmp tables to Aria when FULLTEXT is needed. With HEAP blob support, derived tables with TEXT columns stay in memory instead of being forced to Aria. If such a table is used with MATCH...AGAINST, the HEAP handler must be swapped to a disk-based engine that supports HA_CAN_FULLTEXT. The swap targets only unopened (!table->is_created()) HEAP tmp tables. The handler is replaced following the same pattern as create_internal_tmp_table_from_heap(): delete old handler, reset ha_share, plugin_unlock / ha_lock_engine for the engine plugin, create new handler via get_new_handler(), set_ha_share_ref, set_table.

      Performance Impact

      Non-Blob Tables: Zero Regression

      The most important design goal was no measurable impact on existing workloads. For tables without BLOB/TEXT columns:

      • No new allocations: blob_buff, blob_descs, and continuation chains are never allocated. blob_count is 0, total_records tracks records identically, and blob_buff is never touched.
      • No new code in hot paths: Every blob-specific code path is guarded by if (share->blob_count). This is a single branch on a struct member that lives in the same cache line as other frequently accessed fields (reclength, visible, records). The branch predictor learns always-false within a few iterations — the branch itself is effectively free.
      • No changes to row format: The flags byte at offset visible continues to be written as 1 (active) or 0 (deleted) exactly as before. The new flag bits (HP_ROW_HAS_CONT, HP_ROW_IS_CONT, HP_ROW_CONT_ZEROCOPY) are never set.
      • No changes to hash functions for non-blob keys: The new HA_BLOB_PART branches in hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), and hp_make_key() are only entered for key segments with the HA_BLOB_PART flag set — which never happens for non-blob tables.
      • No change to next_free_record_pos() semantics: The switch from share->records to block.last_allocated for position arithmetic is equivalent when there are no continuation records (both values track the same quantity).

      In summary, for non-blob tables, the entire feature reduces to a handful of never-taken branches per row operation. There is no additional memory usage, no additional function calls, and no change to the data layout.

      Blob Tables: Performance Characteristics

      For tables with BLOB/TEXT columns, the primary benefit is avoiding disk I/O entirely. The tradeoff:

      • Write path: Each blob column requires allocating continuation records and copying blob data into them. The allocation is amortized (HP_BLOCK allocates leaf blocks of ~1000 records at a time), so the per-row overhead is dominated by the memcpy of blob data — which is unavoidable regardless of storage approach. For the common case of small blobs (JSON documents, short TEXT values under ~150 KB), this is a single contiguous run with one memcpy. Single-run blobs from tail allocation use the zero-copy layout (Cases A/B), enabling the read path to avoid copying entirely.
      • Read path: For single-run blobs (Cases A/B — the common case in temporary tables), the read path returns a pointer directly into HP_BLOCK with no memcpy at all. For multi-run blobs (Case C), data is reassembled from continuation chains into blob_buff via memcpy — the same pattern as InnoDB's blob_heap. The reassembly buffer is allocated once and reused for every row. The reassembly cost is proportional to blob data size, not to the number of runs (runs within a leaf block are contiguous, so the memcpy loop processes large chunks).
      • Hash operations: When DISTINCT or UNION deduplication requires hashing/comparing blob key segments, the stored blob must be materialized from its continuation chain before comparison. For zero-copy blobs (Cases A/B), materialization returns a direct pointer with no copy. For Case C, one reassembly is needed per hash probe for the stored side. The input side (the record being inserted) always has a valid direct pointer and requires no materialization. In practice, hash collisions are rare, so most insertions require only hashing (no comparison).
      • Memory: For Case C blobs, data exists in continuation records (persistent) plus the reassembly buffer (transient, per-handle, freed on heap_reset()). This is a 2× peak for the currently-read row's blob data — identical to InnoDB's approach. For Cases A/B (zero-copy), no reassembly buffer is needed — 1× memory only. The reassembly buffer is not allocated for non-blob tables, and is freed between statement executions to avoid holding large buffers across queries.
      • Free list reuse: When blob rows are deleted and reinserted, the peek-then-unlink algorithm walks the free list read-only, detecting contiguous groups of adjacent records (descending address order only — matching LIFO push order). On the first non-qualifying (too short) group, the scan stops entirely and falls through to tail allocation — no wasted time scanning a fragmented list. Only qualifying groups are actually unlinked. Since deleted records are pushed consecutively (LIFO), they naturally appear as contiguous sequences, so recovery has high probability. Individual runs are capped at 65,535 records (uint16 format limit); larger groups are split into multiple runs at this boundary.

      Net Effect on Temporary Table Workloads

      For the target use case — implicit/internal temporary tables created by GROUP BY, DISTINCT, UNION, and subqueries — the expected net effect is a significant performance improvement whenever blob-typed columns are present. Instead of:

      SQL layer -> Aria temp table -> page cache management -> (eviction to OS under memory pressure)
      

      the data path becomes:

      SQL layer -> HEAP temp table -> memcpy into HP_BLOCK -> zero-copy pointer out (Cases A/B)
                                                           -> memcpy out of HP_BLOCK (Case C)
      

      This eliminates file creation/deletion overhead and page cache management. For the common case of small-to-medium blobs fitting in a single run, the read path is entirely zero-copy — no blob_buff allocation, no memcpy. The max_heap_table_size / tmp_memory_table_size limits continue to act as a safety valve: when a temporary table with blobs exceeds the memory limit, the existing overflow-to-disk conversion (create_internal_tmp_table_from_heap()) triggers transparently, falling back to Aria exactly as it does today for non-blob tables that outgrow memory.


      Limitations

      1. BTREE indexes on blob columns are not supported — BTREE key packing (hp_rb_pack_key / hp_rb_var_key_length) is not updated for blob segments. HASH indexes on blob columns work correctly. In practice this is not a limitation for internal temporary tables, which exclusively use HASH indexes for deduplication.
      2. No partial-key prefix indexing for blobs — the entire blob value is hashed/compared. This matches the behavior of other engines for HASH indexes and is correct for DISTINCT/UNION deduplication.
      3. 2x memory for blob data during reads (Case C only) — for multi-run blobs, data exists both in continuation records and in the reassembly buffer simultaneously. The reassembly buffer is freed on heap_reset() / handler close. For single-run blobs (Cases A/B), zero-copy pointers are used and no reassembly buffer is needed — 1x memory only.
      4. No compression — blob data is stored uncompressed in continuation records. For the target use case (small JSON documents, short TEXT values in temporary tables), this is acceptable.
      5. 65,535 records per runrun_rec_count is stored as uint16, limiting a single continuation run to 65,535 records. Blobs requiring more are automatically split into multiple runs at this boundary. For the smallest possible recbuffer (16 bytes, visible=15), one run holds up to 5 + 65534 × 16 ≈ 1,024 KB of payload. Larger blobs use multiple runs with negligible additional overhead (10 bytes per extra run header).
      6. max_heap_table_size enforcement — continuation records consume the same space as regular rows and are counted toward data_length. The existing max_heap_table_size / tmp_memory_table_size limits apply naturally. When a blob insert would exceed the limit, the row is rolled back and the error propagated — the SQL layer's overflow-to-disk conversion (create_internal_tmp_table_from_heap()) handles this transparently.
      7. Expression cache disabled for blob fields — HEAP hash indexes on blob columns use a pointer-based key format (4-byte length + data pointer) that is incompatible with the SQL layer's key format used by Expression_cache_tmptable (2-byte length + inline data from Field_blob::new_key_field()). The cache is disabled when blob fields are present, restoring the pre-blob-support behavior (previously blobs forced Aria, which also disabled the cache).
      8. FULLTEXT requires engine swap — HEAP does not support FULLTEXT indexes. When a derived table with TEXT columns is used with MATCH...AGAINST, the HEAP handler is swapped to Aria in Item_func_match::fix_fields() before the table is opened.
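A sketch of how limitations 1 and 2 surface for an explicit MEMORY table (the table and values are hypothetical):

```sql
-- HASH index over a full blob value: supported; the entire value is hashed
-- (no prefix indexing, per limitation 2).
CREATE TABLE t (b TEXT, KEY (b) USING HASH) ENGINE=MEMORY;
INSERT INTO t VALUES ('abc'), ('abc'), ('xyz');
SELECT DISTINCT b FROM t;     -- dedup compares blob data, not pointer values

-- BTREE index over a blob column: not supported (limitation 1).
-- CREATE TABLE t2 (b TEXT, KEY (b) USING BTREE) ENGINE=MEMORY;  -- would fail
```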

      Test Changes

      New Tests

      • heap/heap_blob.test: Comprehensive test suite for direct HEAP blob operations — basic CRUD, multiple blob types (TINYBLOB through LONGBLOB), NULL/empty values, large blobs spanning multiple continuation runs, free list reuse after DELETE, free list fragmentation with NULL-blob/large-blob interleaving, mixed NULL/non-NULL multi-column blob scavenging, large blobs exceeding the uint16 run_rec_count cap (1MB blobs requiring ~70K records, forcing automatic splitting at the 65,535 boundary), TRUNCATE, full table scan correctness, HASH and BTREE indexes with blob columns present, REPLACE, INSERT...SELECT, table-full errors, UPDATE failure rollback, non-blob regression, and zero-copy edge cases (Case A: wide table with small blobs, Case B: medium blobs in narrow/wide tables with delete+reinsert, Case B->C boundary: blobs near leaf block size).
      • heap/blob_dedup.test: Tests DISTINCT and UNION deduplication with HEAP blob tables — verifies that hash-based deduplication correctly compares blob data (not pointer values).
      • main/blob_sj_test.test: Regression test for the SJ weedout HA_BLOB_PART crash — exercises subquery materialization with blob columns to verify that spurious HA_BLOB_PART flags from uninitialized key_part_flag are correctly stripped in hp_create.c.
      • heap/heap_geometry.test: Tests GEOMETRY columns in MEMORY tables. Verifies the engine is MEMORY (not silently converted to Aria), inserts 7 distinct LineString geometries, then performs 11 rounds of INSERT INTO t1 SELECT * FROM t1 doublings (7→14,336 rows). After each doubling, validates zero corrupt rows (ST_AsText(c2) IS NOT NULL for all rows). Final verification: GROUP BY confirms all 7 geometry values present with correct counts, MBRWithin() spatial predicate validates geometry data integrity. This test exercises the Case A zero-copy detection fix in hp_read_blobs() / hp_materialize_one_blob() — without the fix, multi-run blobs whose first run has run_rec_count==1 (from free-list fragmentation during INSERT...SELECT) are incorrectly treated as single-record zero-copy, causing the read to overflow past the 7-byte rec 0 payload into adjacent records' flags bytes, corrupting the WKB geometry type field.
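The core of the heap_geometry scenario can be sketched as follows (geometry values abbreviated; the real test inserts 7 distinct LineStrings and doubles 11 times):

```sql
CREATE TABLE t1 (c1 INT, c2 LINESTRING) ENGINE=MEMORY;
INSERT INTO t1 VALUES (1, ST_GeomFromText('LINESTRING(0 0, 1 1)'));
-- ... 6 more distinct LineStrings ...

-- Repeated 11 times: each pass doubles the row count (7 -> 14336)
-- and exercises free-list fragmentation of continuation runs.
INSERT INTO t1 SELECT * FROM t1;

-- Zero corrupt rows expected after each doubling:
SELECT COUNT(*) FROM t1 WHERE ST_AsText(c2) IS NULL;
```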

      Modified Tests

      Several existing tests required adjustments because HEAP's row ordering differs from Aria's, and because I_S temp tables now use MEMORY instead of Aria:

      • main/select.test: Added ORDER BY fld3 to a GROUP BY ... LIMIT query. Without an explicit ORDER BY, LIMIT selects a non-deterministic window that depends on the temp table engine's physical row ordering.
      • main/cte_recursive.test: Added --sorted_result before a recursive CTE query where rows with identical (level, mid) values (due to integer overflow) have non-deterministic order.
      • main/distinct.test: Added --replace_regex to mask a ROLLUP row's b value — the value depends on which group is processed last, which varies by temp table engine.
      • main/group_by.test: Changed Created_tmp_disk_tables expected value from 1 to 0 — a query with (int, decimal, blob) columns now uses HEAP instead of Aria for the temp table.
      • main/intersect_all.test: Added --sorted_result before UNION ALL queries without ORDER BY whose row ordering depends on the temp table engine.
      • main/information_schema_part.test: Added subpartition_name to ORDER BY — within the same partition, subpartition ordering was non-deterministic.
      • main/temp_table_symlink.test: Removed the I_S query that tested disk temp file symlink conflicts — with HEAP blob support, I_S temp tables stay in memory and create no disk files, so the symlink attack vector does not apply.
      • main/information_schema.result, main/information_schema_parameters.result, main/information_schema_routines.result: Updated expected engine name from Aria to MEMORY for I_S system views that contain TEXT/BLOB columns.
      • main/derived_view.result: Query plan improvement — HEAP temp table enables ref access via hash index instead of block nested loop join.
      • main/select_pkeycache.result, main/select_jcl6.result: Re-recorded for row ordering differences.
      • main/create.test: Removed --error ER_TABLE_CANT_HANDLE_BLOB — HEAP now accepts blob columns. Kept as a positive test.
      • sys_vars/tmp_disk_table_size_basic.test: Removed --error ER_RECORD_FILE_FULL from the I_S views query — with HEAP blob support, the I_S temp table stays in memory and is not affected by tmp_disk_table_size. Added --sorted_result before SELECT table_name FROM INFORMATION_SCHEMA.views — row ordering is non-deterministic without ORDER BY and changed with HEAP temp tables.
      • funcs_1/is_events.result: Updated ENGINE=MEMORY in SHOW CREATE for I_S events table.
      • innodb_fts/innodb-fts-ddl.test: Added FTS_DOC_ID tiebreaker to ORDER BY for deterministic ordering of rows with identical FULLTEXT scores.
      • main/group_min_max_innodb.test: Added --replace_column 2 # to mask non-deterministic blob value in GROUP BY query — the serialized_c column is not aggregated, so which row is picked for voter_id=4 (two rows with different blob values) depends on internal access order.
      • funcs_1/is_tables_is.result, funcs_1/is_tables_is_embedded.result: Updated ENGINE from MYISAM_OR_MARIA to MEMORY and ROW_FORMAT from DYNAMIC_OR_PAGE to Fixed for I_S tables that now stay in MEMORY.
      • funcs_1/is_routines_embedded.result, funcs_1/is_triggers_embedded.result, funcs_1/is_views_embedded.result: Updated ENGINE=MEMORY in SHOW CREATE for embedded server variants of I_S tests. is_routines_embedded also updated for the mtr_check.sql stored procedure body change (ORDER BY columns_in_mysql, ordinal_position).
      • funcs_1/is_columns.result, funcs_1/is_routines.result, funcs_1/is_triggers.result, funcs_1/is_views.result: Updated ENGINE=MEMORY in SHOW CREATE for I_S system views.
      • funcs_1/processlist_priv_ps.result, funcs_1/processlist_val_ps.result: Updated ENGINE=MEMORY in SHOW CREATE for processlist I_S tables (ps-protocol variants).
      • sysschema/v_schema_redundant_indexes.result: Re-recorded for non-deterministic row ordering.
      • innodb_fts/misc.test: Added id column as tiebreaker to ORDER BY MATCH...AGAINST queries — rows with equal FTS relevance scores have non-deterministic ordering that changed when intermediate temp tables use HEAP instead of Aria.
      • funcs_1/processlist_priv_no_prot.result, funcs_1/processlist_val_no_prot.result: Updated ENGINE=MEMORY in SHOW CREATE for processlist I_S tables (no-protocol variants).
      • include/mtr_check.sql: Added ordinal_position tiebreaker to the INFORMATION_SCHEMA.COLUMNS query's ORDER BY clause — the query ordered by columns_in_mysql (table_schema.table_name) only, so columns within the same table had non-deterministic ordering that changed when I_S temp tables use HEAP instead of Aria. This change also required updating result files that embed the stored procedure body text: funcs_1/is_routines_embedded.result and plugins/sql_error_log_withdbinfo.result.
      • perfschema/include/transaction_nested_events_verifier.inc: Added r_end_event_id tiebreaker to the UNION query's ORDER BY clause — the combined transaction+statement history query ordered by thread_id, r_event_id, but rows with the same event_id and different end_event_id values (e.g. an in-progress transaction and its completed counterpart) had non-deterministic ordering that changed when intermediate temp tables use HEAP instead of Aria.
      • perfschema/transaction_nested_events.result: Re-recorded to match the stable ordering from the verifier fix above.
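Most of the adjustments above follow two recurring mtr patterns — force a stable ordering, or mask the unstable value. An illustrative fragment (tables and directives' targets are hypothetical):

```sql
# Pattern 1: stable ordering via --sorted_result or an explicit tiebreaker.
--sorted_result
SELECT a FROM t1 UNION ALL SELECT a FROM t2;

SELECT id, MATCH(body) AGAINST ('word') AS score
FROM docs ORDER BY score DESC, id;    # id breaks ties between equal scores

# Pattern 2: mask the column whose value depends on internal access order.
--replace_column 2 #
SELECT grp, picked_value FROM t3 GROUP BY grp;
```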

            People

              monty Michael Widenius
              arcivanov Arcadiy Ivanov
