In secondary indexes, InnoDB trades complexity for space by not keeping track of version history for each secondary index record separately. This complicates three interdependent areas: multi-version concurrency control (MVCC), purge of transaction history, and implicit locking.
In multi-versioning, if a secondary index record has been marked for deletion, or if PAGE_MAX_TRX_ID indicates that the page where the record resides has been recently modified, InnoDB has to look up the record in the clustered index and then painstakingly construct each available version of the record to find out if the secondary index record exists in the current read view.
As noted in MDEV-16962, the purge of history must ensure that a delete-marked secondary index record is not visible in any MVCC view. If the secondary index comprises virtual columns, then the matching of clustered index record versions and the secondary index record may have to evaluate virtual column values.
As noted in MDEV-11215, a record is implicitly locked by a transaction if it was written or modified by that transaction. If a conflicting lock request arrives from another transaction, that transaction may convert the implicit lock to an explicit one (on behalf of the lock-holding transaction) before creating its own lock object and starting the lock wait.
For clustered index records, determining whether a record is implicitly locked is easy: the hidden DB_TRX_ID column will belong to an active (or XA PREPARE'd, not yet committed) transaction. For secondary index records it is very expensive, because there is no per-record DB_TRX_ID, only a PAGE_MAX_TRX_ID. Therefore the function row_vers_impl_x_locked() has to look up the matching version of the record in the clustered index before we can determine whether the secondary index record is implicitly locked. This, together with the similarly slow MVCC logic, can lead to a 'death spiral' on a busy server: these operations slow down the purge of transaction history, and they become slower as the history grows. In my response to a blog post I mentioned Bug#14704286 SECONDARY INDEX UPDATES MAKE CONSISTENT READS DO O(N^2) UNDO PAGE LOOKUPS.
If we had DB_TRX_ID in InnoDB secondary index records, the existence of implicit locks could be detected directly from that field, just like in clustered indexes.
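With a per-record DB_TRX_ID, the check could be sketched as follows. This is a minimal illustration, not actual InnoDB code; trx_sys_t and the function names are hypothetical stand-ins for the real transaction- and lock-system interfaces:

```cpp
#include <cstdint>
#include <set>

using trx_id_t = uint64_t;

// Hypothetical stand-in for the set of active (or XA PREPARE'd,
// not yet committed) transactions tracked by the transaction system.
struct trx_sys_t {
  std::set<trx_id_t> active;
  bool is_active(trx_id_t id) const { return active.count(id) != 0; }
};

// With a per-record DB_TRX_ID, an implicit lock on a secondary index
// record is detected exactly like in the clustered index: the record
// is implicitly locked iff its DB_TRX_ID belongs to an active transaction.
bool sec_rec_impl_locked(const trx_sys_t& sys, trx_id_t rec_trx_id) {
  return sys.is_active(rec_trx_id);
}
```

No clustered index lookup (the expensive row_vers_impl_x_locked() path) would be needed here.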
To reduce the space overhead, we could introduce a change to the secondary index leaf page header, allowing the following choices. This could be implemented by repurposing the field PAGE_MAX_TRX_ID.
PAGE_MAX_TRX_ID = 0: All records in the page are visible for all transactions.
PAGE_MAX_TRX_ID = 0xFFFFFFFFFFFFFFFF: A DB_TRX_ID is appended to every record, possibly with multiple records per (key_cols,pk_cols) value.
Else: There is no per-record versioning. (The old way.)
With this, we could gradually convert secondary index pages to the new format as the pages are being modified. No change to the metadata would be needed; all secondary indexes would be in this hybrid format. Tables could be imported to old MariaDB versions, but secondary indexes would have to be dropped and re-created if any page contains the extra fields.
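The three header choices above could be decoded roughly like this (a sketch only; the enumerator and function names are invented for illustration, and the special values are the proposed ones, not a committed format):

```cpp
#include <cstdint>

// Interpreting the repurposed PAGE_MAX_TRX_ID field on a secondary
// index leaf page:
//   0                  -> all records visible to all transactions
//   0xFFFFFFFFFFFFFFFF -> every record carries its own DB_TRX_ID
//   anything else      -> legacy per-page maximum transaction id
enum class sec_page_format { ALL_VISIBLE, PER_RECORD_TRX_ID, LEGACY };

sec_page_format classify(uint64_t page_max_trx_id) {
  if (page_max_trx_id == 0) return sec_page_format::ALL_VISIBLE;
  if (page_max_trx_id == ~uint64_t{0})
    return sec_page_format::PER_RECORD_TRX_ID;
  return sec_page_format::LEGACY;
}
```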
Marko Mäkelä added a comment:

Because multi-versioning in secondary indexes is based on duplicating the records, it does not appear necessary to store an undo log pointer (DB_ROLL_PTR) in secondary index records. In the clustered index, we must store it, because all versions of a record share a single primary key.
If we always stored a per-record transaction identifier in secondary index leaf pages, the access patterns that are explained in a lengthy comment at the start of the file read0read.cc will be greatly simplified. We would never have to visit the clustered index for checking if a secondary index record is visible or implicitly locked.
When InnoDB evaluates indexed virtual column values
The code that evaluates indexed virtual columns (MDEV-5800) in InnoDB is causing hard-to-fix problems, such as MDEV-16962. A sign of virtual column evaluation is the function innodb_find_table_for_vc(). It is being called from innobase_allocate_row_for_vcol(), which in turn is called due to the following:
ADD INDEX that includes a virtual column
FOREIGN KEY processing (until MDEV-22361 )
After MDEV-14341 is implemented: native online rebuild of a table containing indexed virtual columns
MVCC for secondary indexes: row_sel_sec_rec_is_for_clust_rec()
MVCC and purge: row_vers_build_clust_v_col()
row_upd_store_v_row(), invoked at the start of UPDATE. The SQL layer should be able to fill in both the old and the new values of indexed virtual columns in TABLE::record[], so this call should be redundant.
row_upd_build_difference_binary(), invoked when an INSERT replaces a previously delete-marked PRIMARY KEY record.
Note: row_vers_impl_x_locked_low() (checking whether a secondary index record is implicitly locked) will retrieve the virtual column information from undo log records and does not have to evaluate anything. All undo log record types TRX_UNDO_UPD_EXIST_REC, TRX_UNDO_UPD_DEL_REC, and TRX_UNDO_DEL_MARK_REC will contain the values of all indexed virtual columns.
Indexed virtual column evaluation that cannot be avoided
Until MDEV-22361 moves the FOREIGN KEY processing to the SQL layer, InnoDB must be able to evaluate indexed virtual column values when a CASCADE or SET NULL operation would change the values of base columns.
ALTER TABLE…ADD INDEX or CREATE INDEX on virtual columns must be able to evaluate the column values. The calling thread is necessarily holding MDL, and the THD and both old and new TABLE are available.
When an INSERT (or UPDATE of a PRIMARY KEY) replaces a delete-marked record, that record must either have been deleted by the current transaction, or a DELETE must have been executed by a previously committed transaction. Either way, the delete-marked record may be accessible to old MVCC read views. The TRX_UNDO_UPD_DEL_REC must store both the old and new values of all indexed columns. The information is strictly needed on ROLLBACK if the current transaction executed both the DELETE and the INSERT (or UPDATE).
Eliminating indexed virtual column evaluation in MVCC and Purge
It turns out that we will only need a trivial comparison of a per-record transaction identifier in the secondary index record against the current read view or the clustered index record:
MVCC: row_sel_sec_rec_is_for_clust_rec()
Check for implicit lock: row_vers_impl_x_locked_low()
MVCC and purge: row_vers_build_clust_v_col()
Let us prove this for row_vers_build_clust_v_col() in more detail. The following quote from the lengthy comment in read0read.cc is relevant:
When purge needs to remove a delete-marked row from a secondary index, it will first check that the DB_TRX_ID value of the corresponding record in the clustered index is older than the purge view. It will also check if there is a newer version of the row (clustered index record) that is not delete-marked in the secondary index. If such a row exists and is collation-equal to the delete-marked secondary index record then purge will not remove the secondary index record.
The function row_purge_poss_sec() invokes row_vers_old_has_index_entry(true, …) on a clustered index record rec and a secondary index tuple ientry that has been constructed from rec (non-virtual column values) and the undo log record (indexed virtual column values). So, both rec and ientry are valid. At this point, we are not holding a secondary index page latch; only the clustered index leaf page is latched. The aim is to check whether any visible (active) version of rec matches ientry.
If we have a transaction identifier in the secondary index record, this matching also becomes a simple matter of comparing the DB_TRX_ID in the available clustered index record versions with the secondary index record.
Furthermore, we can avoid constructing the exact clustered index record value for each active version and converting it into secondary index record values. We will merely have to determine the DB_TRX_ID of each active version of rec by dereferencing the chain of DB_ROLL_PTR.
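A sketch of that idea, with illustrative stand-in types (the real code would follow DB_ROLL_PTR into undo log pages rather than an in-memory vector, and would stop at the purge view):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for one clustered index record version reachable via
// the DB_ROLL_PTR chain. `prev` models the roll pointer to the
// next-older version (nullopt = oldest available version).
struct clust_version {
  uint64_t trx_id;             // DB_TRX_ID of this version
  std::optional<size_t> prev;  // index of the older version
};

// Collect only the DB_TRX_ID of each version, newest first, instead
// of materializing every full record version.
std::vector<uint64_t> collect_version_trx_ids(
    const std::vector<clust_version>& undo, size_t newest) {
  std::vector<uint64_t> ids;
  for (std::optional<size_t> i = newest; i; i = undo[*i].prev)
    ids.push_back(undo[*i].trx_id);
  return ids;
}

// A secondary index record matches some version iff its stored
// DB_TRX_ID appears in the collected chain.
bool sec_rec_matches_some_version(const std::vector<uint64_t>& ids,
                                  uint64_t sec_trx_id) {
  for (uint64_t id : ids)
    if (id == sec_trx_id) return true;
  return false;
}
```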
In other words, we can greatly simplify the function trx_undo_prev_version_build(). Let us consider the following example:
--source include/have_innodb.inc
CREATE TABLE t(a SERIAL, b CHAR (1) UNIQUE ) ENGINE=InnoDB;
INSERT t SET b= 'x' ;
connect (mvcc,localhost,root);
START TRANSACTION WITH CONSISTENT SNAPSHOT;
connection default ;
UPDATE t SET b= 'X' ;
SELECT b FROM t FORCE INDEX (b);
connection mvcc;
SELECT b FROM t FORCE INDEX (b);
disconnect mvcc;
connection default ;
DROP TABLE t;
Let us assume that the INSERT was transaction identifier 100 and the UPDATE was 101. The database contents before the disconnect mvcc will be as follows:
Clustered index: (a,DB_TRX_ID,DB_ROLL_PTR,b)=(1,101,DB_ROLL_PTR_undo2,'X')
undo1: INSERT(100) (1,'x')
undo2: UPDATE(101) (1,100,DB_ROLL_PTR_undo1,'x'↦'X')
Secondary index: (b,a)=('X',1)
New format: (b,a,DB_TRX_ID)=('x',1,100),('X',1,101)
With a per-record transaction identifier in the secondary index, we will have 2 records and can filter out those transaction identifiers that do not match the visible clustered index record version.
Without a per-record transaction identifier, we are forced to construct the secondary index record from the visible clustered index record version. With the per-record transaction identifier, we can avoid that as follows:
MVCC (SELECT b FROM t)
Scan the secondary index in ascending order of (key,pk,DB_TRX_ID).
If PAGE_MAX_TRX_ID=0 on a leaf page, return all records from the page.
Else if PAGE_MAX_TRX_ID≠0xFFFFFFFFFFFFFFFF (indicating the old format), fall back to the old code, or report an error that the index is in an unsupported format.
Else, if the DB_TRX_ID is newer than the read view, skip to the next key.
Else, skip the record unless it carries the largest DB_TRX_ID that is visible to our read view. (Return at most one record of each (key,pk).)
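The last two steps of the scan can be sketched as follows; sec_rec and the visibility callback are illustrative stand-ins for the real record cursor and ReadView::changes_visible():

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a secondary index record in the proposed format.
struct sec_rec {
  std::string key;   // indexed column value(s)
  uint64_t pk;       // PRIMARY KEY value
  uint64_t trx_id;   // per-record DB_TRX_ID
};

// Given records sorted on (key, pk, DB_TRX_ID) ascending, keep at most
// one record per (key, pk): the one with the largest DB_TRX_ID that the
// read view can see. Invisible records are skipped entirely.
std::vector<sec_rec> mvcc_filter(const std::vector<sec_rec>& page,
                                 bool (*visible)(uint64_t)) {
  std::map<std::pair<std::string, uint64_t>, sec_rec> best;
  for (const sec_rec& r : page)
    if (visible(r.trx_id))
      best[{r.key, r.pk}] = r;  // later entry = larger visible trx_id
  std::vector<sec_rec> out;
  for (auto& kv : best) out.push_back(kv.second);
  return out;
}
```

On the example above, a read view that sees transaction 100 but not 101 would return only ('x',1), with no clustered index lookup.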
Purge (row_purge_poss_sec(), after disconnect mvcc)
Purging undo1: Look up ('x',1). We will find the record with the smallest DB_TRX_ID: ('x',1,100). It should be the record that we are to purge, but we could build in some robustness check and delete all matching records with DB_TRX_ID≤100.
Purging undo2: Look up ('X',1). We will find ('X',1,101). Because we will thus purge the last record with our transaction identifier (101), we will check the entire page to find if there are any records with DB_TRX_ID>101. Because no such records exist, we will convert the page to the PAGE_MAX_TRX_ID=0 format, converting the record to ('X',1) with no transaction history.
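The two purge steps can be sketched for a single (key,pk) group on one page. The types are illustrative stand-ins; real purge would operate on record heap data under a page latch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One secondary index record of a single (key,pk) group.
struct rec_t {
  uint64_t trx_id;     // per-record DB_TRX_ID
  bool delete_marked;
};

struct page_t {
  uint64_t page_max_trx_id;  // ~0 = per-record format, 0 = all visible
  std::vector<rec_t> recs;
};

// Purge of one undo record with transaction id `id`: remove matching
// delete-marked records up to that id; then, if no record on the page
// carries a newer DB_TRX_ID, fold the page back to the
// PAGE_MAX_TRX_ID=0 format (modelled by zeroing the per-record ids).
void purge_step(page_t& page, uint64_t id) {
  auto& rs = page.recs;
  rs.erase(std::remove_if(rs.begin(), rs.end(),
                          [&](const rec_t& r) {
                            return r.delete_marked && r.trx_id <= id;
                          }),
           rs.end());
  bool newer = std::any_of(rs.begin(), rs.end(),
                           [&](const rec_t& r) { return r.trx_id > id; });
  if (!newer) {
    page.page_max_trx_id = 0;
    for (rec_t& r : rs) r.trx_id = 0;
  }
}
```

Running this on the example (a delete-marked ('x',1,100) and a live ('X',1,101)) reproduces the narrative: purging undo1 removes the 'x' record and keeps the per-record format; purging undo2 finds no newer identifiers and converts the page to PAGE_MAX_TRX_ID=0.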
At the end of the purge (if it completes before DROP TABLE t), the table will contain the following:
Clustered index: (a,DB_TRX_ID,DB_ROLL_PTR,b)=(1,0,1<<55,'X') (history cleared by MDEV-12288)
Secondary index: (b,a)=('X',1), PAGE_MAX_TRX_ID=0
Compatibility notes
InnoDB versions predating this change will treat secondary index leaf pages carrying the two new PAGE_MAX_TRX_ID values as corrupted. Hence, tables can be exported to old versions only after dropping all secondary indexes.
We might want to retain support for the old format (without per-record transaction identifiers in secondary indexes) and remove that support in a later major release. Until the support is removed, this would allow users to upgrade without having to rebuild secondary indexes (or entire tables).
We could also require that all secondary indexes be rebuilt on upgrade: any secondary index leaf page where PAGE_MAX_TRX_ID is not one of our two special values would be treated as corrupted, the affected secondary indexes would not be updated, and user transactions would be aborted.
Marko Mäkelä added a comment:

The format change is not easily adaptable to ROW_FORMAT=COMPRESSED. We might address that in MDEV-22367 by removing write support for such tables.
valerii expressed a concern that for large INSERT, DELETE, or UPDATE operations affecting many secondary index leaf pages, we may have to split and merge pages frequently because the record size would change a lot.
PAGE_MAX_TRX_ID is 8 bytes while DB_TRX_ID is 6 bytes. The 2 extra bytes could be used for some flags.
Unless the secondary index comprises a single page (the root page), we can repurpose the previously unused fields PAGE_BTR_SEG_LEAF and PAGE_BTR_SEG_TOP for some storage.
We could store a ‘base’ value in PAGE_MAX_TRX_ID (the minimum transaction identifier that is present in the page).
Using a delta encoding, we could store a ‘dictionary’ of the distinct transaction identifiers in the page, maybe starting at PAGE_FREE.
For small indexes into this ‘dictionary’, repurpose info_bits and status_bits as pointers. For larger values, append the index to the record payload using variable-length encoding (1 or 2 bytes per record).
In my example, I forgot that the secondary index record ('x',1) would be delete-marked when the 'x' is updated to 'X'. For delete-marked records, we will probably have to store two transaction identifiers: the one where the record was ‘inserted’ (this can be 0 if the record was visible in all read views, that is, no old history is available), and the transaction identifier of delete-marking the record. If this turns out to be insufficient, we may have to allow an arbitrary number of transaction identifiers to be stored for a single secondary index record. It would be preferable to avoid encoding DB_ROLL_PTR.
When storing multiple consecutive records with the same PRIMARY KEY value, it could be useful to omit the value for subsequent records. For example, for the second record of ('x',1,100,delete-marked-by(101)),('X',1,101) we might set a flag in info_bits to indicate that the PRIMARY KEY is the same as in the preceding record.
zongzhi chen added a comment:

What's going on with this issue?
I think performance is much more important than space in OLTP workloads, so adding an approximately 8-byte trx_id to secondary index records is not a significant space cost.
However, a trx_id in the secondary index would solve many issues, especially the need to look up records in the clustered index.
Marko Mäkelä added a comment:

The reason why I have been hesitant to implement this is file format compatibility and upgrades. I would like to bundle this with another file format change that I think is long overdue, related to MDEV-11658. ALTER TABLE…IMPORT TABLESPACE would work without .cfg files, and to older servers such tables could only be imported without secondary indexes. Come to think of it, it would make sense to implement MDEV-11658 first, including some per-index format flags, so that in case MDEV-11658 is in a major release before this one, importing secondary indexes in this format to older releases will be correctly prevented.
Actually, the transaction identifier is only 48 bits (6 bytes) in the clustered index and undo log records. I don’t think that the storage overhead would be that much. We could also reduce it by adding some kind of a "dictionary" of transaction identifiers in the page. If there were up to 16 distinct per-record transaction identifiers, the overhead would be at most 4 bits per record + 16*6 bytes per page. The 16 most significant bits of the PAGE_MAX_TRX_ID are currently unused. We could apply some delta coding when storing the "dictionary", maybe using at most 16 bits per transaction ID. There are some unused header fields in every non-root page that we could repurpose for this. If too many per-record identifiers are needed, we could degrade to the old per-page logic, which would be identified by PAGE_MAX_TRX_ID being less than 2⁴⁸. This would reduce the overhead to (say) 4 bits per record. In fact, we are currently wasting at least 3 spare bits per record (except in the ROW_FORMAT=COMPRESSED format).
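A quick sanity check of that estimate, assuming a leaf page holding about 300 records (the page size and record count are illustrative assumptions):

```cpp
#include <cstdint>

// Overhead estimate from the comment above: at most 16 distinct
// transaction ids per page, 4 bits of dictionary index per record,
// 6 bytes per dictionary entry.
constexpr unsigned dict_entries = 16;
constexpr unsigned dict_bytes = dict_entries * 6;  // 96 bytes per page
constexpr double bits_per_rec = 4.0;

// Average bytes of overhead per record on a page holding n records.
constexpr double overhead_per_rec(unsigned n) {
  return bits_per_rec / 8 + static_cast<double>(dict_bytes) / n;
}
// For ~300 records per page this is well under 1 byte per record,
// versus 6 bytes for a full per-record DB_TRX_ID.
```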
zongzhi chen added a comment:

Yes, I agree with you about the storage overhead.
Actually, InnoDB does not deduplicate records in non-unique secondary indexes. Prefix compression in the secondary index can save a lot of space. We have done it in our MySQL version.
People
Thirunarayanan Balathandayuthapani
Marko Mäkelä
6f0986478cc64f24226b3d20c140d/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.css?jira.create.linked.issue=true","startTime":577,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":577,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1383.5,"responseStart":0,"secureConnectionStart":0},{"duration":677.7000002861023,"initiatorType":"script","name":"https://jira.mariadb.org/s/5d5e8fe91fbc506585e83ea3b62ccc4b-CDN/lu2cib/820016/12ta74/d176f0986478cc64f24226b3d20c140d/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-project.issue.navigator,-jira.view.issue/batch.js?jira.create.linked.issue=true&locale=en","startTime":577.1999998092651,"connectEnd":577.1999998092651,"connectStart":577.1999998092651,"domainLookupEnd":577.1999998092651,"domainLookupStart":577.1999998092651,"fetchStart":577.1999998092651,"redirectEnd":0,"redirectStart":0,"requestStart":577.1999998092651,"responseEnd":1254.9000000953674,"responseStart":1254.9000000953674,"secureConnectionStart":577.1999998092651},{"duration":1951.0999999046326,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":596.5,"connectEnd":596.5,"connectStart":596.5,"domainLookupEnd":596.5,"domainLookupStart":596.5,"fetchStart":596.5,"redirectEnd":0,"redirectStart":0,"requestStart":596.5,"responseEnd":2547.5999999046326,"responseStart":2547.5999999046326,"secureConnectionStart":596.5},{"duration":1962.4000000953674,"initiatorType":"script","name":"https://jira.mariadb.org/s/d41d8cd98f00b204e9800998ecf8427e-CDN/lu2cib/820016/12ta74/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":602.5999999046326,"connectEnd":602.5999999046326,"connectStart":602.5999999046326,"domainLookupEnd"
:602.5999999046326,"domainLookupStart":602.5999999046326,"fetchStart":602.5999999046326,"redirectEnd":0,"redirectStart":0,"requestStart":602.5999999046326,"responseEnd":2565,"responseStart":2565,"secureConnectionStart":602.5999999046326},{"duration":477.90000009536743,"initiatorType":"xmlhttprequest","name":"https://jira.mariadb.org/rest/webResources/1.0/resources","startTime":1432,"connectEnd":1432,"connectStart":1432,"domainLookupEnd":1432,"domainLookupStart":1432,"fetchStart":1432,"redirectEnd":0,"redirectStart":0,"requestStart":1432,"responseEnd":1909.9000000953674,"responseStart":1909.9000000953674,"secureConnectionStart":1432},{"duration":749.6999998092651,"initiatorType":"script","name":"https://www.google-analytics.com/analytics.js","startTime":1841,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":1841,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":2590.699999809265,"responseStart":0,"secureConnectionStart":0}],"fetchStart":0,"domainLookupStart":0,"domainLookupEnd":0,"connectStart":0,"connectEnd":0,"requestStart":341,"responseStart":566,"responseEnd":602,"domLoading":569,"domInteractive":2756,"domContentLoadedEventStart":2756,"domContentLoadedEventEnd":2827,"domComplete":4735,"loadEventStart":4735,"loadEventEnd":4736,"userAgent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)","marks":[{"name":"bigPipe.sidebar-id.start","time":2685.199999809265},{"name":"bigPipe.sidebar-id.end","time":2686.199999809265},{"name":"bigPipe.activity-panel-pipe-id.start","time":2686.4000000953674},{"name":"bigPipe.activity-panel-pipe-id.end","time":2690.2999997138977},{"name":"activityTabFullyLoaded","time":2851.4000000953674}],"measures":[],"correlationId":"635c25475173c4","effectiveType":"4g","downlink":9.7,"rtt":0,"serverDuration":165,"dbReadsTimeInMs":41,"dbConnsTimeInMs":54,"applicationHash":"9d11dbea5f4be3d4cc21f03a88dd11d8c8687422","experiments":[]}}
Because multi-versioning in secondary indexes is based on duplicating the records, it does not appear necessary to store an undo log pointer (DB_ROLL_PTR) in secondary index records. In the clustered index, we must store it, because all versions of a record share a single primary key.
If we always stored a per-record transaction identifier in secondary index leaf pages, the access patterns that are explained in a lengthy comment at the start of the file read0read.cc would be greatly simplified. We would never have to visit the clustered index to check whether a secondary index record is visible or implicitly locked.
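To make the simplification concrete, here is a minimal model (not actual InnoDB code; names and structures are illustrative only) of how a per-record transaction identifier would reduce the secondary index visibility check to a pure comparison against the read view, with no clustered index lookup:

```python
# Hypothetical sketch of MVCC visibility with a per-record DB_TRX_ID
# in secondary index records. ReadView is a simplified stand-in for
# InnoDB's read view structure.
from dataclasses import dataclass


@dataclass
class ReadView:
    """Changes by transactions with id >= low_limit_id, or by
    transactions active at view creation, are invisible."""
    low_limit_id: int
    trx_ids: frozenset = frozenset()  # active transactions at view creation

    def changes_visible(self, trx_id: int) -> bool:
        return trx_id < self.low_limit_id and trx_id not in self.trx_ids


def sec_rec_visible(rec_trx_id: int, view: ReadView) -> bool:
    # With a per-record transaction id, this is a plain comparison:
    # no clustered index lookup, no version reconstruction.
    return view.changes_visible(rec_trx_id)


# A view created before transaction 101 committed:
view = ReadView(low_limit_id=101)
assert sec_rec_visible(100, view)       # older change: visible
assert not sec_rec_visible(101, view)   # newer change: invisible
```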
When InnoDB evaluates indexed virtual column values
The code that evaluates indexed virtual columns (MDEV-5800) in InnoDB is causing hard-to-fix problems, such as MDEV-16962. A sign of virtual column evaluation is the function innodb_find_table_for_vc(). It is being called from innobase_allocate_row_for_vcol(), which in turn is called due to the following:

Note: row_vers_impl_x_locked_low() (checking whether a secondary index record is implicitly locked) will retrieve the virtual column information from undo log records and does not have to evaluate anything. All the undo log record types TRX_UNDO_UPD_EXIST_REC, TRX_UNDO_UPD_DEL_REC, and TRX_UNDO_DEL_MARK_REC will contain the values of all indexed virtual columns.
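For contrast, the cheap form of the implicit-lock check, as it already works for clustered index records, can be modelled like this (purely illustrative, not InnoDB source; with a per-record DB_TRX_ID the same check would apply to secondary index records):

```python
# Sketch: an implicit-lock check is just a membership test of the
# record's DB_TRX_ID in the set of active (or XA PREPARE) transactions.
from typing import Optional

active_trx = {101, 102}  # transactions in ACTIVE or XA PREPARE state


def impl_locked_by(rec_trx_id: int) -> Optional[int]:
    """Return the id of the transaction implicitly locking the record,
    or None if the writer has already committed."""
    return rec_trx_id if rec_trx_id in active_trx else None


assert impl_locked_by(101) == 101  # written by an active transaction
assert impl_locked_by(100) is None  # writer already committed
```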
Indexed virtual column evaluation that cannot be avoided
Until MDEV-22361 moves the FOREIGN KEY processing to the SQL layer, InnoDB must be able to evaluate indexed virtual column values when a CASCADE or SET NULL operation would change the values of base columns.
ALTER TABLE…ADD INDEX or CREATE INDEX on virtual columns must be able to evaluate the column values. The calling thread is necessarily holding MDL, and the THD and both old and new TABLE are available.
When an INSERT (or UPDATE of a PRIMARY KEY) replaces a delete-marked record, that record must either have been deleted by the current transaction, or a DELETE must have been executed by a previously committed transaction. Either way, the delete-marked record may be accessible to old MVCC read views. The TRX_UNDO_UPD_DEL_REC must store both the old and new values of all indexed columns. The information is strictly needed on ROLLBACK if the current transaction executed both the DELETE and the INSERT (or UPDATE).
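The requirement on TRX_UNDO_UPD_DEL_REC can be illustrated with a toy layout (this is not the on-disk undo log format; field names are invented for illustration):

```python
# Illustrative model of a TRX_UNDO_UPD_DEL_REC carrying both the old
# and the new values of an indexed column, so that ROLLBACK can
# reinstate the delete-marked record.
undo_upd_del_rec = {
    "type": "TRX_UNDO_UPD_DEL_REC",
    "trx_id": 101,
    "indexed_cols": {"b": {"old": 1, "new": 2}},
}


def rollback_value(rec: dict, col: str):
    # On ROLLBACK, the old value of each indexed column is restored.
    return rec["indexed_cols"][col]["old"]


assert rollback_value(undo_upd_del_rec, "b") == 1
```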
Eliminating indexed virtual column evaluation in MVCC and Purge
It turns out that we will only need a trivial comparison of the per-record transaction identifier in the secondary index record against the current read view or the clustered index record.
Let us prove this for row_vers_build_clust_v_col() in more detail. The following quote from the lengthy comment in read0read.cc is relevant:
The function row_purge_poss_sec() invokes row_vers_old_has_index_entry(true, …) on a clustered index record rec and a secondary index tuple ientry that has been constructed from rec (non-virtual column values) and the undo log record (indexed virtual column values). So, both rec and ientry are valid. At this point, we are not holding a secondary index page latch; only the clustered index leaf page is latched. The aim is to check whether any visible (active) version of rec matches ientry.
If we have a transaction identifier in the secondary index record, this matching also becomes a simple matter of comparing the DB_TRX_ID in the available clustered index record versions with the one in the secondary index record.
Furthermore, we can avoid constructing the exact clustered index record value for each active version and converting it into secondary index record values. We will merely have to determine the DB_TRX_ID of each active version of rec by dereferencing the chain of DB_ROLL_PTR.
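A hedged sketch of this cheaper walk (simplified stand-in structures, not the real undo log pointer encoding): we only traverse the DB_ROLL_PTR chain collecting DB_TRX_ID values, never materializing full record versions.

```python
# Model of a clustered index record's version chain: each version
# carries a trx_id, and roll_ptr links to the previous version.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    trx_id: int
    roll_ptr: Optional["Version"]  # previous version via undo log, or None


def version_trx_ids(rec: Version) -> list:
    """Collect DB_TRX_ID of each version, newest first, without
    reconstructing the column values of any version."""
    ids = []
    v: Optional[Version] = rec
    while v is not None:
        ids.append(v.trx_id)
        v = v.roll_ptr
    return ids


# INSERT by transaction 100, then UPDATE by transaction 101:
v_insert = Version(trx_id=100, roll_ptr=None)
v_update = Version(trx_id=101, roll_ptr=v_insert)
assert version_trx_ids(v_update) == [101, 100]
```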
In other words, we can greatly simplify the function trx_undo_prev_version_build(). Let us consider the following example:
--source include/have_innodb.inc
disconnect mvcc;
Let us assume that the INSERT was performed by transaction 100 and the UPDATE by transaction 101. The database contents just before the disconnect mvcc will be as follows:
With a per-record transaction identifier in the secondary index, we will have 2 records and can filter out those transaction identifiers that do not match the visible clustered index record version.
Without a per-record transaction identifier, we are forced to construct the secondary index record from the visible clustered index record version. With the per-record transaction identifier, we can avoid that as follows:
MVCC (SELECT b FROM t)
Purge (row_purge_poss_sec(), after disconnect mvcc)
At the end of the purge (if it completes before DROP TABLE t), the table will contain the following:
(MDEV-12288)

Compatibility notes
InnoDB versions from before this change will treat secondary index leaf pages that carry the two new special PAGE_MAX_TRX_ID values as corrupted. Hence, tables can be exported to old versions only after dropping all secondary indexes.
We might want to retain support for the old format (without per-record transaction identifiers in secondary indexes) and remove that support in a later major release. Until the support is removed, this would allow users to upgrade without having to rebuild secondary indexes (or entire tables).
We could also require that all secondary indexes be rebuilt on upgrade. Any secondary index leaf page where PAGE_MAX_TRX_ID is not one of our two special values would be treated as corrupted; such indexes would not be updated, and user transactions would be aborted.