[MDEV-17598] InnoDB index option for per-record transaction ID Created: 2018-11-02 Updated: 2024-02-07 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB, Virtual Columns |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 5 |
| Labels: | lock, performance, purge | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
In secondary indexes, InnoDB trades complexity for space by not keeping track of version history for each secondary index record separately. This adds complexity to three interdependent areas: multi-version concurrency control (MVCC), purge of transaction history, and implicit locking.

In multi-versioning, if a secondary index record has been marked for deletion, or if PAGE_MAX_TRX_ID indicates that the page where the record resides has been recently modified, InnoDB has to look up the record in the clustered index and then painstakingly construct each available version of the record to find out whether the secondary index record exists in the current read view.

As noted in MDEV-11215, a record is implicitly locked by a transaction if it was written or modified by that transaction. If a conflicting lock request arrives from another transaction, that transaction may convert the implicit lock to an explicit one (on behalf of the lock-owner transaction) before creating its own lock object and starting the lock wait. For clustered index records, determining whether a record is implicitly locked is easy: the hidden DB_TRX_ID column will belong to an active (or XA PREPARE, not yet committed) transaction. For secondary index records it is very expensive, because there is no per-record DB_TRX_ID, only PAGE_MAX_TRX_ID. Therefore the function row_vers_impl_x_locked() has to look up the matching version of the record in the clustered index before we can determine whether the secondary index record is implicitly locked.

This and the similarly slow MVCC logic can lead to a 'death spiral' on a busy server: these operations slow down the purge of transaction history, and they become slower as the history grows. In my response to a blog post I mentioned Bug #14704286 SECONDARY INDEX UPDATES MAKE CONSISTENT READS DO O(N^2) UNDO PAGE LOOKUPS.
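The asymmetry between the two implicit-lock checks can be sketched as a toy Python model (this is not InnoDB code; ACTIVE_TRX and all function names are illustrative stand-ins for trx_sys state and for the lock_clust_rec_some_has_impl()/row_vers_impl_x_locked() paths):

```python
# Toy model (not InnoDB code) contrasting the implicit-lock checks.
# ACTIVE_TRX stands in for trx_sys state; all names are illustrative.

ACTIVE_TRX = {101}  # transactions that are active or in XA PREPARE

def clust_rec_is_implicitly_locked(db_trx_id):
    # Clustered index: a single comparison on the hidden DB_TRX_ID suffices.
    return db_trx_id in ACTIVE_TRX

def sec_rec_is_implicitly_locked_today(page_max_trx_id, clust_versions, sec_values):
    # Secondary index today: PAGE_MAX_TRX_ID is only a page-level filter.
    # On a possible hit, the matching clustered index record version must
    # be located (the expensive row_vers_impl_x_locked() path).
    if page_max_trx_id not in ACTIVE_TRX and page_max_trx_id < min(ACTIVE_TRX):
        return False  # page last modified by an old, committed transaction
    for trx_id, values in clust_versions:  # walk versions via DB_ROLL_PTR
        if values == sec_values:
            return trx_id in ACTIVE_TRX
    return False

def sec_rec_is_implicitly_locked_proposed(rec_trx_id):
    # With a per-record DB_TRX_ID, the secondary index check becomes
    # identical to the clustered index check.
    return rec_trx_id in ACTIVE_TRX
```

The point of the sketch is the shape of the code: the "today" path must reconstruct clustered index versions, while the proposed path is one field comparison.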
If we had DB_TRX_ID in InnoDB secondary index records, the existence of implicit locks could be detected directly based on that field, just like in clustered indexes. To reduce the space overhead, we could introduce a change to the secondary index leaf page header, allowing a choice between the old and new record formats. This could be implemented by repurposing the field PAGE_MAX_TRX_ID.
With this, we could gradually convert secondary index pages to the new format as the pages are being modified. No change to the metadata would be needed; all secondary indexes would be in this hybrid format. Tables could be imported to old MariaDB versions, but secondary indexes would have to be dropped and re-created if any page contains the extra fields. |
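The lazy page-by-page conversion could look roughly like the following Python sketch. Everything here is a hypothetical assumption, not the design: the sentinel value, the dict-based page layout, the function names, and the policy of seeding each record's DB_TRX_ID with the old PAGE_MAX_TRX_ID (a conservative upper bound) are all illustrative only.

```python
# Hypothetical sketch of the hybrid format: a special marker (repurposing
# PAGE_MAX_TRX_ID) says whether records on this leaf page carry DB_TRX_ID.
# The sentinel value and page layout are invented for illustration.

PER_RECORD_TRX_ID_MARKER = 2**64 - 1  # illustrative sentinel, not a real value

def page_has_per_record_trx_id(page_max_trx_id):
    return page_max_trx_id == PER_RECORD_TRX_ID_MARKER

def convert_page_on_modify(page):
    # Old-format pages gain the extra field lazily, on first modification.
    if not page_has_per_record_trx_id(page['PAGE_MAX_TRX_ID']):
        for rec in page['records']:
            # The old page-level value is a conservative upper bound for
            # every record on the page.
            rec['DB_TRX_ID'] = page['PAGE_MAX_TRX_ID']
        page['PAGE_MAX_TRX_ID'] = PER_RECORD_TRX_ID_MARKER
    return page
```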
| Comments |
| Comment by Marko Mäkelä [ 2020-04-24 ] | |||||||||||||
|
Because multi-versioning in secondary indexes is based on duplicating the records, it does not appear necessary to store an undo log pointer (DB_ROLL_PTR) in secondary index records. In the clustered index, we must store it, because all versions of a record share a single primary key. If we always stored a per-record transaction identifier in secondary index leaf pages, the access patterns that are explained in a lengthy comment at the start of the file read0read.cc would be greatly simplified. We would never have to visit the clustered index to check whether a secondary index record is visible or implicitly locked.

When InnoDB evaluates indexed virtual column values

The code that evaluates indexed virtual columns (…
Note: row_vers_impl_x_locked_low() (checking whether a secondary index record is implicitly locked) will retrieve the virtual column information from undo log records and does not have to evaluate anything. All the undo log record types TRX_UNDO_UPD_EXIST_REC, TRX_UNDO_UPD_DEL_REC and TRX_UNDO_DEL_MARK_REC will contain the values of all indexed virtual columns.

Indexed virtual column evaluation that cannot be avoided

Until MDEV-22361 moves the FOREIGN KEY processing to the SQL layer, InnoDB must be able to evaluate indexed virtual column values when a CASCADE or SET NULL operation would change the values of base columns. ALTER TABLE…ADD INDEX or CREATE INDEX on virtual columns must be able to evaluate the column values. The calling thread is necessarily holding MDL, and the THD and both the old and new TABLE are available.

When an INSERT (or an UPDATE of a PRIMARY KEY) replaces a delete-marked record, that record must either have been deleted by the current transaction, or a DELETE must have been executed by a previously committed transaction. Either way, the delete-marked record may be accessible to old MVCC read views. The TRX_UNDO_UPD_DEL_REC must store both the old and new values of all indexed columns. The information is strictly needed on ROLLBACK if the current transaction executed both the DELETE and the INSERT (or UPDATE).

Eliminating indexed virtual column evaluation in MVCC and purge

It turns out that we will only need a trivial comparison of a per-record transaction identifier between the secondary index record and the current read view or the clustered index record.
Let us prove this for row_vers_build_clust_v_col() in more detail. The following quote from the lengthy comment in read0read.cc is relevant:
The function row_purge_poss_sec() invokes row_vers_old_has_index_entry(true, …) on a clustered index record rec and a secondary index tuple ientry that has been constructed from rec (non-virtual column values) and the undo log record (indexed virtual column values). So, both rec and ientry are valid. At this point, we are not holding a secondary index page latch; only the clustered index leaf page is latched. The aim is to check whether any visible (active) version of rec matches ientry. If we have a transaction identifier in the secondary index record, this matching also becomes a simple matter of comparing the DB_TRX_ID in the available clustered index record versions with the one in the secondary index record. Furthermore, we can avoid constructing the exact clustered index record value for each active version and converting it into secondary index record values. We merely have to determine the DB_TRX_ID of each active version of rec by dereferencing the chain of DB_ROLL_PTR. In other words, we can greatly simplify the function trx_undo_prev_version_build(). Let us consider the following example:
Let us assume that the INSERT was performed by transaction 100 and the UPDATE by transaction 101. The database contents before the disconnect mvcc will be as follows:
With a per-record transaction identifier in the secondary index, we will have 2 records and can filter out those transaction identifiers that do not match the visible clustered index record version. Without a per-record transaction identifier, we are forced to construct the secondary index record from the visible clustered index record version. With the per-record transaction identifier, we can avoid that as follows:

MVCC (SELECT b FROM t)
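A minimal Python model of this filtering, following the example (transaction 100 inserted ('x',1); transaction 101 delete-marked it and inserted ('X',1)). The read-view model is deliberately simplistic (everything at or below a cutoff transaction identifier is visible), and all names are illustrative, not InnoDB code:

```python
# Sketch of MVCC (SELECT b FROM t) with per-record transaction identifiers.
# A record is shown if its inserting transaction is in the read view and
# its delete-mark (if any) is not.

def visible(read_view_up_to, trx_id):
    # Simplistic read view: committed transactions up to a cutoff are visible.
    return trx_id <= read_view_up_to

def mvcc_select(sec_recs, read_view_up_to):
    # sec_recs: list of (value, pk, insert_trx_id, delete_trx_id or None)
    out = []
    for value, pk, trx_id, del_trx_id in sec_recs:
        inserted_visible = visible(read_view_up_to, trx_id)
        delete_visible = del_trx_id is not None and visible(read_view_up_to, del_trx_id)
        if inserted_visible and not delete_visible:
            out.append((value, pk))
    return out
```

Note that no clustered index lookup is modeled at all: the per-record transaction identifiers alone decide visibility.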
Purge (row_purge_poss_sec(), after disconnect mvcc)
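The simplified purge check described above (comparing DB_TRX_ID values instead of rebuilding each clustered index record version) might be modeled like this. The dict-based undo log and the function names are assumptions for illustration, not the actual trx_undo_prev_version_build() interface:

```python
# Toy model of the simplified row_purge_poss_sec() check: collect only the
# DB_TRX_ID of each clustered record version by walking the DB_ROLL_PTR
# chain, never materializing the versions themselves.

def version_trx_ids(clust_rec, undo):
    # clust_rec: (DB_TRX_ID, DB_ROLL_PTR); undo maps roll pointers to the
    # previous (DB_TRX_ID, DB_ROLL_PTR) pair.
    trx_id, roll_ptr = clust_rec
    while True:
        yield trx_id
        if roll_ptr is None:
            return
        trx_id, roll_ptr = undo[roll_ptr]

def purge_poss_sec(clust_rec, undo, sec_rec_trx_id):
    # The secondary index record may be purged only if no remaining version
    # of the clustered record carries its transaction identifier.
    return sec_rec_trx_id not in set(version_trx_ids(clust_rec, undo))
```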
At the end of the purge (if it completes before DROP TABLE t), the table will contain the following:
Compatibility notes

InnoDB versions that predate this change will treat secondary index leaf pages carrying the two new special PAGE_MAX_TRX_ID values as corrupted. Hence, tables can be exported to old versions only after dropping all secondary indexes. We might want to retain support for the old format (without per-record transaction identifiers in secondary indexes) and remove the support in a later major release. Until the support is removed, this would allow users to upgrade without having to rebuild secondary indexes (or entire tables). Alternatively, we could require that all secondary indexes be rebuilt on upgrade: any secondary index leaf page where PAGE_MAX_TRX_ID is not one of our two special values would be treated as corrupted, the secondary indexes would not be updated, and user transactions would be aborted. | |||||||||||||
| Comment by Marko Mäkelä [ 2020-04-24 ] | |||||||||||||
|
The format change is not easily adaptable to ROW_FORMAT=COMPRESSED; we might address that in a separate task. valerii expressed a concern that for large INSERT, DELETE or UPDATE operations affecting many secondary index leaf pages, we may have to split and merge pages frequently, because the record size would change a lot.
In my example, I forgot that the secondary index record ('x',1) would be delete-marked when the 'x' is updated to 'X'. For delete-marked records, we will probably have to store two transaction identifiers: the one that inserted the record (this can be 0 if the record is visible in all read views, that is, no old history is available), and the transaction identifier that delete-marked the record. If this turns out to be insufficient, we may have to allow an arbitrary number of transaction identifiers to be stored for a single secondary index record. It would be preferable to avoid encoding DB_ROLL_PTR. When storing multiple consecutive records with the same PRIMARY KEY value, it could be useful to omit the value for subsequent records. For example, for the second record of ('x',1,100,delete-marked-by(101)),('X',1,101) we might set a flag in info_bits to indicate that the PRIMARY KEY is the same as in the preceding record. |
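The suggested encoding (two transaction identifiers per delete-marked record, plus a same-PRIMARY-KEY flag) might be sketched as follows. The flag value and the tuple layout are hypothetical, and real InnoDB info_bits are not modeled:

```python
# Hypothetical encoding sketch: a delete-marked secondary index record stores
# an insert trx id and a delete trx id; a record whose PRIMARY KEY equals the
# preceding record's may omit the PK and set a flag instead.

SAME_PK_FLAG = 0x1  # illustrative bit, not an actual InnoDB info_bits value

def encode(records):
    # records: list of (value, pk, insert_trx_id, delete_trx_id or None)
    out, prev_pk = [], object()  # sentinel that equals no real PK
    for value, pk, ins, dele in records:
        flags = SAME_PK_FLAG if pk == prev_pk else 0
        stored_pk = None if flags & SAME_PK_FLAG else pk  # omit duplicate PK
        out.append((value, stored_pk, ins, dele, flags))
        prev_pk = pk
    return out
```

Applied to the example, the second record ('X',1,101) would omit the PRIMARY KEY value 1 and carry the flag instead.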