[MDEV-18188] Maintain persistent COUNT(*) in InnoDB Created: 2019-01-09  Updated: 2023-11-30

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: Marko Mäkelä Assignee: Thirunarayanan Balathandayuthapani
Resolution: Unresolved Votes: 5
Labels: performance, statistics

Issue Links:
Relates
relates to MDEV-10128 Port MySQL 5.7 select count() optimiz... Closed
relates to MDEV-17605 Statistics for InnoDB table is wrong ... Closed

 Description   

Query planning needs to know the number of records in a table. Currently, InnoDB only provides an estimate of this.

If InnoDB kept accurate track of the number of records in a table, then it would not only benefit statistics, but also limited cases like the following:

SELECT COUNT(*) FROM t LOCK IN SHARE MODE;
SELECT COUNT(*) FROM t FOR UPDATE;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT COUNT(*) FROM t;

At the moment, InnoDB would scan the entire table in order to count the records. If we maintained a durable count of committed records, we could instantly return the results for the above cases. The locking variants would simply lock the table. For the default REPEATABLE READ we would still have to count the records.

However, if we additionally maintained a count of uncommitted rows, then we could have instant COUNT in most cases:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM t;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- instantaneous if committed count = uncommitted count
SELECT COUNT(*) FROM t;

These counts could possibly be maintained in the clustered index root page. If this hurts concurrency too much (in MDEV-6076, storing the persistent AUTO_INCREMENT there did not seem to hurt), then we could partition the two counters into multiple pages.



 Comments   
Comment by Marko Mäkelä [ 2019-02-24 ]

While doing this, we could also introduce a persistent checksum on the data, synchronized with every update of the clustered index. The checksum would necessarily be something else than what is currently returned by CHECKSUM TABLE, because it would depend on the page contents and would not be updated on instant ALTER TABLE operations that would affect the logical data format.

Comment by Marko Mäkelä [ 2020-12-03 ]

Based on some technical difficulties that we have encountered during the development of MDEV-515, I think we’d better store the persistent count information in the clustered index root page, in place of the "infimum" and "supremum" strings. MDEV-15562 already repurposed some of these bytes in 10.4. The original idea was to extend the MDEV-15562 file format change in a different way, that is, storing the information in the metadata BLOB. Using the metadata BLOB would break IMPORT TABLESPACE (MDEV-18543) and if we go with the planned limitation, disable MDEV-515.

Neither way of extending the file format would support persistent COUNT(*) for ROW_FORMAT=COMPRESSED tables, which we are hoping to phase out, as noted in MDEV-23497.

We could protect the updates of the persistent count in a similar way as we protect the changes of the persistent AUTO_INCREMENT (MDEV-6076). Similar to that field, there should be no reader-writer conflicts even if we allow concurrent reads on the page, because both the persistent AUTO_INCREMENT and COUNT(*) would only be read when the table is added to the data dictionary cache.

Generated at Thu Feb 08 08:42:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.