[MDEV-19556] Support native storage engine sampling of rows Created: 2019-05-22  Updated: 2022-06-28

Status: Open
Project: MariaDB Server
Component/s: Optimizer, Storage Engine - Aria, Storage Engine - InnoDB, Storage Engine - MyISAM
Fix Version/s: None

Type: Task Priority: Major
Reporter: Vicențiu Ciorbaru Assignee: Vicențiu Ciorbaru
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Relates
relates to MDEV-15020 Server hangs due to InnoDB persistent... Closed
relates to MDEV-28637 Add basic TABLESAMPLE SYSTEM support Stalled

 Description   

Histogram collection has been augmented in 10.4 with the ability to collect a percentage of rows. This was implemented via Bernoulli sampling. The drawback is that one has to perform a full table scan to perform sampling. This technique has reduced the bottleneck of Histograms collection substantially, however it can still be improved.

Storage engine API should be extended to allow the server to make use (if available) storage engine sampling capabilities.

This feature can be additionally used to support fast approximation functions such as a version of fast count-distinct with an estimator attached. (ex: Smoothed Jackknife Estimator) Additionally, with native sampling support, one could afford to perform various optimizations in the background (such as statistics collection), as the performance impact would be much smaller. Another use case for native sampling is SELECT FROM <table-sample>

https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/

The implementation will be done for 2 different storage engines (Aria & Innodb). The algorithm will make use of a weighted index-dive (to counteract if index pages are unbalanced).



 Comments   
Comment by Sergei Petrunia [ 2022-05-20 ]

See MDEV-28637 for a patch that allows to test the sampling.

Generated at Thu Feb 08 08:52:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.