[MDEV-12212] plugin for returning random rows in a table Created: 2017-03-09  Updated: 2017-03-19

Status: Open
Project: MariaDB Server
Component/s: Plugins
Fix Version/s: None

Type: Task Priority: Major
Reporter: Daniel Black Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: gsoc17


 Description   

Selecting random rows from a table is a historicity hard problem. This is because the distribution of row values isn't evident.

"Workarounds" have existed for a while - http://jan.kneschke.de/projects/mysql/order-by-rand/ - fairly intrusive however.

Using engine independent statistics these are available in mariadb https://mariadb.com/kb/en/mariadb/engine-independent-table-statistics/.

To solve exposing a fake table it should be able to get a clean SQL to get random rows.

Like https://mariadb.com/kb/en/mariadb/sequence/ this should be a table discover https://mariadb.com/kb/en/mariadb/table-discovery/.

Mode of operations could be:

ANALYZE TABLE xxx PERSISTENT FOR ALL;
INSTALL SONAME "ha_rand";
SELECT * FROM rand_xxx_col1_5;

The select would return 5 random rows from the table xxx based on the distribution of col1.

Its use a a full member should be returned:

SELECT r.id, other.id FROM rand_xxx_col1_5 r JOIN other on r.id=other.id;

Goals in order:

  1. Initial table discovery and returning tables
  2. Participation in optimizer negotiation
  3. Ensuring a unique result - probably desirable from an expected use however altering the table syntax to have a flag as part of the name for this
  4. Make the table modifiable

I'm willing to mentor this, co-mentors with a good knowledge of the storage API and table discovery would be much appreciated



 Comments   
Comment by Sergei Golubchik [ 2017-03-09 ]

How would it select these random rows? Under the hood. Storage engine, discovery — it's just an interface, how will it know what records to return?

Comment by Daniel Black [ 2017-03-09 ]

By examining the engine independent statistics tables for the (db,table,column) the max, min and histogram can provide an even distribution of results by equating this distribution to an approximate value for the primary key.

Comment by Sergei Petrunia [ 2017-03-17 ]

I'm wondering if this task should be actually TABLE SAMPLING. TABLE SAMPLING is a standard feature from SQL:2003,
PostgreSQL has it: https://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/ , and they did it as part of the GSoC project.

Comment by Daniel Black [ 2017-03-19 ]

Nice - thanks psergey. I'm happy to implement it in this way and still using EIS (if available) as the basis for selection.

As a stretch goal an auxiliary sampling standard (not BERNOULLI or SYSTEM) could define a sampling based on column values.

Generated at Thu Feb 08 07:55:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.