[MDEV-12212] plugin for returning random rows in a table Created: 2017-03-09 Updated: 2017-03-19 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Plugins |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major |
| Reporter: | Daniel Black | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | gsoc17 | ||
| Description |
|
Selecting random rows from a table is a historicity hard problem. This is because the distribution of row values isn't evident. "Workarounds" have existed for a while - http://jan.kneschke.de/projects/mysql/order-by-rand/ - fairly intrusive however. Using engine independent statistics these are available in mariadb https://mariadb.com/kb/en/mariadb/engine-independent-table-statistics/. To solve exposing a fake table it should be able to get a clean SQL to get random rows. Like https://mariadb.com/kb/en/mariadb/sequence/ this should be a table discover https://mariadb.com/kb/en/mariadb/table-discovery/. Mode of operations could be:
The select would return 5 random rows from the table xxx based on the distribution of col1. Its use a a full member should be returned:
Goals in order:
I'm willing to mentor this, co-mentors with a good knowledge of the storage API and table discovery would be much appreciated |
| Comments |
| Comment by Sergei Golubchik [ 2017-03-09 ] |
|
How would it select these random rows? Under the hood. Storage engine, discovery — it's just an interface, how will it know what records to return? |
| Comment by Daniel Black [ 2017-03-09 ] |
|
By examining the engine independent statistics tables for the (db,table,column) the max, min and histogram can provide an even distribution of results by equating this distribution to an approximate value for the primary key. |
| Comment by Sergei Petrunia [ 2017-03-17 ] |
|
I'm wondering if this task should be actually TABLE SAMPLING. TABLE SAMPLING is a standard feature from SQL:2003, |
| Comment by Daniel Black [ 2017-03-19 ] |
|
Nice - thanks psergey. I'm happy to implement it in this way and still using EIS (if available) as the basis for selection. As a stretch goal an auxiliary sampling standard (not BERNOULLI or SYSTEM) could define a sampling based on column values. |