Details

    Description

      It is recommended to implement Oracle's SAMPLE clause in MariaDB when sql_mode=ORACLE is set.

      SAMPLE [ BLOCK ]
      (sample_percent)
      [ SEED (seed_value) ]

      The sample_clause lets you instruct the database to select from a random sample of data from the table, rather than from the entire table.

      • BLOCK instructs the database to attempt to perform random block sampling instead of random row sampling.
      • For sample_percent, specify the percentage of the total row or block count to be included in the sample.
      • SEED seed_value instructs the database to attempt to return the same sample from one execution to the next.

      Oracle's SAMPLE clause documentation and examples: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6

        Activity

          How is this syntax used in the application?

          • In how many places?
          • Does this have to be fast or is it ok to always do a table scan when doing sampling?
          • BLOCK with Oracle only works with table scans, so it's probably not that fast either.
          • Is BLOCK important? The question arises because not all storage engines work with blocks, and it's hard to do a general implementation of BLOCK. We can of course make BLOCK a no-op for storage engines that do not work with blocks.

          Why isn't the following good enough?
          SELECT * FROM t1 WHERE rand() < 0.1 LIMIT 1000;
          This returns roughly 10% of the rows, up to 1000 rows.

          It would not be hard to implement the BLOCK syntax but internally replace it with WHERE rand() < X for the tables we do table scans on.
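The filter-plus-limit suggestion above can be checked outside the server. A minimal Python simulation (the table is just a list of row ids here, not a real table):

```python
import random

def sample_filter_limit(rows, fraction, limit):
    """Simulate SELECT * FROM t1 WHERE rand() < fraction LIMIT limit.

    Each row is kept independently with probability `fraction`
    (Bernoulli sampling), and at most `limit` rows are returned.
    """
    result = []
    for row in rows:
        if random.random() < fraction:
            result.append(row)
            if len(result) == limit:
                break
    return result

random.seed(42)
rows = list(range(100_000))
sample = sample_filter_limit(rows, 0.1, 1000)
```

Since 10% of 100,000 rows is far more than 1000, the LIMIT caps the result at 1000 rows; note the scan can stop early once the limit is reached, which is part of why this approach can be acceptably fast.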

          monty Michael Widenius added a comment -
          woqutech woqutech added a comment -
          • the sample clause is not widely used
          • sampling needs to be faster than a table scan
          • BLOCK is not important; you can ignore it for now
          • SELECT * FROM t1 WHERE rand() < 0.1 LIMIT 1000 is good enough if it is faster than a table scan
          • we should also support sampling of consecutive rows or blocks, like SAMPLE BLOCK (1, 1) SEED (1): https://jonathanlewis.wordpress.com/2010/03/12/sample-clause/. But we can also ignore the consecutive rows/blocks syntax for now.
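The SEED part of the clause can be modeled the same way: with a fixed seed the pseudo-random choices, and hence the sample, repeat across executions. A sketch in Python (the function name is invented for illustration):

```python
import random

def seeded_sample(rows, percent, seed_value):
    """Model SAMPLE (percent) SEED (seed_value): a Bernoulli sample
    whose row choices repeat for a given seed."""
    rng = random.Random(seed_value)  # per-query generator seeded by SEED
    return [row for row in rows if rng.random() < percent / 100.0]

rows = list(range(10_000))
first = seeded_sample(rows, 1, 1)   # SAMPLE (1) SEED (1)
second = seeded_sample(rows, 1, 1)  # same seed gives the same sample
```

Note this only guarantees repeatability when the underlying row order is stable; if the scan order changes between executions, the same seed no longer reproduces the same sample.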

          SQL Standard syntax, FYI:

          <table factor> ::= <table primary> [ <sample clause> ]
          <sample clause> ::= TABLESAMPLE <sample method> <left paren> <sample percentage> <right paren> [ <repeatable clause> ]
          <sample method> ::= BERNOULLI | SYSTEM
          <repeatable clause> ::= REPEATABLE <left paren> <repeat argument> <right paren>
          <sample percentage> ::= <numeric value expression>
          <repeat argument> ::= <numeric value expression>
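The two standard sample methods differ in granularity: BERNOULLI decides per row, SYSTEM decides per block (page). A rough Python model of the distinction; the block size and row layout are invented for illustration:

```python
import random

def bernoulli_sample(rows, percent, rng):
    # BERNOULLI: each row is included independently with
    # probability percent/100.
    return [r for r in rows if rng.random() < percent / 100.0]

def system_sample(rows, percent, rng, block_size=16):
    # SYSTEM: whole blocks of rows are included with probability
    # percent/100, which is cheaper to read but gives a "clumpy" sample.
    sample = []
    for start in range(0, len(rows), block_size):
        if rng.random() < percent / 100.0:
            sample.extend(rows[start:start + block_size])
    return sample

rows = list(range(1_024))
b = bernoulli_sample(rows, 10, random.Random(7))
s = system_sample(rows, 10, random.Random(7))
```

This mirrors the Oracle discussion above: SYSTEM corresponds to BLOCK sampling, while the row-at-a-time fallback being discussed is BERNOULLI.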
          

          serg Sergei Golubchik added a comment -
          woqutech woqutech added a comment - - edited

          I implemented a simple version of the SAMPLE syntax:

          1. Add the SAMPLE/SEED keywords and the opt_sample_clause rule in the sql_yacc.yy file
          2. Add opt_sample_clause to the table_primary_ident rule
          3. Leave the original data acquisition process unchanged. Modify the row return path (the evaluate_join_record function) to decide whether a row is returned based on a random value and the requested sampling percentage
          4. Generate a random number and check each of its binary bits in turn. If the bit is 1 and the number of rows returned so far (return_rows) divided by the total number of rows seen so far (total_rows) is below the percentage, return the row and add 1 to both return_rows and total_rows; otherwise skip the row and add 1 to total_rows only
          5. After every bit of the random number has been consumed, generate a new random number and repeat step 4 until all rows have been processed
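The per-row decision in steps 3-5 can be sketched as follows. This is only one interpretation of the description, in Python; the actual patch would live in C++ inside evaluate_join_record, and the variable names return_rows/total_rows are taken from the steps above:

```python
import random

def sample_rows(rows, percent, seed=None):
    """Sketch of steps 3-5: walk the bits of a random word; a row is
    returned when the current bit is 1 and the fraction of rows
    returned so far is still below the target percentage."""
    rng = random.Random(seed)
    return_rows = 0
    total_rows = 0
    result = []
    bits = 0
    bits_left = 0
    for row in rows:
        if bits_left == 0:          # step 5: draw a fresh random word
            bits = rng.getrandbits(32)
            bits_left = 32
        bit = bits & 1              # step 4: examine the next bit
        bits >>= 1
        bits_left -= 1
        total_rows += 1
        if bit and return_rows < percent / 100.0 * total_rows:
            return_rows += 1
            result.append(row)      # return the row
        # otherwise only total_rows was advanced (the row is skipped)
    return result

sample = sample_rows(list(range(10_000)), 10, seed=1)
```

The ratio test keeps the returned fraction at or below the requested percentage, so the sample here can never exceed 10% of the scanned rows.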

          advantages:

          • Easy to implement
          • Only server-layer code is modified, so it can be applied to all storage engines

          disadvantages:

          • The original processing flow is unchanged, so query efficiency is not improved
          • The current implementation does not support sampling by block, so the BLOCK keyword is ignored if specified

          need some advice, thanks!


          Just to comment: this is Bernoulli sampling, so it's perfectly standard behavior and something we want to have eventually, sooner or later.

          serg Sergei Golubchik added a comment -
          cvicentiu Vicențiu Ciorbaru added a comment - - edited

          woqutech Bernoulli is something we have implemented before for histogram collection. You can see the sampling code here:

          sql/sql_statistics.cc
          void Column_statistics_collected::finish(ha_rows rows, double sample_fraction)
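This is not the actual sql_statistics.cc code, but the idea behind fraction-based sampling for statistics collection can be sketched in Python: look at each row with probability sample_fraction and scale the observed counts back up. The function name and predicate are invented for illustration:

```python
import random

def estimate_count(rows, predicate, sample_fraction, seed=None):
    """Estimate how many rows match `predicate` by examining each row
    with probability `sample_fraction` and scaling the matching count
    up by 1/sample_fraction."""
    rng = random.Random(seed)
    matched = 0
    for row in rows:
        if rng.random() < sample_fraction and predicate(row):
            matched += 1
    return matched / sample_fraction

rows = list(range(100_000))
# True count of even rows is 50_000; the estimate should land nearby.
est = estimate_count(rows, lambda r: r % 2 == 0, 0.05, seed=3)
```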
          

          However, I am not sure that modifying evaluate_join_record is the right approach, because it prevents the optimizer from making any use of the sample clause during optimization. Additionally, the sample clause should be evaluated before any optimizations such as key lookups take place; with your approach those would return incorrect results, at least according to the SAMPLE semantics described in the SQL standard.

          I suggest the following approach (which serg should comment upon too):

          1. Extend parser with the right syntax. (This is probably OK as described above by woqutech)
          2. Extend the storage engine API with new handler::ha_rnd_sample and handler::ha_rnd_init_sample methods, similar to ha_rnd_next and ha_rnd_init. This will allow an optimized storage engine implementation to be used where one exists.
          3. Implement the default "Bernoulli" sampling as a fallback implementation within the handler class.
          4. Disable indexes for the sampled table.
          5. HASH joins should be disabled for the sampled tables too. (This conclusion came up after a discussion with psergei.)
          6. The changes in the optimizer will involve extending the JOIN_TAB::type enum to include JT_SAMPLE, but I cannot immediately comment on all the required changes, because the optimizer "overrides" JOIN_TAB::type with JT_ALL for all access types that do not involve JT_REF (or related) access. (Notes from a discussion with psergei.) I will investigate this part further.
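A rough shape of the proposed handler extension, modeled in Python. Only the names ha_rnd_init_sample/ha_rnd_sample come from the list above (rendered here without the ha_ prefix); the toy Handler class and everything else is invented for illustration, with the default fallback being a full scan plus a Bernoulli filter:

```python
import random

class Handler:
    """Toy stand-in for the storage engine handler class."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = 0
        self._fraction = 1.0
        self._rng = random.Random()

    def rnd_init_sample(self, fraction, seed=None):
        # Default implementation: engines without native sampling
        # fall back to a sequential scan with Bernoulli filtering.
        # An engine could override this to sample pages directly.
        self._fraction = fraction
        self._rng = random.Random(seed)
        self._pos = 0

    def rnd_sample(self):
        # Return the next sampled row, or None at end of data.
        while self._pos < len(self._rows):
            row = self._rows[self._pos]
            self._pos += 1
            if self._rng.random() < self._fraction:
                return row
        return None

h = Handler(list(range(1_000)))
h.rnd_init_sample(0.1, seed=5)
sampled = []
row = h.rnd_sample()
while row is not None:
    sampled.append(row)
    row = h.rnd_sample()
```

Keeping the fallback inside the handler class, as proposed in step 3, means the optimizer only ever sees the sampled access method, which is what makes the JT_SAMPLE access type workable.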

          People

            Assignee: Unassigned
            Reporter: woqutech