|
How is this syntax used in the application?
- In how many places?
- Does this have to be fast, or is it OK to always do a table scan when sampling?
- BLOCK in Oracle only works with table scans, so it's probably not that fast either.
- Is BLOCK important? The question arises because not all storage engines work with blocks, and a general implementation of BLOCK is hard. We could of course make BLOCK a no-op for storage engines that do not work with blocks.
Why doesn't the following work well enough?
SELECT * FROM t1 WHERE rand() < 0.1 LIMIT 1000;
This returns 10% of the rows, up to 1000 rows.
It would not be hard to implement the BLOCK syntax but internally replace it with WHERE rand() < X for the tables we do table scans on.
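To make the semantics of that rewrite concrete, here is a minimal Python simulation of `WHERE rand() < 0.1 LIMIT 1000` (not server code; `bernoulli_sample` and its parameters are illustrative names I made up):

```python
import random

def bernoulli_sample(rows, fraction, limit=None, seed=None):
    """Simulate SELECT * FROM t1 WHERE rand() < fraction LIMIT limit.

    Each row is kept independently with probability `fraction`,
    which is exactly Bernoulli sampling over a full scan.
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        if rng.random() < fraction:
            out.append(row)
            if limit is not None and len(out) >= limit:
                break  # LIMIT reached, stop the scan early
    return out

# With 100,000 rows and fraction 0.1 we expect roughly 10,000 hits,
# so the LIMIT of 1000 is reached long before the scan ends.
sample = bernoulli_sample(range(100_000), 0.1, limit=1000, seed=42)
print(len(sample))  # -> 1000
```

Note that without the LIMIT every row is still read, which is why this form cannot be faster than a table scan.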
|
- The sample clause is not widely used.
- Sampling needs to be faster than a table scan.
- BLOCK is not important; you can ignore it for now.
- SELECT * FROM t1 WHERE rand() < 0.1 LIMIT 1000 is good enough if it is faster than a table scan.
- We should also support a sample clause for consecutive rows or blocks, like SAMPLE BLOCK ( 1 , 1 ) SEED ( 1 ): https://jonathanlewis.wordpress.com/2010/03/12/sample-clause/. But we can also ignore the consecutive rows/blocks syntax.
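For what it's worth, the "consecutive rows or blocks" variant could be sketched like this in Python (a toy model only; `block_sample` and `block_size` are made up here, and a real engine would use its physical blocks, not fixed-size row groups):

```python
import random

def block_sample(rows, fraction, block_size, seed=None):
    """Illustrative block sampling, loosely mirroring Oracle's
    SAMPLE BLOCK: rows are grouped into consecutive runs of
    `block_size` rows and each whole run is kept with probability
    `fraction`, so the output consists of consecutive chunks.
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(rows), block_size):
        if rng.random() < fraction:
            out.extend(rows[start:start + block_size])
    return out
```

This is cheaper than row-level sampling only when the engine can skip whole blocks without reading them, which is exactly why BLOCK is hard to support generically.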
|
|
SQL Standard syntax, FYI:
<table factor> ::= <table primary> [ <sample clause> ]
|
<sample clause> ::= TABLESAMPLE <sample method> <left paren> <sample percentage> <right paren> [ <repeatable clause> ]
|
<sample method> ::= BERNOULLI | SYSTEM
|
<repeatable clause> ::= REPEATABLE <left paren> <repeat argument> <right paren>
|
<sample percentage> ::= <numeric value expression>
|
<repeat argument> ::= <numeric value expression>
|
|
|
I implemented a simple version of the SAMPLE syntax:
1. Add the SAMPLE/SEED keywords and an opt_sample_clause rule in the sql_yacc.yy file.
2. Add opt_sample_clause to the table_primary_ident rule.
3. The original data acquisition process remains unchanged. Modify the row return path (the evaluate_join_record function) to decide, from a random value and the fraction of rows returned so far, whether a row is returned.
4. Generate a random number and check each of its binary bits in turn. If the bit is 1 and the number of rows returned so far (return_rows) divided by the total number of rows read so far (total_rows) is below the requested percentage, return the row and increment both return_rows and total_rows; otherwise skip the row and increment only total_rows.
5. Once every bit of the random number has been consumed, generate a new random number and repeat step 4 until all rows have been processed.
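As I understand it, the per-row decision in steps 3-5 can be sketched in Python as follows (a simulation of the described logic, not the actual patch; `sample_rows` and the 32-bit word size are illustrative choices):

```python
import random

def sample_rows(rows, fraction, seed=None, word_bits=32):
    """Bit-driven sampling as described in steps 3-5.

    A random word is drawn and consumed one bit at a time; a row is
    returned only when the current bit is 1 AND the fraction of rows
    returned so far is still below the requested percentage.
    """
    rng = random.Random(seed)
    returned = total = 0
    word = bits_left = 0
    out = []
    for row in rows:
        if bits_left == 0:          # step 5: draw a fresh random word
            word = rng.getrandbits(word_bits)
            bits_left = word_bits
        bit = word & 1              # step 4: inspect the next bit
        word >>= 1
        bits_left -= 1
        ratio = returned / total if total else 0.0
        if bit and ratio < fraction:
            out.append(row)         # return the row
            returned += 1
        total += 1                  # every scanned row counts
    return out
```

The ratio check makes the output fraction self-correcting, but since each bit is 1 only half the time, the scheme cannot return more than about half the rows, and every row is still read.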
Advantages:
- Easy to implement.
- Only server-layer code is modified, so it works with all storage engines.
Disadvantages:
- The original processing flow is unchanged, so query efficiency is not improved: every row is still read.
- The current implementation does not support sampling by block, so the BLOCK keyword is ignored when specified.
I need some advice, thanks!
|
|
Just to comment: this is Bernoulli sampling, so it's perfectly standard behavior and something we want to have eventually, sooner or later.
|
|
woqutech: Bernoulli sampling is something we have implemented before for histogram collection. You can see the sampling code here:
sql/sql_statistics.cc
|
void Column_statistics_collected::finish(ha_rows rows, double sample_fraction)
|
However, I am not sure that modifying evaluate_join_record is the right approach, because it prevents the optimizer from making any use of the sample clause during optimization. Additionally, the sample clause should be evaluated before any optimizations take place, such as key lookups; with your approach those would return incorrect results, at least according to the SAMPLE semantics described in the SQL standard.
I suggest the following approach (which serg should comment upon too):
- Extend parser with the right syntax. (This is probably OK as described above by woqutech)
- Extend the storage engine API with new handler::ha_rnd_sample and handler::ha_rnd_init_sample methods, similar to ha_rnd_next and ha_rnd_init. This allows an optimized storage engine implementation to be used where one exists.
- Implement the default "Bernoulli" sampling as a fallback implementation within the handler class.
- Disable index access for the sampled table.
- HASH joins should be disabled for sampled tables too. (This conclusion came out of a discussion with psergei.)
- The optimizer changes will involve extending the JOIN_TAB::type enum with JT_SAMPLE, but I cannot immediately comment on all the required changes, because the optimizer "overrides" JOIN_TAB::type with JT_ALL for all access types that do not involve JT_REF (or related) access (notes from a discussion with psergei). I will investigate this part further.
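A rough Python mock-up of the proposed handler-level fallback (everything here is hypothetical: the class, the way the seed is passed, and the iteration protocol are stand-ins for the real handler API, where the same shape would live in C++):

```python
import random

class SampleHandler:
    """Mock of a handler exposing ha_rnd_init_sample / ha_rnd_sample.

    A storage engine could override these with an optimized
    implementation; the default below is plain Bernoulli sampling on
    top of a full scan (rnd_next), mirroring the proposed fallback.
    """
    def __init__(self, rows):
        self._rows = rows

    def rnd_next(self, it):
        return next(it, None)            # full-scan access path

    def ha_rnd_init_sample(self, fraction, seed=None):
        self._fraction = fraction
        # A REPEATABLE-style seed makes the sample deterministic
        self._rng = random.Random(seed)
        self._it = iter(self._rows)

    def ha_rnd_sample(self):
        """Return the next sampled row, or None at end of data."""
        while (row := self.rnd_next(self._it)) is not None:
            if self._rng.random() < self._fraction:
                return row
        return None

h = SampleHandler(range(10_000))
h.ha_rnd_init_sample(0.1, seed=1)
sampled = []
while (r := h.ha_rnd_sample()) is not None:
    sampled.append(r)
```

Keeping the fallback inside the handler class means the optimizer only ever sees the sampling access method, which fits the JT_SAMPLE idea above.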
|