[MDEV-23704] Optimizer does not use semijoin optimization for some WHERE (pk1, pk2, pk3) IN ((1,2,3), ...) queries Created: 2020-09-09 Updated: 2021-04-27 Resolved: 2020-10-07 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Affects Version/s: | 10.3.22 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Valerii Kravchuk | Assignee: | Sergei Petrunia |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | in-to-subquery | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Description |
|
For query like this:
with long enough list of tuples for multi-column PK values there are 3 possible plans presented by quotes from ANALYZE FORMAT=JSON outputs): 1. Range scan, can be forced in some cases with FORCE INDEX(PRIMARY):
2. For the same IN list in other environment we end up with full table scan that is very slow:
3. Finally in other environment with very similar data for the same IN list we end up with semijoin optimization applied:
This plan is the fastest even comparing to range join, but there is no way to force it. So, I think there is a bug when semijoin optimization is not used even when "range" is extended to full table scan as an alternative. I'd also like to have a way to force the best plan. |
| Comments |
| Comment by Sergei Petrunia [ 2020-09-18 ] | |||||||||||||||||||
|
As far as I understand, the rewrite to semi-join depends purely on # of elements in the IN-list? | |||||||||||||||||||
| Comment by Valerii Kravchuk [ 2020-09-18 ] | |||||||||||||||||||
|
As we have the same query getting different plans oin different environments, it seems something else also matters, not just number of items in the IN() list. | |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-09-18 ] | |||||||||||||||||||
|
Ok, debugging when In->Subquery conversion is done and when it is not. The initial candidates are selected in Item_func_in::to_be_transformed_into_in_subq The condition in that function is that the subselect must have
in order to be considered a candidate. Then, Item_func_in::in_predicate_to_in_subs_transformer does the conversion. | |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-09-18 ] | |||||||||||||||||||
|
... except when there is a collation mismatch.
| |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-10-01 ] | |||||||||||||||||||
It seems, the difference between those was that one of them was 10.3.18 and the other 10.3.24. The latter included the fix for
| |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-10-01 ] | |||||||||||||||||||
|
| |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-10-01 ] | |||||||||||||||||||
|
Ok, so conversion to IN-subquery didn't work. But why does the optimizer use full table scan? Consider this case:
All indexes are essential.
produces (on 10.3.24):
This happens, because the range optimizer reaches SEL_ARG::MAX_SEL_ARGS (=16000) limitation and aborts. A naive logic: 1000-elements IN-list should produce ~3K SEL_ARG objects for the PK and ~1K objects for the 3 other indexes, which gives in total 3+1=4K objects. | |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-10-01 ] | |||||||||||||||||||
|
Testcase demonstrating the above: a2.test | |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-10-07 ] | |||||||||||||||||||
|
Re-trying with the customer's table definition (which has multiple indexes). Number of SEL_ARGS alloced (note this is as counted in PARAM::alloced_sel_args, so it less than the actual number of objects allocated).
It's high but it's linear. | |||||||||||||||||||
| Comment by Sergei Petrunia [ 2020-10-07 ] | |||||||||||||||||||
|
This is an optimizer deficiency but I won't consider this as a bug. MariaDB's answer to very large IN-lists is IN-to-subquery conversion. That conversion is currently limited in datatype support (as described in the linked MDEVs). This can and will be fixed as part of those MDEVs. |