[MDEV-26090] where CONCAT(column) = 'xxx' is much faster than without concat condition Created: 2021-07-04 Updated: 2023-11-28 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Affects Version/s: | 10.5.9, 10.2, 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.11 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Allen Lee (Inactive) | Assignee: | Sergei Petrunia |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | optimizer | ||
| Attachments: |
|
| Description |
|
Customer shared the interesting test result and was wondering why where CONCAT(col) is faster without CONCAT. I've attached test schema and data.
With initial query and index, it was extremely slow. So, I proposed change index of `m` table to use primary key
Then, it got much faster compared to 1st one. However, customer noted that query got even faster with concat on where clause.
what would be the logical explanation on this dramatic query performance improvement? |
| Comments |
| Comment by Sergei Petrunia [ 2021-08-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Examining the dataset and the query: The table F has 100K rows. It has a restriction
which has a good index and it matches ~250 rows. The table M has 250K rows. It has a restriction:
which has a suitable index and it matches 21K rows. The join condition is a non-equality one:
which prevents use of ref access. There is an index on M.t_start. This looks suitable for construction of Range-Checked-for-each-record plans. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-08-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The first query plan:
Just takes the best ways to read the tables individually and then computes a cross-join between the result using a nested-loops join algorithm. Takes 27 sec for me. An interesting thing to observe: use of ref access prevents use of Join buffering (Enabling MRR and setting join_cache_level=6 will not cause BKA to be used). One can force join buffering to be used by changing ref into "range" . In order to do this, I replace M.station = 'LFBO' with (M.station='NO-STATION' or M.station = 'LFBO').
7.3 seconds. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-08-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The second query plan:
This basically achieves a similar thing to what I have achieved with adding "M.station='NO-STATION'": it enables the use of join buffering. Takes 6.2 seconds for me | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-08-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The 3rd query plan: replace M.station='LFBO' with CONCAT(M.station)='LFBO' and we get:
The idea is: the optimizer does NOT recognize concat(M.station)='LFBO' as a possible usable expression. As explain shows, the optimizer constructs RC-FER query plan.
The query took 82 milliseconds. The execution is basically:
and ANALYZE output shows the index "PRIMARY" was good for use every single time:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-08-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Can the optimizer be fixed to handle this? The problem is that in general case it is very hard to predict the performance of Range-Checked-For-Each-Record plan. Consider a basic case:
It is hard to tell whether "t2.key<t1.col" will be selective condition or not. One can observe that a BETWEEN condition in form like this query uses:
is a quite common special case. It can be proven selective if a [certain kind of?] histogram is available (like one we'll get when | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-08-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
allen.lee@mariadb.com , here are the takeaways: Please find detailed explanations about each query performance above. In particular, using CONCAT(column) makes the optimizer unable to use a certain query plan, after which it tries a "last resort" query plan of Range-Checked-For-Each-Record which turns out to be very good for this dataset and query. I see a way to fix this, but it will only be possible after MariaDB 10.7 and will be a new optimizer feature (NRE anyone?) At the moment, we can only suggest to use the CONCAT(column) or similar workarounds. |