[MDEV-15777] Use inferred IS NOT NULL predicates in the range optimizer Created: 2018-04-04  Updated: 2020-08-25  Resolved: 2019-09-03

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 10.5.0

Type: Task Priority: Critical
Reporter: Sergei Petrunia Assignee: Igor Babaev
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mdev15777-testcase.sql    
Issue Links:
Relates
relates to MDEV-19424 InnoDB's records_in_range estimates a... Open
relates to MDEV-19567 Print null-rejecting conditions in th... Open
relates to MDEV-16214 Incorrect plan taken by the optimizer... Closed
relates to MDEV-19518 Introduce an optimizer_switch flag fo... Open

 Description   

Consider an example (inspired by queries generated by Hibernate):

SELECT * 
FROM 
  t1
WHERE 
  t1.subset_id IN (SELECT t1_subsets.id
                   FROM t1_subsets)

The subquery is converted into an inner join (because t1_subsets.id is a primary key).
ANALYZE output:

+------+-------------+------------+--------+---------------+---------+---------+------------------+-------+----------+----------+------------+--------------------------+
| id   | select_type | table      | type   | possible_keys | key     | key_len | ref              | rows  | r_rows   | filtered | r_filtered | Extra                    |
+------+-------------+------------+--------+---------------+---------+---------+------------------+-------+----------+----------+------------+--------------------------+
|    1 | PRIMARY     | t1         | ALL    | t1_subset_id  | NULL    | NULL    | NULL             | 50396 | 49999.00 |   100.00 |       0.00 | Using where              |
|    1 | PRIMARY     | t1_subsets | eq_ref | PRIMARY       | PRIMARY | 4       | j22.t1.subset_id |     1 |     NULL |   100.00 |       NULL | Using where; Using index |
+------+-------------+------------+--------+---------------+---------+---------+------------------+-------+----------+----------+------------+--------------------------+

t1 has only a few rows which have a non-NULL value for t1.subset_id.
This is visible in r_filtered column: Early NULLs filtering optimization adds a NOT NULL condition which filters out rows with NULLs.

A piece of ANALYZE FORMAT=JSON output to confirm this:

    "table": {
      "table_name": "t1",
      "access_type": "ALL",
      "possible_keys": ["t1_subset_id"],
      "r_loops": 1,
      "rows": 50396,
      "r_rows": 49999,
      "r_total_time_ms": 290.22,
      "filtered": 100,
      "r_filtered": 0,
      "attached_condition": "t1.subset_id is not null"
    },

However, range optimizer is not able to make use of "Early NULLs filtering". We could have used t1_subset_id index to construct range access, but we dont



 Comments   
Comment by Sergei Petrunia [ 2018-04-04 ]

Basic ideas how to solve this:

In range optimizer:

for each key part $kp
{
  find in the keyuse array elements that describe "$table.$kp= expr"
  {
    if (the keyuse element describes a NULL-rejecting equality)
    {
      $tree= construct a SEL_TREE representing "$kp IS NOT NULL".
      sel_tree = tree_and(sel_tree, $tree);
    }
  }
}

Other considerations:
We should somehow get table->const_keys to include the index (Just merge table->keys bitmap into table->const_keys?)

Comment by Julien Fritsch [ 2018-04-06 ]

Is it a bug or new feature request ?

Comment by Varun Gupta (Inactive) [ 2018-04-10 ]

julien.fritsch It is a feature request.

Comment by Varun Gupta (Inactive) [ 2018-04-18 ]

The EARLY FILTERING in range access is a bit different from ref/eq_ref access

For ref access
If we have a condition as tbl1.key.keypart= tbl2.other_field, then if the tbl2.other_field is
NULLABLE then we add the condition that tbl2.other_field IS NOT NULL.

*For range access *
If we have a condition as tbl1.key.keypart= tbl2.other_field then we check if tbl1.key.keypart is
NULLABLE and then add the condition that tbl1.key.keypart IS NOT NULL

So for ref access we check the right hand side of the equality while for the range access
we check the left hand side of the equality

Comment by Varun Gupta (Inactive) [ 2018-04-20 ]

First attempt to the patch here
http://lists.askmonty.org/pipermail/commits/2018-April/012422.html

Comment by Varun Gupta (Inactive) [ 2018-05-15 ]

The latest patch is in the 10.3-mdev15777 branch

Comment by Varun Gupta (Inactive) [ 2018-05-15 ]

Here is the new patch
http://lists.askmonty.org/pipermail/commits/2018-May/012545.html

Comment by Sergei Petrunia [ 2019-05-06 ]

Review feedback sent: https://lists.launchpad.net/maria-developers/msg11825.html

Comment by Varun Gupta (Inactive) [ 2019-05-08 ]

A problem that psergey caught while doing the review is adding duplicate NOT NULL conditions for a particular column

The query is:

select * from t1 ,t2  where t1.a= t2.a;

the table structure is:

create table t1 ( a int , b int , c int, key key1(a), key key2(a,b));

so we have 2 keys that have columns a as a keypart, so both try to add a NOT NULL cond and that is why we end up with this duplication.

(lldb) p (char*)dbug_print_item(cond)
(char *) $1 = 0x00000001046131a0 "t1.a is not null and t1.a is not null"

Comment by Sergei Petrunia [ 2019-05-09 ]

The current patch adds NOT NULL predicates unconditionally. This might have un-intended consequences. Check out MDEV-19424: InnoDB is known for returning 50% estimates for ranges that span >50% of the table. So,

  • this patch adds NOT NULL conditions to use by the range optimizer.
  • innodb returns 50% selectivity for them
  • the optimizer starts taking it into account.

might have a negative effect.

One way to work around this (was discussed on some optimizer call and discarded) : only inject NOT NULL conditions for which EITS data says that they have selectivity less than 50%. (This is much easier to do than to fix MDEV-19424).

Comment by Sergei Petrunia [ 2019-05-09 ]

Takeaway from yesterday discussion with Varun:

  • There is no convenient way to show this optimization in EXPLAIN or ANALYZE output.
  • It should be printed into Optimizer Trace in 10.4
    • And into debug trace for versions before the 10.4
Comment by Varun Gupta (Inactive) [ 2019-05-14 ]

Explaining the problem with estimates which I saw while running the main suite

Basic information

CREATE INDEX groups_dt ON t3(domain, type);
 
MariaDB [test]> select count(*) from t3  where domain='queue';
+----------+
| count(*) |
+----------+
|       12 |
+----------+
1 row in set (0.003 sec)

Lets break into 2 case

Case 1: Without NULL rejecting conditions added for range analysis

MariaDB [test]> explain
    -> SELECT STRAIGHT_JOIN g.id FROM t2 a, t3 g USE INDEX(groups_dt)
    -> WHERE g.domain = 'queue' AND g.type = a.type;
+------+-------------+-------+------+---------------+-----------+---------+-------------------+------+-----------------------+
| id   | select_type | table | type | possible_keys | key       | key_len | ref               | rows | Extra                 |
+------+-------------+-------+------+---------------+-----------+---------+-------------------+------+-----------------------+
|    1 | SIMPLE      | a     | ALL  | NULL          | NULL      | NULL    | NULL              |    2 | Using where           |
|    1 | SIMPLE      | g     | ref  | groups_dt     | groups_dt | 70      | const,test.a.type |  335 | Using index condition |
+------+-------------+-------+------+---------------+-----------+---------+-------------------+------+-----------------------+

Range created

quick range select, key groups_dt, length: 35
queue <= X <= queue

Case 2:

MariaDB [test]> explain
    -> SELECT STRAIGHT_JOIN g.id FROM t2 a, t3 g USE INDEX(groups_dt)
    -> WHERE g.domain = 'queue' AND g.type = a.type;
+------+-------------+-------+------+---------------+-----------+---------+-------------------+------+-----------------------+
| id   | select_type | table | type | possible_keys | key       | key_len | ref               | rows | Extra                 |
+------+-------------+-------+------+---------------+-----------+---------+-------------------+------+-----------------------+
|    1 | SIMPLE      | a     | ALL  | NULL          | NULL      | NULL    | NULL              | 2    | Using where           |
|    1 | SIMPLE      | g     | ref  | groups_dt     | groups_dt | 70      | const,test.a.type | 13   | Using index condition |
+------+-------------+-------+------+---------------+-----------+---------+-------------------+------+-----------------------+

Range created

quick range select, key groups_dt, length: 70
queue/NULL < X <= queue

Comment by Varun Gupta (Inactive) [ 2019-05-14 ]

The main concern here is that we see the estimates for rows for table 'g' is 13 in case (1)
and 335 in case (2)

The reason for this is that with NULL rejecting conditions the range is extended for key(groups_dt) and now
includes two keyparts as shown above in the ranges

There is a code which tries to use Range estimates for Ref access, so the idea is if we have a range that is a prefix of the ref access and the estimates for range access is lower than that of ref access, then use
the estimates of range access.

So for cases like above we have the where clause as

 g.domain = 'queue' AND g.type = a.type;

Before the null-rejecting part the range that was created was only for 1 keypart(domain) and
then we were able to reuse the range estimates for ref access estimates

With the null rejecting conditions added the range no more remains the prefix of the ref access and
we cannot reuse these range estimates

Comment by Varun Gupta (Inactive) [ 2019-05-15 ]

On the optimizer call it was decided to solve the problem above (that is reuse range estimates when range is a prefix of ref access and the estimates for range is less than the estimates for ref access) by creating NULL rejecting conditions only for the first key part of the index.

Also it was decided to introduce an optimizer_swith_flag for this optimization.

Comment by Sergei Petrunia [ 2019-05-22 ]

Provided review input for the latest patch: https://lists.launchpad.net/maria-developers/msg11841.html

Comment by Igor Babaev [ 2019-05-23 ]

The task can be formulated in a very simple way:
1. for each field f participating in the WHERE condition C and in any index infer the predicate IS NOT NULL f.
2. Use the inferred IS NOT NULL predicates to look for range scans and use them in the optimizer when searching for the optimizer plan.

(1) can be resolved by finding all null rejecting fields in the WHERE condition (this is similar to how we find null rejecting tables) and choosing only those fields that participate in indexes.
(2) is already resolved in the patch.

'Early' null filtering is applied to the keys used for equi-joins, while in this task we need keys used for range access.

Comment by Varun Gupta (Inactive) [ 2019-07-03 ]

The patch
http://lists.askmonty.org/pipermail/commits/2019-June/013872.html

Comment by Igor Babaev [ 2019-09-03 ]

This task was pushed into 10.5.

Comment by Varun Gupta (Inactive) [ 2019-09-12 ]

Here is the commit message for the patch pushed to 10.5

commit 9380850d874c77656d0c42cfa11bf0d187064849
Author: Igor Babaev <igor@askmonty.org>
Date:   Fri Aug 30 18:47:14 2019 -0700
 
    MDEV-15777 Use inferred IS NOT NULL predicates in the range optimizer
    
    This patch introduces the optimization that allows range optimizer to
    consider index range scans that are built employing NOT NULL predicates
    inferred from WHERE conditions and ON expressions.
    The patch adds a new optimizer switch not_null_range_scan.

th patch adds a new optimizer_swtich not_null_range_scan, this is currently turned off.

Generated at Thu Feb 08 08:23:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.