Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-19424

InnoDB's records_in_range estimates are capped at about 50%

Details

    Description

      InnoDB's records_in_range estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

      If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

      Test dataset:

      create table ten(a int);
      insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
       
      create table one_k(a int);
      insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
       
      create table t1 (
        a int,
        b varchar(32),
        c int,
        key(a),
        key(b)
      ) engine=innodb;
      

      # 100K NULLs
      insert into t1 select
        null,
        null,
        A.a + 1000*B.a 
      from
        one_k A,
        one_k B
      where 
        B.a<100;
       
      # 900 K non-NULLs
      insert into t1 select
        A.a + 1000*B.a,
        A.a + 1000*B.a,
        A.a + 1000*B.a 
      from
        one_k A,
        one_k B
      where 
        B.a>=100;
      

      Now, both a IS NOT NULL or b IS NOT NULL match 900K rows (90% of the table).
      But EXPLAIN will show the estimates of about 500K rows, which is 50% of the table:

      explain select * from t1 force index (a) where a is not null ;
      +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                 |
      +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
      |    1 | SIMPLE      | t1    | range | a             | a    | 5       | NULL | 494308 | Using index condition |
      +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
      

      explain select * from t1 force index (b) where b is not null ;
      +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                 |
      +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
      |    1 | SIMPLE      | t1    | range | b             | b    | 35      | NULL | 494308 | Using index condition |
      +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
      

      When the optimizer was simple, this property was not a problem.

      A simple optimizer would only use range estimates to construct range access. Range access is cheaper than full table if it covers about 30% of the table. Returning 50% of the table instead of 90% was not an issue.

      A more advanced optimizer also attempts to use range estimates for condition selectivity, etc. Here, returning 50% selectivity instead of 90% is a problem. (One must take into account that selectivity is computed for multiple indexes. For example, for 5 indexes 0.5^2= 1/32 . 32x under-estimation of selectivity)

      Attachments

        Issue Links

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            Field Original Value New Value
            Description InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

            If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

            Test dataset:
            {code:sql}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

            create table t1 (
              a int,
              b varchar(32),
              c int,
              key(a),
              key(b)
            ) engine=innodb;
            {code}

            {code:sql}
            # 100K NULLs
            insert into t1 select
              null,
              null,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a<100;

            # 900 K non-NULLs
            insert into t1 select
              A.a + 1000*B.a,
              A.a + 1000*B.a,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a>=100;
            {code}

            {code}
            explain select * from t1 force index (a) where a is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            {code}
            explain select * from t1 force index (b) where b is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}
            InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

            If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

            Test dataset:
            {code:sql}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

            create table t1 (
              a int,
              b varchar(32),
              c int,
              key(a),
              key(b)
            ) engine=innodb;
            {code}

            {code:sql}
            # 100K NULLs
            insert into t1 select
              null,
              null,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a<100;

            # 900 K non-NULLs
            insert into t1 select
              A.a + 1000*B.a,
              A.a + 1000*B.a,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a>=100;
            {code}

            Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
            But EXPLAIN will show the estimates of about 50% of the table:

            {code}
            explain select * from t1 force index (a) where a is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            {code}
            explain select * from t1 force index (b) where b is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}
            psergei Sergei Petrunia made changes -
            Description InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

            If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

            Test dataset:
            {code:sql}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

            create table t1 (
              a int,
              b varchar(32),
              c int,
              key(a),
              key(b)
            ) engine=innodb;
            {code}

            {code:sql}
            # 100K NULLs
            insert into t1 select
              null,
              null,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a<100;

            # 900 K non-NULLs
            insert into t1 select
              A.a + 1000*B.a,
              A.a + 1000*B.a,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a>=100;
            {code}

            Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
            But EXPLAIN will show the estimates of about 50% of the table:

            {code}
            explain select * from t1 force index (a) where a is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            {code}
            explain select * from t1 force index (b) where b is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}
            InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

            If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

            Test dataset:
            {code:sql}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

            create table t1 (
              a int,
              b varchar(32),
              c int,
              key(a),
              key(b)
            ) engine=innodb;
            {code}

            {code:sql}
            # 100K NULLs
            insert into t1 select
              null,
              null,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a<100;

            # 900 K non-NULLs
            insert into t1 select
              A.a + 1000*B.a,
              A.a + 1000*B.a,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a>=100;
            {code}

            Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
            But EXPLAIN will show the estimates of about 50% of the table:

            {code}
            explain select * from t1 force index (a) where a is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            {code}
            explain select * from t1 force index (b) where b is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            When the optimizer was simple, this property was not a problem.

            A MySQL-5.0 era optimizer
            psergei Sergei Petrunia made changes -
            Description InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

            If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

            Test dataset:
            {code:sql}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

            create table t1 (
              a int,
              b varchar(32),
              c int,
              key(a),
              key(b)
            ) engine=innodb;
            {code}

            {code:sql}
            # 100K NULLs
            insert into t1 select
              null,
              null,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a<100;

            # 900 K non-NULLs
            insert into t1 select
              A.a + 1000*B.a,
              A.a + 1000*B.a,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a>=100;
            {code}

            Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
            But EXPLAIN will show the estimates of about 50% of the table:

            {code}
            explain select * from t1 force index (a) where a is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            {code}
            explain select * from t1 force index (b) where b is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            When the optimizer was simple, this property was not a problem.

            A MySQL-5.0 era optimizer
            InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

            If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

            Test dataset:
            {code:sql}
            create table ten(a int);
            insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

            create table one_k(a int);
            insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

            create table t1 (
              a int,
              b varchar(32),
              c int,
              key(a),
              key(b)
            ) engine=innodb;
            {code}

            {code:sql}
            # 100K NULLs
            insert into t1 select
              null,
              null,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a<100;

            # 900 K non-NULLs
            insert into t1 select
              A.a + 1000*B.a,
              A.a + 1000*B.a,
              A.a + 1000*B.a
            from
              one_k A,
              one_k B
            where
              B.a>=100;
            {code}

            Now, both {{a IS NOT NULL}} or {{b IS NOT NULL}} match 900K rows (90% of the table).
            But EXPLAIN will show the estimates of about 500K rows, which is 50% of the table:

            {code}
            explain select * from t1 force index (a) where a is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            {code}
            explain select * from t1 force index (b) where b is not null ;
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            | 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
            +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
            {code}

            When the optimizer was simple, this property was not a problem.

            A simple optimizer would only use range estimates to construct range access. Range access is cheaper than full table if it covers about 30% of the table. Returning 50% of the table instead of 90% was not an issue.

            A more advanced optimizer also attempts to use range estimates for condition selectivity, etc. Here, returning 50% selectivity instead of 90% is a problem. (One must take into account that selectivity is computed for multiple indexes. For example, for 5 indexes 0.5^2= 1/32 . 32x under-estimation of selectivity)
            psergei Sergei Petrunia made changes -
            sachin.setiya.007 Sachin Setiya (Inactive) made changes -
            Assignee Sachin Setiya [ sachin.setiya.007 ]
            elenst Elena Stepanova made changes -
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Elkin Andrei Elkin made changes -
            Elkin Andrei Elkin made changes -
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Labels records_in_range
            psergei Sergei Petrunia made changes -
            marko Marko Mäkelä made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.1 [ 16100 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 96631 ] MariaDB v4 [ 141284 ]
            marko Marko Mäkelä made changes -
            Assignee Sachin Setiya [ sachin.setiya.007 ] Sergei Petrunia [ psergey ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.2 [ 14601 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.3 [ 22126 ]
            psergei Sergei Petrunia made changes -
            Fix Version/s 11.0.1 [ 28548 ]
            Fix Version/s 10.4(EOL) [ 22408 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.