Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4362

Selectivity estimates for IN (...) do not depend on whether the values are in range

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • None
    • 10.0.10
    • None

    Description

      Create the dataset:

      create table t5 (col1 int);
      set @a=-1;
      create table one_k (a int)  select (@a:=@a+1) as a from information_schema.session_variables A, information_schema.session_variables B limit 1000;
      insert into t5 select A.a from one_k A, one_k B where A.a < 100 and B.a < 100;
      set histogram_size=100;
      analyze table t5 persistent for all;
      select *, hex(histogram) from mysql.column_stats where table_name='t5'\G
      *************************** 1. row ***************************
             db_name: j10
          table_name: t5
         column_name: col1
           min_value: 0
           max_value: 99
         nulls_ratio: 0.0000
          avg_length: 4.0000
       avg_frequency: 100.0000
           hist_size: 100
           hist_type: 
           histogram:  (100 bytes here)

      Ok, so we've got a table with 100 rows of 0, 100 rows of 1, and so forth up to 99.

      Let's see how estimating works:

      MariaDB [j10]> explain extended select * from t5 where col1 in (1,2,3);
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      1 row in set, 1 warning (10.64 sec)

      The real selectivity is 3%, we've got 3.79. Good.

      Now, let's try values that are certainly not in the table:

      MariaDB [j10]> explain extended select * from t5 where col1 in (-1,-2,-3);
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     3.79 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

      OOps, again 3.79%.

      Let's see what non-equality range shows:

      MariaDB [j10]> explain extended select * from t5 where col1<=-1;
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t5    | ALL  | NULL          | NULL | NULL    | NULL | 10000 |     0.99 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+

      1%. It's better.

      I consider selecitivity obtained for "where col1 in (-1,-2,-3)" to be a bug.

      Attachments

        Issue Links

          Activity

            That is, the issue from comment #2 where filtered=100% suddenly, is gone.

            psergei Sergei Petrunia added a comment - That is, the issue from comment #2 where filtered=100% suddenly, is gone.

            pomyk, I was leaning towards that idea also.. but then I questioned the selectivity formula, tried to come up with my own formula, and, surprise, it didn't have anything resembling the "width" that the old formula used.

            psergei Sergei Petrunia added a comment - pomyk , I was leaning towards that idea also.. but then I questioned the selectivity formula, tried to come up with my own formula, and, surprise, it didn't have anything resembling the "width" that the old formula used.

            So, we still have col1 =

            {value_less_than_tables_minimum}

            return the same estimate as col1=

            { value_within_range_of_table_values}

            . (The data distribution is uniform).

            This might be not as bad as it sounds, because minimum and maximum are statistics. The actual table can have values which are greater than the maximum value we've had when collecting statistics (or, correspondingly, less than the minimum... ).

            I have also discovered a problem with very skewed distributions, MDEV-5950. My opinion is that estimates that are produced for this MDEV are satisfactory, while estimates that are produced for MDEV-5950 are not. However, when we fix MDEV-5950, we might fix this one also.

            psergei Sergei Petrunia added a comment - So, we still have col1 = {value_less_than_tables_minimum} return the same estimate as col1= { value_within_range_of_table_values} . (The data distribution is uniform). This might be not as bad as it sounds, because minimum and maximum are statistics . The actual table can have values which are greater than the maximum value we've had when collecting statistics (or, correspondingly, less than the minimum... ). I have also discovered a problem with very skewed distributions, MDEV-5950 . My opinion is that estimates that are produced for this MDEV are satisfactory, while estimates that are produced for MDEV-5950 are not. However, when we fix MDEV-5950 , we might fix this one also.

            With Fix for MDEV-5926, "Attempt#2", the division-by-zero problem started to appear again (it fails an assert that I've added specifically to catch this).

            Developed a fix for it. getting value_pos=0 (or 1) and the first (or the last) bucket of histogram having zero width is a special case.

            psergei Sergei Petrunia added a comment - With Fix for MDEV-5926 , "Attempt#2", the division-by-zero problem started to appear again (it fails an assert that I've added specifically to catch this). Developed a fix for it. getting value_pos=0 (or 1) and the first (or the last) bucket of histogram having zero width is a special case.

            Pushed the second variant of the fix for "division by zero" problem. Pushed a testcase.

            psergei Sergei Petrunia added a comment - Pushed the second variant of the fix for "division by zero" problem. Pushed a testcase.

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.