MariaDB Server / MDEV-31067

selectivity_from_histogram >1.0 for col=const on DOUBLE_PREC_HB histogram

Details

    Description

      This is a public part of TODO-3823.

      For certain data distributions, the histogram code can produce selectivity > 1.0 for a column=const equality on a DOUBLE_PREC_HB histogram (the same can likely be achieved with SINGLE_PREC_HB; the point is that it is not a JSON_HB histogram).

      create table t10 (a int);
      -- Fill the table; the INSERT statements are provided below
      analyze table t10 persistent for all;
      

      set optimizer_trace=1;
      explain select * from t10  where a in (91303);
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t10   | ALL  | NULL          | NULL | NULL    | NULL | 99840 |   100.00 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      

      select json_detailed(json_extract(trace, '$**.selectivity_for_columns')) from information_schema.optimizer_trace\G
      *************************** 1. row ***************************
      json_detailed(json_extract(trace, '$**.selectivity_for_columns')): [
          [
              {
                  "column_name": "a",
                  "ranges": 
                  ["91303 <= a <= 91303"],
                  "selectivity_from_histogram": 18.28543534
              }
          ]
      ]
      

      INSERTs (synthetic data):

      create table bucket(a int);  # the row count of this table = rows per histogram bucket
      insert into bucket select 1 from seq_1_to_780;
       
      create table one_third_of_bucket(a int);  # one-third of a bucket
      insert into one_third_of_bucket select 1 from seq_1_to_260;
       
      create table t10 (a int);
      insert into t10 select 0 from bucket, seq_1_to_4;
       
      insert into t10 select 8693 from one_third_of_bucket;
      insert into t10 select 8694 from one_third_of_bucket;
      insert into t10 select 8695 from one_third_of_bucket;
       
       
      insert into t10 select 34783 from one_third_of_bucket;
      insert into t10 select 34784 from one_third_of_bucket;
      insert into t10 select 34785 from one_third_of_bucket;
       
       
      insert into t10 select 34785 from bucket, seq_1_to_8;
       
      insert into t10 select 65214 from one_third_of_bucket;
      insert into t10 select 65215 from one_third_of_bucket;
      insert into t10 select 65216 from one_third_of_bucket;
       
      insert into t10 select 65216 from bucket, seq_1_to_52;
       
      insert into t10 select 65217 from one_third_of_bucket;
      insert into t10 select 65218 from one_third_of_bucket;
      insert into t10 select 65219 from one_third_of_bucket;
       
      insert into t10 select 65219 from bucket;
       
       
      insert into t10 select 73913 from one_third_of_bucket;
      insert into t10 select 73914 from one_third_of_bucket;
      insert into t10 select 73915 from one_third_of_bucket;
       
      insert into t10 select 73915 from bucket, seq_1_to_40;
       
       
      insert into t10 select 78257 from one_third_of_bucket;
      insert into t10 select 78258 from one_third_of_bucket;
      insert into t10 select 78259 from one_third_of_bucket;
       
      insert into t10 select 91300 from one_third_of_bucket;
      insert into t10 select 91301 from one_third_of_bucket;
      insert into t10 select 91302 from one_third_of_bucket;
       
      insert into t10 select 91302 from bucket, seq_1_to_6;
       
      insert into t10 select 91303 from one_third_of_bucket;
      insert into t10 select 91304 from one_third_of_bucket;
      insert into t10 select 91305 from one_third_of_bucket;
       
      insert into t10 select 91305 from bucket, seq_1_to_8;
       
      insert into t10 select  99998 from one_third_of_bucket;
      insert into t10 select  99999 from one_third_of_bucket;
      insert into t10 select 100000 from one_third_of_bucket;
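For reference, the true selectivity of a=91303 in this dataset can be checked outside the server. The following is a minimal Python sketch (not server code) that reconstructs the same multiset as the INSERTs above and computes the exact fraction of matching rows:

```python
from collections import Counter

BUCKET = 780   # rows contributed by the `bucket` table
THIRD = 260    # rows contributed by `one_third_of_bucket`

rows = []
rows += [0] * (BUCKET * 4)
for v in (8693, 8694, 8695):
    rows += [v] * THIRD
for v in (34783, 34784, 34785):
    rows += [v] * THIRD
rows += [34785] * (BUCKET * 8)
for v in (65214, 65215, 65216):
    rows += [v] * THIRD
rows += [65216] * (BUCKET * 52)
for v in (65217, 65218, 65219):
    rows += [v] * THIRD
rows += [65219] * BUCKET
for v in (73913, 73914, 73915):
    rows += [v] * THIRD
rows += [73915] * (BUCKET * 40)
for v in (78257, 78258, 78259):
    rows += [v] * THIRD
for v in (91300, 91301, 91302):
    rows += [v] * THIRD
rows += [91302] * (BUCKET * 6)
for v in (91303, 91304, 91305):
    rows += [v] * THIRD
rows += [91305] * (BUCKET * 8)
for v in (99998, 99999, 100000):
    rows += [v] * THIRD

counts = Counter(rows)
total = len(rows)
true_sel = counts[91303] / total
print(total)               # 99840, matching the EXPLAIN row estimate
print(round(true_sel, 6))  # 0.002604
```

So the correct selectivity is roughly 0.26%, while the histogram reports 18.28 (i.e. 1828%), off by a factor of about 7000.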
      

      Activity

            igor Igor Babaev (Inactive) added a comment:

            psergei Sergei,
            1. This is a bug in the 10.0 code, so the fix should be applied starting from 10.4.
            2. Let's call a value frequently used if its position in the histogram coincides with the beginning of two adjacent buckets. Whether the probed value is frequently used can be checked easily. The number of buckets whose starting point coincides with the position of the value gives a good lower bound for the selectivity of such a value. For values that are not frequently used, we should calculate the average frequency over those values only, and it will be less than the average frequency over all values.
            Here's an example:
            100 rows with 7 values and 5 buckets of 20 rows each; one value occurs in 40 rows and is the only frequently used value, while each remaining value occurs in 10 rows. The average frequency is 100/7 ≈ 14.3 rows, so the average selectivity is 14.3%, which is greater than the 10% expected for the non-frequently-used values.
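The arithmetic in this example can be spelled out explicitly (a quick sketch, not server code):

```python
total_rows = 100
n_values = 7
popular_rows = 40                                  # the one frequently used value
other_rows = (total_rows - popular_rows) // (n_values - 1)  # 10 rows per remaining value

avg_sel = 1 / n_values                 # average selectivity over all values
nonpopular_sel = other_rows / total_rows  # what we actually want for the non-popular ones

print(round(avg_sel, 3))   # 0.143, i.e. ~14.3%
print(nonpopular_sel)      # 0.1, i.e. 10%
```

Using the all-values average thus overestimates every non-popular value by more than 40% in this distribution.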

            psergei Sergei Petrunia added a comment: https://github.com/MariaDB/server/commits/bb-10.6-mdev31067-variant3 . igor, please review.

            igor Igor Babaev (Inactive) added a comment:

            In my opinion, the function Histogram::point_selectivity() that the patch has touched still requires some improvements.

            psergei Sergei Petrunia added a comment:

            igor,

            Yet I really mind the code that calculates selectivity for popular values: it looks too complicated and your comment does not help to understand it.

            How is this part of this MDEV? Neither the testcase for this MDEV nor any of the fixes touch that part of the code.


            serg Sergei Golubchik added a comment:

            ok to push variant 2: commit https://github.com/MariaDB/server/commit/a87c17ab566def817da2f79fb7badb7318a1e311

            It is a conservative upper-bound estimate, and it delivers lower estimates than the complex popular-value magic of variant 3. Thus I see no reason to use the complex approach when the simple and safe approach produces better estimates.
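A minimal sketch of the upper-bound idea (illustrative only; the hypothetical helper below is not the server's Histogram::point_selectivity()): in an equi-height histogram each of the N buckets covers roughly 1/N of the rows, so a value contained in k buckets cannot match more than k/N of the table, and the raw estimate can be clamped accordingly.

```python
def clamp_point_selectivity(raw_estimate, buckets_containing_value, total_buckets):
    # Hypothetical illustration: each of the N equal-height buckets holds
    # ~1/N of the rows, so a value found in k buckets matches at most k/N
    # of the table; also never report more than 100% of the rows.
    upper_bound = buckets_containing_value / total_buckets
    return min(raw_estimate, upper_bound, 1.0)

# With made-up numbers resembling the repro: a raw estimate of 18.28
# for a value that fits inside one of 128 buckets gets clamped to 1/128.
print(clamp_point_selectivity(18.28543534, 1, 128))  # 0.0078125
```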


            People

              Assignee: psergei Sergei Petrunia
              Reporter: psergei Sergei Petrunia
              Votes: 1
              Watchers: 7

