MariaDB Server / MDEV-26886

Estimation for filtered rows less precise with JSON histogram compared to DOUBLE_PREC (#3)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: N/A
    • Fix Version/s: 10.8.0
    • Component/s: Optimizer
    • Labels: None

    Description

      --source include/have_sequence.inc
       
      # One third are zeros, the rest are unique values
      create or replace table t (a tinyint) as select if(seq%3,seq,0) as a from seq_1_to_100;
      select count(*) from t where a <= 0;
       
      set histogram_type = JSON_HB;
      analyze table t persistent for all;
      explain format=json select * from t where a <= 0;
       
      set histogram_type = DOUBLE_PREC_HB;
      analyze table t persistent for all;
      explain format=json select * from t where a <= 0;
       
      # Cleanup
      drop table t;
      

      Tested on preview-10.7-MDEV-26519-json-histograms, commit 508f5f3f:

      select count(*) from t where a <= 0;
      count(*)
      33
      

      JSON histogram

      {
        "query_block": {
          "select_id": 1,
          "table": {
            "table_name": "t",
            "access_type": "ALL",
            "rows": 100,
            "filtered": 1.470600009,
            "attached_condition": "t.a <= 0"
          }
        }
      }
      

      DOUBLE_PREC histogram

      {
        "query_block": {
          "select_id": 1,
          "table": {
            "table_name": "t",
            "access_type": "ALL",
            "rows": 100,
            "filtered": 32.8125,
            "attached_condition": "t.a <= 0"
          }
        }
      }
      

      So the actual result set contains 33 rows out of 100. DOUBLE_PREC gives filtered=32.8125, which is very close, while the JSON histogram gives filtered=1.47, which is far off.

      Reproducible with InnoDB and MyISAM alike.
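The expected estimate can be verified independently of the server. A minimal Python sketch (not MariaDB code) that reproduces the test case's data distribution and computes the ideal "filtered" percentage:

```python
# Reproduce the test data: for seq in 1..100, the column holds 0 when
# seq is divisible by 3, otherwise seq itself (one third zeros).
rows = [0 if seq % 3 == 0 else seq for seq in range(1, 101)]

matching = sum(1 for a in rows if a <= 0)   # rows satisfying a <= 0
selectivity = 100.0 * matching / len(rows)  # ideal "filtered" percentage

print(matching)     # 33
print(selectivity)  # 33.0
```

This is the 33% that DOUBLE_PREC_HB approximates well (32.8125) and the JSON histogram misses (1.47).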

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment:

            Raised to critical for procedural reasons. Feel free to demote and remove from "must-do" scope.

            psergei Sergei Petrunia added a comment:

            Histogram_size=68.

              "histogram_hb_v2": [
                {
                  "start": "0",
                  "size": 0.33,
                  "ndv": 1
                },
                {
                  "start": "1",
                  "size": 0.01,
                  "ndv": 1
                },
                {
                  "start": "2",
                  "size": 0.01,
                  "ndv": 1
                },
                ...
            

            The issue is that the condition

              ... WHERE a <= 0
            

            produces a very small estimate that does not include the first bucket.

            psergei Sergei Petrunia added a comment:

            The computations in Histogram_json_hb::range_selectivity go as follows:

            (gdb) print equal
              $1 = true
            (gdb) print inclusive_endp
              $2 = true
            (gdb) print idx
              $3 = 0
            

            Correct so far.

            Then, we get here:

                double sel= position_in_interval(field, max_key, max_key_len,
                                                 buckets[idx].start_value,
                                                 get_end_value(idx));
            

            In get_end_value(idx), it takes this path:

                return buckets[idx+1].start_value;
            

            and returns "\001". (ISSUE-1)

            Then, position_in_interval() is called with

            position_in_interval(key="\0", left="\0", right="\001")
            

            which returns 0.0 (this is correct).

            With sel=0.0, this formula

                max= left_fract + sel * (buckets[idx].cum_fract - left_fract);
            

            also returns 0.0.
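Plugging the debugger values from the walkthrough above into that formula shows the first bucket contributing nothing (illustrative Python, not the server code). The reported filtered=1.4706 is very close to 100/68, i.e. roughly one bucket's worth out of Histogram_size=68, which presumably comes from a lower bound applied elsewhere; that last point is my assumption, not something stated in the comments:

```python
# Values observed in gdb for WHERE a <= 0 on the first bucket:
left_fract = 0.0   # cumulative fraction before the first bucket
cum_fract  = 0.33  # cumulative fraction at the end of the first bucket
sel        = 0.0   # position_in_interval(key="\0", left="\0", right="\001")

# The formula from Histogram_json_hb::range_selectivity:
max_sel = left_fract + sel * (cum_fract - left_fract)
print(max_sel)  # 0.0 -- the 33%-of-rows bucket contributes nothing
```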

            psergei Sergei Petrunia added a comment:

            ... the problem is marked with ISSUE-1. The code assumes that the values in the first bucket are "spread" between 0 and 1. It ignores the fact that the bucket has ndv=1.
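One possible correction along the lines of that observation can be sketched as follows. This is my illustration of the idea, not the committed MariaDB fix: when a bucket holds a single distinct value (ndv == 1), an inclusive endpoint equal to that value should cover the entire bucket rather than be interpolated within it:

```python
def endpoint_fraction(key, bucket, next_start, inclusive):
    """Fraction of the bucket covered by the range endpoint `key`."""
    if bucket["ndv"] == 1:
        # All rows in the bucket share one value: the bucket is either
        # fully included or fully excluded, never interpolated.
        covered = key > bucket["start"] or (inclusive and key == bucket["start"])
        return 1.0 if covered else 0.0
    # Multiple distinct values: linear interpolation is reasonable.
    return (key - bucket["start"]) / (next_start - bucket["start"])

bucket = {"start": 0, "size": 0.33, "ndv": 1}
frac = endpoint_fraction(0, bucket, 1, inclusive=True)
print(frac)  # 1.0 -> the estimate becomes ~0.33, matching the real data
```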

            People

              psergei Sergei Petrunia
              elenst Elena Stepanova