Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-26849

JSON Histograms: point selectivity estimates are off for non-existent values

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.7
    • Fix Version/s: 10.7.1
    • Component/s: Optimizer
    • Labels:
      None

      Description

      Suppose the number of values in the dataset is less than the number of buckets in the histogram.

      Which value should be returned when producing estimates for col='non-existant-value' ?

      Currently, MariaDB's code produces the average. It uses "inclusion assumption" and assumes that the value that is looked up exists in the database. This is not what other databases do.

      The deficiency comes from the fact that when one looks at MariaDB's histograms as they were specified in MDEV-21130 and MDEV-26519, there is no obvious way to tell whether the histogram has an exhaustive list of all values encountered in the table or not.

      (Technically, if one walks through the whole histogram and only sees buckets with ndv=1, this will guarantee that the histogram bucket endpoints are all the values that were encountered... but I won't count this as "obvious").

      drop table if exists ten, one_k, t1
      create table ten(a int);
      insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
      create table one_k(a int);
      insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
      create table t1 ( col int);
      insert into t1 select 100*A.a+100 from ten A, one_k B;
      set histogram_type=json_hb;
      analyze table t1 persistent for all;
      

      Then run

      analyze select * from t1 where $WHERE
      

      and for various kinds of WHERE clause we get this:

      cond real mariadb mysql postgresql
      col=0 0 1000 1 1
      col=50 0 1000 1 1
      col=70 0 1000 1 1
      col=100 1000 1000 1000 1000
      col=150 0 1000 1 1
      col=200 1000 1000 1000 1000

      Apparently, MariaDB's assumption differs from that of other databases' and is worse...

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              psergei Sergei Petrunia
              Reporter:
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Git Integration