Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-26709

JSON histogram may contain bucketS than histogram_size allows

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • N/A
    • 10.7.1
    • Optimizer
    • None

    Description

      One of points from our earlier slack discussion was that there can not be more buckets in a JSON histogram than histogram_size. The below example proves the contrary.

      It is possible that the conclusion was wrong or too generic and there are allowed exceptions from the rule. These exception will need to be a part of histogram_size description in the KB. The description has to be adjust anyway, as now it says that histogram_size means the number of buckets for JSON histograms, and it's certainly not the case.

      --source include/have_sequence.inc
       
      create or replace table t (a int);
      insert into t values (1),(3),(5),(7);
      insert into t select 2 from seq_1_to_25;
      insert into t select 4 from seq_1_to_25;
      insert into t select 6 from seq_1_to_25;
       
      set histogram_size=4, histogram_type=JSON_HB;
      analyze table t persistent for all;
       
      select histogram from mysql.column_stats where table_name = 't';
       
      drop table t;
      

      preview-10.7-MDEV-26519-json-histograms da8bb4b470

      select histogram from mysql.column_stats where table_name = 't';
      histogram
      {
        "histogram_hb_v2": [
          {
            "start": "1",
            "size": 0.240506329,
            "ndv": 2
          },
          {
            "start": "2",
            "size": 0.240506329,
            "ndv": 3
          },
          {
            "start": "4",
            "size": 0.189873418,
            "ndv": 2
          },
          {
            "start": "6",
            "size": 0.316455696,
            "ndv": 1
          },
          {
            "start": "7",
            "end": "7",
            "size": 0.012658228,
            "ndv": 1
          }
        ]
      }
      

      Attachments

        Issue Links

          Activity

            Another case of an extra bucket. This one also shows that this extra bucket is of a "wrong" size.

            --source include/have_sequence.inc
             
            create or replace table t (a int);
            insert into t select seq from seq_10_to_90;
            insert into t select * from t;
            insert into t select * from t;
            insert into t select * from t;
             
            set histogram_type= JSON_HB, histogram_size= 5;
            analyze table t persistent for all;
             
            select histogram from mysql.column_stats where table_name in ('t');
             
            # Cleanup
            drop table t;
            

            select histogram from mysql.column_stats where table_name in ('t');
            histogram
            {
              "histogram_hb_v2": [
                {
                  "start": "10",
                  "size": 0.199074074,
                  "ndv": 17
                },
                {
                  "start": "26",
                  "size": 0.199074074,
                  "ndv": 17
                },
                {
                  "start": "42",
                  "size": 0.199074074,
                  "ndv": 17
                },
                {
                  "start": "58",
                  "size": 0.199074074,
                  "ndv": 17
                },
                {
                  "start": "74",
                  "size": 0.199074074,
                  "ndv": 17
                },
                {
                  "start": "90",
                  "end": "90",
                  "size": 0.00462963,
                  "ndv": 1
                }
              ]
            }
            

            elenst Elena Stepanova added a comment - Another case of an extra bucket. This one also shows that this extra bucket is of a "wrong" size. --source include/have_sequence.inc   create or replace table t (a int ); insert into t select seq from seq_10_to_90; insert into t select * from t; insert into t select * from t; insert into t select * from t;   set histogram_type= JSON_HB, histogram_size= 5; analyze table t persistent for all ;   select histogram from mysql.column_stats where table_name in ( 't' );   # Cleanup drop table t; select histogram from mysql.column_stats where table_name in ( 't' ); histogram { "histogram_hb_v2" : [ { "start" : "10" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "26" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "42" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "58" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "74" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "90" , "end" : "90" , "size" : 0.00462963, "ndv" : 1 } ] }
            psergei Sergei Petrunia added a comment - - edited

            ... no, it's not about "one more bucket". It can be way more buckets.

            Example:

            source mysql-test/include/world_schema.inc
            source mysql-test/include/world.inc
            set histogram_type='JSON_HB';
            set histogram_size=50;
            analyze table Country persistent for all;
             
            select JSON_LENGTH(histogram, '$.histogram_hb_v2') from mysql.column_stats where column_name='Code';
            +---------------------------------------------+
            | JSON_LENGTH(histogram, '$.histogram_hb_v2') |
            +---------------------------------------------+
            |                                          60 |
            +---------------------------------------------+
            1 row in set (0.00 sec)
            

            psergei Sergei Petrunia added a comment - - edited ... no, it's not about "one more bucket". It can be way more buckets. Example: source mysql-test/include/world_schema.inc source mysql-test/include/world.inc set histogram_type='JSON_HB'; set histogram_size=50; analyze table Country persistent for all;   select JSON_LENGTH(histogram, '$.histogram_hb_v2') from mysql.column_stats where column_name='Code'; +---------------------------------------------+ | JSON_LENGTH(histogram, '$.histogram_hb_v2') | +---------------------------------------------+ | 60 | +---------------------------------------------+ 1 row in set (0.00 sec)

            239 rows, 50 buckets...

            gives bucket_capacity=4.78 = 4

            The current code will put 4 rows in each bucket and will collect 500 buckets after having examined 50*4= 200 rows.
            39 rows that are left will need 39/4 = 9.8 rows. This will give 59.8 = 60 buckets.

            It looks like the idea that we should compute some integer number for rows_in_bucket is not a good one.
            How about allowing fractions in rows_in_bucket ...

            psergei Sergei Petrunia added a comment - 239 rows, 50 buckets... gives bucket_capacity=4.78 = 4 The current code will put 4 rows in each bucket and will collect 500 buckets after having examined 50*4= 200 rows. 39 rows that are left will need 39/4 = 9.8 rows. This will give 59.8 = 60 buckets. It looks like the idea that we should compute some integer number for rows_in_bucket is not a good one. How about allowing fractions in rows_in_bucket ...

            People

              psergei Sergei Petrunia
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.