MariaDB Server / MDEV-21130

Histograms: use JSON as on-disk format

Details

    Description

      Currently, histograms are stored as an array of 1-byte bucket bounds (SINGLE_PREC_HB) or 2-byte bucket bounds (DOUBLE_PREC_HB).

      The table storing the histograms supports different histogram formats but limits them to 255 bytes (hist_size is a tinyint unsigned).

      CREATE TABLE mysql.column_stats (
        min_value varbinary(255) DEFAULT NULL, 
        max_value varbinary(255) DEFAULT NULL, 
        ...
        hist_size tinyint unsigned, 
        hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), 
        histogram varbinary(255), 
        ...
      

      This prevents us from supporting other kinds of histograms.

      The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

      The idea of this MDEV is to switch to JSON as storage format for histograms.

      If we do that, it will:

      • Improve the histogram precision
      • Allow the DBAs to examine the histograms
      • Enable other histogram types to be collected/used.

      Milestone-1:

      Let histogram_type have another possible value, tentatively named "JSON".
      When that is set, let the ANALYZE TABLE syntax collect a dummy JSON "histogram":

        { "hello":"world"}
      

      that is, the following should work:

      set histogram_type='json';
      analyze table t1 persistent for all;
      select histogram from mysql.column_stats where table_name='t1';
      

      this should produce {"hello":"world"}.

      Milestone-2: produce a JSON histogram.

      The exact format is not specified yet; for now, print the bucket endpoints and produce output like this:

      [
        "value1",
        "value2",
        ...
      ]
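The serialization step can be sketched in a few lines. This is an illustrative standalone version, not the server's actual code; the function and helper names are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: escape the two characters that matter inside a
// JSON string literal for this sketch.
static std::string json_escape(const std::string &s)
{
  std::string out;
  for (char c : s)
  {
    if (c == '"' || c == '\\')
      out += '\\';
    out += c;
  }
  return out;
}

// Serialize bucket endpoints into the JSON array format shown above.
std::string endpoints_to_json(const std::vector<std::string> &endpoints)
{
  std::string out= "[\n";
  for (size_t i= 0; i < endpoints.size(); i++)
  {
    out += "  \"" + json_escape(endpoints[i]) + "\"";
    if (i + 1 < endpoints.size())
      out += ",";
    out += "\n";
  }
  out += "]";
  return out;
}
```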
      

      Milestone-2, part#2: make mysql.column_stats.histogram a blob.

      Milestone-3: Parse the JSON back into an array

      Figure out how to use the JSON parser.
      Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
      (Additional input provided on Zulip re parsing valid/invalid JSON histograms)
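To make the milestone concrete, here is a minimal sketch of parsing a flat JSON array of strings and printing the values to stderr. The server would use its own JSON parser; this hand-rolled version (with a hypothetical function name) only illustrates the intended behavior, including throwing on invalid input so the error can be surfaced later.

```cpp
#include <cctype>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Parse a flat JSON array of strings, e.g. ["value1","value2"].
// On success, print each value to stderr (Milestone-3 behavior);
// on malformed input, throw so the caller can report the error.
std::vector<std::string> parse_histogram_array(const std::string &json)
{
  std::vector<std::string> values;
  size_t i= 0;
  auto skip_ws= [&]() {
    while (i < json.size() && isspace((unsigned char) json[i]))
      i++;
  };
  skip_ws();
  if (i >= json.size() || json[i++] != '[')
    throw std::runtime_error("expected '['");
  skip_ws();
  while (i < json.size() && json[i] != ']')
  {
    if (json[i++] != '"')
      throw std::runtime_error("expected '\"'");
    std::string val;
    while (i < json.size() && json[i] != '"')
    {
      if (json[i] == '\\' && i + 1 < json.size())
        i++;                          // unescape \" and backslash
      val += json[i++];
    }
    if (i++ >= json.size())
      throw std::runtime_error("unterminated string");
    values.push_back(val);
    skip_ws();
    if (i < json.size() && json[i] == ',')
    {
      i++;
      skip_ws();
    }
  }
  if (i >= json.size())
    throw std::runtime_error("expected ']'");
  for (const std::string &v : values)
    fprintf(stderr, "parsed value: %s\n", v.c_str());
  return values;
}
```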

      Milestone-4: Make the code support different kinds of Histograms

      Currently, there's only one type of histogram.

      • Smaller issue: histogram lookup functions assume the histogram stores fractions, not values.
      • Bigger issue: memory allocation for histograms is de-coupled from reading the histograms. See alloc_statistics_for_table and read_histograms_for_table.

      The histogram object lives in a data structure that is bzero'ed first and then filled later. (IIRC there was a bug, since fixed, where the optimizer attempted to use a bzero'ed histogram.)

      Can histograms be collected or loaded in parallel by several threads? This was an (unintentional?) possibility, but it was later disabled (see the TABLE_STATISTICS_CB object and its use).

      Step #0: Make Histogram a real class

      Here's the commit:
      https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

      Step 1: Separate classes for binary and JSON histograms

      Need to introduce

      class Histogram -- interface, no data members.
      class Histogram_binary : public Histogram
      class Histogram_json : public Histogram
      

      and a factory function

      Histogram *create_histogram(Histogram_type)
      

      for now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.

      Step 2: Demonstrate saving/loading of histograms

      Now, the code already can:

      • collect a JSON histogram and save it.
      • when loading a histogram, determine from the histogram_type column that a JSON histogram is being loaded, create a Histogram_json, and invoke the parse function.

      At the moment, the parse function only prints to stderr.
      However, we should catch parse errors and make sure they are reported to the client.
      The test may look like this:

      INSERT INTO mysql.column_stats VALUES('test','t1','column1', .... '[invalid, json, data']);
      FLUSH TABLES;
      # this should print some descriptive error text
      --error NNNN
      select * from test.t1;
      

      Milestone-5: Parse the JSON data into a structure that allows lookups.

      The structure is

      std::vector<std::string>
      

      and it holds the data in KeyTupleFormat. (See the comments for the reasoning. There was a suggestion to use in_vector, which is what IN subqueries use, but it didn't work out.)
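With the bounds sorted in a byte-comparable format, lookups reduce to a binary search. The sketch below uses plain std::string comparison in place of KeyTupleFormat comparison, and the function name is illustrative.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return the fraction of bucket bounds that lie strictly below `key`.
// With equal-height buckets this fraction is the building block for
// selectivity estimates. `bounds` must be sorted; comparison is plain
// byte-wise string order here, standing in for KeyTupleFormat order.
double position_in_histogram(const std::vector<std::string> &bounds,
                             const std::string &key)
{
  if (bounds.empty())
    return 0.0;
  size_t idx= std::lower_bound(bounds.begin(), bounds.end(), key) -
              bounds.begin();
  return (double) idx / (double) bounds.size();
}
```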

      Milestone 5.1 (aka Milestone 44)

      Make a function to estimate selectivity using the data structure specified in previous milestone.

      Make range_selectivity() accept key_range parameters.

      (Currently, it accepts fractions, which are only suitable for binary histograms.)
      This means Histogram_binary will need access to min_value and max_value to compute the fractions.
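The conversion Histogram_binary needs can be sketched as follows, with the column value simplified to a double. The function name and the clamping behavior are assumptions for illustration, not the server's API.

```cpp
// Map a column value onto a fraction of the [min_value, max_value]
// interval, which is the position format a binary histogram works in.
// Endpoints outside the interval clamp to 0.0 / 1.0.
double value_to_fraction(double val, double min_value, double max_value)
{
  if (max_value <= min_value)
    return 0.0;                 // degenerate column: a single distinct value
  if (val <= min_value)
    return 0.0;
  if (val >= max_value)
    return 1.0;
  return (val - min_value) / (max_value - min_value);
}
```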

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia made changes -
            Field Original Value New Value
            Labels eits
            psergei Sergei Petrunia made changes -
            julien.fritsch Julien Fritsch made changes -
            Fixing Priority 250
            cvicentiu Vicențiu Ciorbaru made changes -
            Labels eits eits gsoc20
            cvicentiu Vicențiu Ciorbaru made changes -
            Labels eits gsoc20 eits gsoc20 gsoc21
            robertbindar Robert Bindar made changes -
            Assignee Robert Bindar [ robertbindar ]
            robertbindar Robert Bindar made changes -
            Assignee Robert Bindar [ robertbindar ] Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3
            TODO
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3
            TODO
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back
            (DRAFT)
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 into something that allows binary search:

            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            h2. Milestone-4
            TODO
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back
            (DRAFT)
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 into something that allows binary search:

            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            h2. Milestone-4
            TODO
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back
            (DRAFT)
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 into something that allows binary search:

            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            h2. Milestone-4
            TODO
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back
            (DRAFT)
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 into something that allows binary search:

            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            h2. Milestone-4
            TODO
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Make the code support different kinds of Histograms
            Currently, there's only one type of histogram.

            smaller issue: histogram lookup functions assume the histogram stores fractions,
            not values.

            bigger issue: memory allocation for histograms is de-coupled from reading
            the histograms. See alloc_statistics_for_table, read_histograms_for_table.

            The histogram object lives in a data structure that is bzero'ed first and
            then filled later (IIRC there was a bug (fixed) where the optimizer attempted
            to use bzero'ed histogram)

            Can histograms be collected or loaded in parallel by several threads?
            This was an (unintentional?) possibility but then it was disabled (see
            TABLE_STATISTICS_CB object and its use)


            h2. Milestone-5: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
            psergei Sergei Petrunia made changes -
            Labels eits gsoc20 gsoc21 RM_107_OPTIMIZER_V1 eits gsoc20 gsoc21
            psergei Sergei Petrunia made changes -
            Labels RM_107_OPTIMIZER_V1 eits gsoc20 gsoc21 eits gsoc20 gsoc21
            serg Sergei Golubchik made changes -
            Fix Version/s 10.7 [ 24805 ]
            serg Sergei Golubchik made changes -
Priority Major [ 3 ] → Critical [ 2 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Due Date 2021-09-14
            psergei Sergei Petrunia made changes -
Description
            h3. Step #0: Make Histogram a real class
            Here's the commit:
            https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

            h3. Step 1: Separate classes for binary and JSON histograms

            Need to introduce
            {code}
            class Histogram -- interface, no data members.
            class Histogram_binary : public Histogram
            class Histogram_json : public Histogram
            {code}
            and a factory function
            {code}
            Histogram *create_histogram(Histogram_type)
            {code}

For now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.
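The class split and factory can be sketched as below. The class and function names come from this ticket; the {{Histogram_type}} enum value {{JSON_HB}} and the fraction-valued {{Histogram_binary}} bodies are placeholders of mine, not the server's actual code:

```cpp
#include <cassert>
#include <memory>

// JSON_HB is a hypothetical name for the new type's enum value.
enum Histogram_type { SINGLE_PREC_HB, DOUBLE_PREC_HB, JSON_HB };

// Interface with no data members, as proposed in Step 1.
class Histogram
{
public:
  virtual double point_selectivity(double pos) const= 0;
  virtual double range_selectivity(double min_pos, double max_pos) const= 0;
  virtual ~Histogram()= default;
};

// Existing fraction-based histogram; bodies are stand-ins for the
// real bucket-walking logic, which is elided in this sketch.
class Histogram_binary : public Histogram
{
public:
  double point_selectivity(double pos) const override { return pos; }
  double range_selectivity(double min_pos, double max_pos) const override
  { return max_pos - min_pos; }
};

// JSON histogram: per the ticket, return the 0.1 / 0.5 stubs for now.
class Histogram_json : public Histogram
{
public:
  double point_selectivity(double) const override { return 0.1; }
  double range_selectivity(double, double) const override { return 0.5; }
};

// Factory function keyed on the histogram type.
Histogram *create_histogram(Histogram_type type)
{
  if (type == JSON_HB)
    return new Histogram_json;
  return new Histogram_binary;
}
```

Callers that today construct a histogram directly would go through create_histogram(), so adding further histogram kinds later only touches the factory.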

h3. Step 2: Make point_selectivity and range_selectivity accept endpoints as parameters.

(Currently they accept fractions, which is only suitable for binary histograms.)
This means Histogram_binary will need access to min_value and max_value to compute the fractions.

            h3. Step 3: Separate histogram construction from usage


            psergei Sergei Petrunia made changes -
Description
h3. Step 2: Demonstrate saving/loading of histograms
At this point, the code can already:
- collect a JSON histogram and save it.
- when loading a histogram, determine from the {{histogram_type}} column that a JSON histogram is being loaded, create a Histogram_json, and invoke the parse function.

The parse function currently only prints to stderr.
However, we should catch parse errors and make sure they are reported to the client.
The test may look like this:
{code}
INSERT INTO mysql.column_stats VALUES('test','t1','column1', .... '[invalid, json, data']);
FLUSH TABLES;
# this should print some descriptive error text
--error NNNN
select * from test.t1;
{code}
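One shape the error propagation could take, sketched in isolation: the parse function returns a result the caller can forward to the client instead of printing to stderr. The {{Parse_result}} struct and {{parse_histogram_json}} helper are hypothetical; the real server would report through its own error mechanism (an ER_ code), and a real check would run the full JSON parser rather than this minimal probe:

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: success flag plus a descriptive message
// that the caller can forward to the client on failure.
struct Parse_result
{
  bool ok;
  std::string error;
};

// Minimal validity probe for this sketch: a histogram must at least be
// a JSON array, i.e. start with '[' and end with ']'. A real
// implementation would run the JSON parser and keep its error position.
static Parse_result parse_histogram_json(const std::string &data)
{
  if (data.size() < 2 || data.front() != '[' || data.back() != ']')
  {
    Parse_result r= { false, "Invalid histogram: not a JSON array" };
    return r;
  }
  Parse_result r= { true, "" };
  return r;
}
```

With this shape, the load path that created the Histogram_json can turn a false result into the client-visible error that the mysql-test case above expects.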

            psergei Sergei Petrunia made changes -
Status Open [ 1 ] → In Progress [ 3 ]
            psergei Sergei Petrunia made changes -
Description
h2. Milestone-5: Parse the JSON data into a structure that allows lookups.
The structure is
{code}
std::vector<std::string>
{code}
and it holds the data in KeyTupleFormat (see the comments for the reasoning; there was a suggestion to use {{in_vector}}, which is what IN subqueries use, but it didn't work out).
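A minimal sketch of the lookup, assuming the stored endpoints are in a byte-comparable encoding (as KeyTupleFormat keys are), so plain {{std::string}} ordering finds the containing bucket. The function name is illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the first bucket whose upper endpoint is >= key,
// i.e. the bucket the key falls into. Assumes `bounds` is sorted and the
// encoding makes lexicographic comparison equivalent to value comparison.
std::size_t find_bucket(const std::vector<std::string>& bounds,
                        const std::string& key) {
  return static_cast<std::size_t>(
      std::lower_bound(bounds.begin(), bounds.end(), key) - bounds.begin());
}
```

A sorted {{std::vector}} plus binary search keeps the structure compact and gives O(log n) lookups without needing a separate index.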

h2. Milestone 5.1 (aka Milestone 44)
Make a function that estimates selectivity using the data structure specified in the previous milestone.
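The estimation idea, sketched under the assumption of equal-height buckets (each of the N buckets holds roughly 1/N of the rows): the selectivity of a range is approximately the number of buckets the range spans divided by N. The function name is illustrative, and the estimate is deliberately bucket-granular:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Coarse range-selectivity estimate over sorted, byte-comparable bucket
// endpoints. Assumes lo <= hi; each bucket contributes 1/N of the rows.
double range_selectivity_est(const std::vector<std::string>& bounds,
                             const std::string& lo, const std::string& hi) {
  if (bounds.empty())
    return 0.0;
  std::size_t lo_b = static_cast<std::size_t>(
      std::lower_bound(bounds.begin(), bounds.end(), lo) - bounds.begin());
  std::size_t hi_b = static_cast<std::size_t>(
      std::lower_bound(bounds.begin(), bounds.end(), hi) - bounds.begin());
  // Count the buckets touched by [lo, hi]; clamp to 1.0 for ranges that
  // extend past the last endpoint.
  double covered = static_cast<double>(hi_b - lo_b + 1);
  return std::min(1.0, covered / static_cast<double>(bounds.size()));
}
```

A refinement would interpolate within the partially-covered first and last buckets instead of counting them as whole buckets.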

h2. Make range_selectivity() accept key_range parameters.

(Currently, it accepts fractions, which are only suitable for binary histograms.)
This means Histogram_binary will need access to min_value and max_value to compute the fractions.
