Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21829

Use packed sort keys in Unique objects

Details

    Description

      The task deals with packing the values stored in the Unique tree for each record.

      Background

      In the implementation of Unique for variable sized fields like VARCHAR,
      space allocated for each record is the maximum length for the field though
      in most practical cases the data that is being stored in such columns
      will be way less than the maximum length in most cases.

      For example: lets say there is a query like:

      SELECT COUNT(DISTINT col)  FROM t1;
      

      and column col is defined as VARCHAR(200) in latin character set, so for each
      record in table t1 the Unique tree would reserve 200 bytes to store the value
      for column col. This is not the best way to utilize the space because this would lead
      to filling of the Unique tree quickly forcing the Unique tree to be flushed
      to the disk and then start writing to the new Unique tree.
      To get the final result then a sort-merge pass needs to be done on the chunks
      of Unique trees written to the disk

      So this limitation prompted to store packed values inside the Unique tree.
      This task is an extension to MDEV-21580.

      Unique class is currently used in

      • agg_func(DISTINCT col)
        Here most aggregate functions like SUM, AVG accept only fixed size arguments
        so it is not beneficial to use packing for these.
        Currently packing is done for COUNT and GROUP_CONCAT (or JSON_ARRAYAGG)
        aggregate function as these are meaningful
      • index-merge stores row-ids
        index merge stores row-ids which are of fixed size, so packing makes no sense here
      • Engine Independent Table statistics
        Packing is done here for variable length data types

      Implementation details

      1) Packed key format for Unique tree

      The keys are stored in a new variable-size data format called "packed".

      Lets say the record in the Unique tree looks like:

        key_length packed_value_1 packed_value2
      

      key_length is the length of the whole key. key_length is stored in 4 bytes,
      that is for each record we reserve 4 bytes to store the length of the key.

      Each packed value is encoded as follows

      <length> <value>
      

      The format of the packed_value depends on the data type.
      For "non-packable" data types like (INT, DOUBLE) there is no packing.

      NULLS are ignored for COUNT(DISTINCT col) and GROUP_CONCAT(DISTINCT col1 ....) . For EITS we try
      to find the number of distinct NON NULL values, so NULLs are not required for that.

      NULLS are only considered for JSON_ARRAYAGG currently and the packed value
      in this case is encoded as

      <null_byte> <length> <value>
      

      Note:
      For NULL values we don't store anything in the length and value bytes, only the null byte is set to zero.

      Note:
      Let say we have a query like
      SELECT COUNT(DISTINCT a) FROM t1

      This is the case where the Unique tree would store only one column. So when the key is of single component
      then the format looks like <key_length> <length> <value>. The key_length here would hold the value of <length> + bytes allocated to store length (that is let say for VARCHAR , size less than 255 has 1 byte of length else it is 2).
      We could try to save the 4 bytes for key length in this case but we have ignored it for now as this would lead
      to lots of changes in the code (mostly in merge_buffers, this code is shared by filesort and Unique class).

      Note:
      The packing format is almost the same as in Filesort, with the exception that null bytes are stored only with JSON_ARRAYAGG.

      How is packing done ?
      A packed record is made by using the pack() function for all the fields in the record.
      and when the the duplicates are removed then unpack() is called to store the value back in the record pointers
      that is in Field::ptr.

      2) Which format to use

      Contrary to sort keys in Filesort, keys in Unique tree don't have mem-comparable
      form and the comparison of keys is always done on original values of columns.
      This means that the comparisons time would not be increased by using packed
      values. Due to this reasoning packed format is used always where possible.

      We may consider adding a heuristic to disallow packing for small VARCHAR's and CHARs.

      Attachments

        Issue Links

          Activity

            cvicentiu Vicențiu Ciorbaru added a comment - https://github.com/MariaDB/server/pull/2993
            psergei Sergei Petrunia added a comment - - edited

            Task for testing this MDEV-33305 (NOT including performance)

            psergei Sergei Petrunia added a comment - - edited Task for testing this MDEV-33305 (NOT including performance)

            I have reviewed about 35% of the code, plan to review the rest during this week.
            After that, I will give it back to Vicentiu to fix things found in review (should take 1-2 hours) and then we have to tackle
            https://jira.mariadb.org/browse/MDEV-33457 and then more testing.

            monty Michael Widenius added a comment - I have reviewed about 35% of the code, plan to review the rest during this week. After that, I will give it back to Vicentiu to fix things found in review (should take 1-2 hours) and then we have to tackle https://jira.mariadb.org/browse/MDEV-33457 and then more testing.

            Vicentiu started working on this again 2 weeks ago to solve the issues found in the review. I expect to get an updated version withing 3 weeks after Vicentiu is back from the Google summer of code event in USA.
            What is left is second review (I do not expect any future changes needed) and then QA before getting pushed into 11.7 or 11.8

            monty Michael Widenius added a comment - Vicentiu started working on this again 2 weeks ago to solve the issues found in the review. I expect to get an updated version withing 3 weeks after Vicentiu is back from the Google summer of code event in USA. What is left is second review (I do not expect any future changes needed) and then QA before getting pushed into 11.7 or 11.8

            This code is in review and rework by Vicentiu.
            Vicentiu expect to have the rework solved next week, after which it comes back to me for a last review, which should be quick.

            monty Michael Widenius added a comment - This code is in review and rework by Vicentiu. Vicentiu expect to have the rework solved next week, after which it comes back to me for a last review, which should be quick.

            People

              cvicentiu Vicențiu Ciorbaru
              psergei Sergei Petrunia
              Votes:
              1 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.