Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-26664

Store UUIDs in a more efficient manner

Details

    Description

      UUIDs are stored in a binary form, but not byte-swapped like in MySQL's uuid_to_bin https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functions.html#function_uuid-to-bin

      The order provided by UUID_TO_BIN() is not really friendly to index prefix compression:
      it only reorders the timestamp segments, while the clock-seq and the node-ID segments are still at the end. This ruins the whole idea of prefix compression.

      Let's sort UUID in a really index-friendly order, by moving the constant part to the left and the variable part to the right.

      So for comparison purposes we'll reorder segments, but keep the byte order inside the segments (i.e. in big endian order):

      MariaDB binary UUIDv1 representation, as returned by the UUID() function:
       
        llllllll-mmmm-Vhhh-vsss-nnnnnnnnnnnn
       
      Binary sortable representation:
       
         nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll
       
      Sign           Section               Bits   Bytes  Pos   PosBinSortable
      -------------  -------               ----   -----  ---   --------------
      llllllll       time low              32     4        0   12
      mmmm           time mid              16     2        4   10
      Vhhh           version and time hi   16     2        6   8
      vsss           variant and clock seq 16     2        8   6
      nnnnnnnnnnnn   node ID               48     6       10   0
      

      Attachments

        Issue Links

          Activity

            bar Alexander Barkov added a comment - - edited

            SQL Server has a data type uniqueidentifier with a good sort order.

            Tested with SQL Server 2017:

            CREATE TABLE t1 (id uniqueidentifier default NewSequentialId());
            INSERT INTO t1 VALUES ('00000000-0000-0000-0000-000000000000');
            INSERT INTO t1 VALUES ('F0000000-0000-0000-0000-000000000000');
            INSERT INTO t1 VALUES ('0000000F-0000-0000-0000-000000000000');
            INSERT INTO t1 VALUES ('00000000-F000-0000-0000-000000000000');
            INSERT INTO t1 VALUES ('00000000-000F-0000-0000-000000000000');
            INSERT INTO t1 VALUES ('00000000-0000-F000-0000-000000000000');
            INSERT INTO t1 VALUES ('00000000-0000-000F-0000-000000000000');
            INSERT INTO t1 VALUES ('00000000-0000-0000-F000-000000000000');
            INSERT INTO t1 VALUES ('00000000-0000-0000-000F-000000000000');
            INSERT INTO t1 VALUES ('00000000-0000-0000-0000-F00000000000');
            INSERT INTO t1 VALUES ('00000000-0000-0000-0000-00000000000F');
            INSERT INTO t1 VALUES ('01C0423E-F36B-1410-8001-F00000000001');
            INSERT INTO t1 VALUES ('01C0423E-F36B-1410-F001-F00000000001');
            INSERT INTO t1 VALUES ('01C0423E-F36B-F410-8001-F00000000001');
            INSERT INTO t1 VALUES ('01C0423E-FF6B-F410-8001-F00000000001');
            INSERT INTO t1 VALUES ('01C0423E-F36B-1410-8001-400000000001');
            INSERT INTO t1 VALUES ('01C0423E-F36B-1410-8001-400000000002');
            INSERT INTO t1 VALUES ('01C0424E-F36B-1410-8001-500000000001');
            INSERT INTO t1 VALUES ('02C0424E-F36B-1410-8001-500000000001');
            INSERT INTO t1 VALUES (default);
            INSERT INTO t1 VALUES (default);
            SELECT * FROM t1 ORDER BY id;
            

            00000000-0000-0000-0000-000000000000
            F0000000-0000-0000-0000-000000000000
            0000000F-0000-0000-0000-000000000000
            00000000-F000-0000-0000-000000000000
            00000000-000F-0000-0000-000000000000
            00000000-0000-F000-0000-000000000000
            00000000-0000-000F-0000-000000000000
            00000000-0000-0000-000F-000000000000
            00000000-0000-0000-F000-000000000000
            00000000-0000-0000-0000-00000000000F
            01C0423E-F36B-1410-8001-400000000001
            01C0423E-F36B-1410-8001-400000000002
            01C0424E-F36B-1410-8001-500000000001
            02C0424E-F36B-1410-8001-500000000001
            0BC0423E-F36B-1410-800B-800000000000
            0CC0423E-F36B-1410-800C-800000000000
            00000000-0000-0000-0000-F00000000000
            01C0423E-F36B-1410-8001-F00000000001
            01C0423E-F36B-F410-8001-F00000000001
            01C0423E-FF6B-F410-8001-F00000000001
            01C0423E-F36B-1410-F001-F00000000001
            

            However, the NewSequentialId() function in SQL Server puts the timestamp segments of the UUID in little endian order, while MariaDB function UUID() puts the timestamp segments in big endian order.

            bar Alexander Barkov added a comment - - edited SQL Server has a data type uniqueidentifier with a good sort order. Tested with SQL Server 2017: CREATE TABLE t1 (id uniqueidentifier default NewSequentialId()); INSERT INTO t1 VALUES ( '00000000-0000-0000-0000-000000000000' ); INSERT INTO t1 VALUES ( 'F0000000-0000-0000-0000-000000000000' ); INSERT INTO t1 VALUES ( '0000000F-0000-0000-0000-000000000000' ); INSERT INTO t1 VALUES ( '00000000-F000-0000-0000-000000000000' ); INSERT INTO t1 VALUES ( '00000000-000F-0000-0000-000000000000' ); INSERT INTO t1 VALUES ( '00000000-0000-F000-0000-000000000000' ); INSERT INTO t1 VALUES ( '00000000-0000-000F-0000-000000000000' ); INSERT INTO t1 VALUES ( '00000000-0000-0000-F000-000000000000' ); INSERT INTO t1 VALUES ( '00000000-0000-0000-000F-000000000000' ); INSERT INTO t1 VALUES ( '00000000-0000-0000-0000-F00000000000' ); INSERT INTO t1 VALUES ( '00000000-0000-0000-0000-00000000000F' ); INSERT INTO t1 VALUES ( '01C0423E-F36B-1410-8001-F00000000001' ); INSERT INTO t1 VALUES ( '01C0423E-F36B-1410-F001-F00000000001' ); INSERT INTO t1 VALUES ( '01C0423E-F36B-F410-8001-F00000000001' ); INSERT INTO t1 VALUES ( '01C0423E-FF6B-F410-8001-F00000000001' ); INSERT INTO t1 VALUES ( '01C0423E-F36B-1410-8001-400000000001' ); INSERT INTO t1 VALUES ( '01C0423E-F36B-1410-8001-400000000002' ); INSERT INTO t1 VALUES ( '01C0424E-F36B-1410-8001-500000000001' ); INSERT INTO t1 VALUES ( '02C0424E-F36B-1410-8001-500000000001' ); INSERT INTO t1 VALUES ( default ); INSERT INTO t1 VALUES ( default ); SELECT * FROM t1 ORDER BY id; 00000000-0000-0000-0000-000000000000 F0000000-0000-0000-0000-000000000000 0000000F-0000-0000-0000-000000000000 00000000-F000-0000-0000-000000000000 00000000-000F-0000-0000-000000000000 00000000-0000-F000-0000-000000000000 00000000-0000-000F-0000-000000000000 00000000-0000-0000-000F-000000000000 00000000-0000-0000-F000-000000000000 00000000-0000-0000-0000-00000000000F 01C0423E-F36B-1410-8001-400000000001 01C0423E-F36B-1410-8001-400000000002 01C0424E-F36B-1410-8001-500000000001 02C0424E-F36B-1410-8001-500000000001 0BC0423E-F36B-1410-800B-800000000000 0CC0423E-F36B-1410-800C-800000000000 00000000-0000-0000-0000-F00000000000 01C0423E-F36B-1410-8001-F00000000001 01C0423E-F36B-F410-8001-F00000000001 01C0423E-FF6B-F410-8001-F00000000001 01C0423E-F36B-1410-F001-F00000000001 However, the NewSequentialId() function in SQL Server puts the timestamp segments of the UUID in little endian order, while MariaDB function UUID() puts the timestamp segments in big endian order.
            bar Alexander Barkov added a comment - - edited

            It seems if we change the byte order in Field_uuid::store() to make it better sortable, we'll need to put the exact data type name to the row binary log - implement some extension on top of MDEV-20477, so replication can distinguish UUID from BINARY(16). This is needed to fix the byte order properly:

            • change byte order from binary to sortable when replicating from BINARY(16) to UUID
            • change byte order from sortable to binary when replicating from UUID to BINARY(16)
            • preserve byte order as is when replicating from UUID to UUID

            Another option is to put UUID to binlog in "visible" binary representation instead of "sortable" binary representation. This is simpler to implement.

            bar Alexander Barkov added a comment - - edited It seems if we change the byte order in Field_uuid::store() to make it better sortable, we'll need to put the exact data type name to the row binary log - implement some extension on top of MDEV-20477 , so replication can distinguish UUID from BINARY(16). This is needed to fix the byte order properly: change byte order from binary to sortable when replicating from BINARY(16) to UUID change byte order from sortable to binary when replicating from UUID to BINARY(16) preserve byte order as is when replicating from UUID to UUID Another option is to put UUID to binlog in "visible" binary representation instead of "sortable" binary representation. This is simpler to implement.

            commit 5cfb31854a4 is ok to push (to the preview-* branch, of course)

            serg Sergei Golubchik added a comment - commit 5cfb31854a4 is ok to push (to the preview-* branch, of course)
            danblack Daniel Black added a comment -

            As noted by rjames, the clock sequence number looks out of place. The current implementation of UUID uses a random number https://github.com/MariaDB/server/blob/10.7/mysys/my_uuid.c#L71 (allowed as fallback option in https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.5).

            So to ensure its randomness is localized and doesn't prevent big splits/sorts of index pages shouldn't the sss be on the far right like this:

            nnnnnnnnnnnn-vVhh-hmmm-mllllllllsss

            This would mean inserts that happen to have the same except (unlikely as time low is in a 100ns interval) for the sss occur minor local inserts.

            Otherwise in the nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll from, sequential uuids from UUIDv1 (or v2) generators, MariaDB or not, could on a very different index page.

            danblack Daniel Black added a comment - As noted by rjames , the clock sequence number looks out of place. The current implementation of UUID uses a random number https://github.com/MariaDB/server/blob/10.7/mysys/my_uuid.c#L71 (allowed as fallback option in https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.5 ). So to ensure its randomness is localized and doesn't prevent big splits/sorts of index pages shouldn't the sss be on the far right like this: nnnnnnnnnnnn-vVhh-hmmm-mllllllllsss This would mean inserts that happen to have the same except (unlikely as time low is in a 100ns interval) for the sss occur minor local inserts. Otherwise in the nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll from, sequential uuids from UUIDv1 (or v2) generators, MariaDB or not, could on a very different index page.
            bar Alexander Barkov added a comment - - edited

            The "sss" part is initialized only once, during the server start up and is copied to all UUID values
            generated by the UUID() function, until the server shutdown. So it rather belongs to the constant part of the value.

            Moving it to the right is desirable on one hand, to make sure the rule "the earlier UUID was generated, the higher it goes in the ORDER BY output" is true even after the server restart.

            Moving it to the right is not desirable on the other hand: It will make the prefix compression worse.

            serg, what do you think?

            bar Alexander Barkov added a comment - - edited The "sss" part is initialized only once, during the server start up and is copied to all UUID values generated by the UUID() function, until the server shutdown. So it rather belongs to the constant part of the value. Moving it to the right is desirable on one hand, to make sure the rule "the earlier UUID was generated, the higher it goes in the ORDER BY output" is true even after the server restart. Moving it to the right is not desirable on the other hand: It will make the prefix compression worse. serg , what do you think?
            rjasdfiii Rick James added a comment -

            The format I pushed for (as far back as 2012) was

            Vhhh-mmmm-llllllll-vsss-nnnnnnnnnnnn

            That shuffled the main time parts so that the index would have good "locality of reference" when being interested in records created about the same time. I did not worry about vsss or nnnnnnnnnnnn.

            Putting nnnnnnnnnnnn first is not "bad". It clusters the rows from the same source in preference to the time parts.

            V and v are constants since only MySQL/MySQL is building these ids. Their location is not relevant because no one is likely to ever see that the storage has them in the "wrong" places. As a related note, no one cares that INTs are stored big-endian versus little-endian.

            There may be a binary compatibility argument for shuffling the bits exactly as MySQL 8.0 does. It might be exactly what I pushed for 9 years go, which is about when I started lobbying Oracle folks.

            rjasdfiii Rick James added a comment - The format I pushed for (as far back as 2012) was Vhhh-mmmm-llllllll-vsss-nnnnnnnnnnnn That shuffled the main time parts so that the index would have good "locality of reference" when being interested in records created about the same time. I did not worry about vsss or nnnnnnnnnnnn. Putting nnnnnnnnnnnn first is not "bad". It clusters the rows from the same source in preference to the time parts. V and v are constants since only MySQL/MySQL is building these ids. Their location is not relevant because no one is likely to ever see that the storage has them in the "wrong" places. As a related note, no one cares that INTs are stored big-endian versus little-endian. There may be a binary compatibility argument for shuffling the bits exactly as MySQL 8.0 does. It might be exactly what I pushed for 9 years go, which is about when I started lobbying Oracle folks.
            rjasdfiii Rick James added a comment -

            Alexander Barkov's point about prefix compression is interesting. Does that imply that all of Vvsssnnnnnnnnnnnn should go first?

            rjasdfiii Rick James added a comment - Alexander Barkov's point about prefix compression is interesting. Does that imply that all of Vvsssnnnnnnnnnnnn should go first?
            bar Alexander Barkov added a comment - - edited

            Yes, from the index size point of view it's more efficient if we put Vvsssnnnnnnnnnnnn first.

            The exact segment order in the current prototype
            (in https://github.com/MariaDB/server/tree/bb-10.7-bar-uuid)
            is:

            nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll

            "nnnnnnnnnnnn-vsss-V" is a constant part (at least between the server restarts).

            "hhh-mmmm-llllllll" is the timestamp, with big endian byte order.

            bar Alexander Barkov added a comment - - edited Yes, from the index size point of view it's more efficient if we put Vvsssnnnnnnnnnnnn first. The exact segment order in the current prototype (in https://github.com/MariaDB/server/tree/bb-10.7-bar-uuid ) is: nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll "nnnnnnnnnnnn-vsss-V" is a constant part (at least between the server restarts). "hhh-mmmm-llllllll" is the timestamp, with big endian byte order.
            rjasdfiii Rick James added a comment -

            While on the topic, can someone tell me whether InnoDB always uses index-prefixing? And are there cases where it does not? Is the prefixing done as some number of bytes? Would different forms of column-compression lend themselves better to index-prefixing?

            rjasdfiii Rick James added a comment - While on the topic, can someone tell me whether InnoDB always uses index-prefixing? And are there cases where it does not? Is the prefixing done as some number of bytes? Would different forms of column-compression lend themselves better to index-prefixing?
            danblack Daniel Black added a comment -

            > The "sss" part is initialized only once, during the server start up and is copied to all UUID values
            > generated by the UUID() function, until the server shutdown. So it rather belongs to the constant part of the value.

            Ok. But a UUIDv1 generation that occurs in an application language (https://github.com/python/cpython/blob/main/Lib/uuid.py#L694 could just be random).
            Or the more_entropy (opt on uniqid) random bit of https://github.com/php/php-src/blob/master/ext/standard/uniqid.c#L82

            Breaking apart the 4 bit version from Vhhh or the 3 bit variant from vsss seems too excessive. How does this look?

            nnnnnnnnnnnn-Vhhh-mmmm-llllllll-vsss

            The node + Vhhh comes to 8 bytes, probably more as mmmm (2 bytes) is pretty constant too for rapid insertion. So we've sacrified 2 bytes of potentially constant (if MariaDB) generated.

            > Moving it to the right is desirable on one hand, to make sure the rule "the earlier UUID was generated, the higher it goes in the ORDER BY output" is true even after the server restart.

            I think I've still got that above.

            > Moving it to the right is not desirable on the other hand: It will make the prefix compression worse.

            Sacrifice the prefix compression of 2 bytes out of 10 -> 12 bytes because non-us generators could be random sss. I think that's fair. Thoughts?

            danblack Daniel Black added a comment - > The "sss" part is initialized only once, during the server start up and is copied to all UUID values > generated by the UUID() function, until the server shutdown. So it rather belongs to the constant part of the value. Ok. But a UUIDv1 generation that occurs in an application language ( https://github.com/python/cpython/blob/main/Lib/uuid.py#L694 could just be random). Or the more_entropy (opt on uniqid) random bit of https://github.com/php/php-src/blob/master/ext/standard/uniqid.c#L82 Breaking apart the 4 bit version from Vhhh or the 3 bit variant from vsss seems too excessive. How does this look? nnnnnnnnnnnn-Vhhh-mmmm-llllllll-vsss The node + Vhhh comes to 8 bytes, probably more as mmmm (2 bytes) is pretty constant too for rapid insertion. So we've sacrified 2 bytes of potentially constant (if MariaDB) generated. > Moving it to the right is desirable on one hand, to make sure the rule "the earlier UUID was generated, the higher it goes in the ORDER BY output" is true even after the server restart. I think I've still got that above. > Moving it to the right is not desirable on the other hand: It will make the prefix compression worse. Sacrifice the prefix compression of 2 bytes out of 10 -> 12 bytes because non-us generators could be random sss. I think that's fair. Thoughts?

            rjasdfiii, InnoDB specifically does not perform prefix compression. But some other engines do.

            bar Alexander Barkov added a comment - rjasdfiii , InnoDB specifically does not perform prefix compression. But some other engines do.
            bar Alexander Barkov added a comment - - edited

            danblack, I agree, for externally generated UUIDs, it's better to place "vsss" at the end.

            Just tested this script:

            SET default_storage_engine=MyISAM;
            CREATE OR REPLACE TABLE t1 (a UUID, INDEX(a));
            SHOW CREATE TABLE t1;
            BEGIN;
            DELIMITER $$
            FOR i IN 0..0xFFFF
            DO
              INSERT INTO t1 VALUES (UUID());
            END FOR
            $$
            DELIMITER ;
            COMMIT;
            show table status like 't1'\G
            

            nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll gives the ratio of 0.56:

                 Data_length: 1114112
                Index_length: 624640
            

            nnnnnnnnnnnn-Vhhh-mmmm-llllllll-vsss gives the ratio of 0.68:

                 Data_length: 1114112
                Index_length: 764928
            

            It's probably still good enough.

            Note, the "user visible" order llllllll-mmmm-Vhhh-vsss-nnnnnnnnnnnn, which is now in the preview branch, gives the ratio of 1.44. This one is really bad.

                 Data_length: 1114112
                Index_length: 1611776
            

            bar Alexander Barkov added a comment - - edited danblack , I agree, for externally generated UUIDs, it's better to place "vsss" at the end. Just tested this script: SET default_storage_engine=MyISAM; CREATE OR REPLACE TABLE t1 (a UUID, INDEX (a)); SHOW CREATE TABLE t1; BEGIN ; DELIMITER $$ FOR i IN 0..0xFFFF DO INSERT INTO t1 VALUES (UUID()); END FOR $$ DELIMITER ; COMMIT ; show table status like 't1' \G nnnnnnnnnnnn-vsss-Vhhh-mmmm-llllllll gives the ratio of 0.56: Data_length: 1114112 Index_length: 624640 nnnnnnnnnnnn-Vhhh-mmmm-llllllll-vsss gives the ratio of 0.68: Data_length: 1114112 Index_length: 764928 It's probably still good enough. Note, the "user visible" order llllllll-mmmm-Vhhh-vsss-nnnnnnnnnnnn , which is now in the preview branch, gives the ratio of 1.44. This one is really bad. Data_length: 1114112 Index_length: 1611776

            The UUIDv1 standard specifies that

            • sss is initialized to a random value
            • every time the system clock (hhh-mmmm-llllllll) goes backward (a new generated value is less than the previous one), sss is incremented.

            So clearly sss should go before the system clock value, as in sss-hhh-mmmm-llllllll

            serg Sergei Golubchik added a comment - The UUIDv1 standard specifies that sss is initialized to a random value every time the system clock ( hhh-mmmm-llllllll ) goes backward (a new generated value is less than the previous one), sss is incremented. So clearly sss should go before the system clock value, as in sss-hhh-mmmm-llllllll
            serg Sergei Golubchik added a comment - - edited

            ok to push (into a preview branch)

            serg Sergei Golubchik added a comment - - edited ok to push (into a preview branch)

            Pushed into preview-10.7-MDEV-4958-uuid.

            bar Alexander Barkov added a comment - Pushed into preview-10.7- MDEV-4958 -uuid.
            someniatko Illia Somov added a comment -

            This should have been only applied to UUIDv1...

            someniatko Illia Somov added a comment - This should have been only applied to UUIDv1...

            People

              bar Alexander Barkov
              greenman Ian Gilfillan
              Votes:
              2 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.