[MDEV-27588] MyISAM/Aria descending index compression is not as efficient as ascending Created: 2022-01-23  Updated: 2022-01-26  Resolved: 2022-01-24

Status: Closed
Project: MariaDB Server
Component/s: Data types
Affects Version/s: N/A
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Sergei Golubchik
Resolution: Won't Fix Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-13756 Implement descending index: KEY (a DE... Closed
Relates
relates to MDEV-26664 Store UUIDs in a more efficient manner Closed

 Description   

There was some effort made in 10.7 to store UUID efficiently (MDEV-26664).
Apparently the logic does not apply (or doesn't apply well enough) to descending indexes which are being introduced in 10.8 in the scope of MDEV-13756.

Below is based on type_uuid.type_uuid_myisam test.

CREATE TABLE t1 (a UUID, KEY(a)) ENGINE=MyISAM;
 
--delimiter $
FOR i IN 0..8191 DO
INSERT INTO t1 VALUES (UUID());
END FOR $
--delimiter ;
 
SELECT
INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';
 
DROP TABLE t1;

With a standard ascending key it gives the result

INDEX_LENGTH	DATA_LENGTH	INDEX_LENGTH/DATA_LENGTH
79872	139264	0.5735

With a descending index (KEY(a DESC)) it produces

preview-10.8-MDEV-13756-desc-indexes c10e10c6

INDEX_LENGTH	DATA_LENGTH	INDEX_LENGTH/DATA_LENGTH
150528	139264	1.0809

Which, according to the same test's criteria, is a middle ground between "packed" and "not packed".



 Comments   
Comment by Alexander Barkov [ 2022-01-24 ]

The problem is not UUID related. MyISAM does not compress descending indexes well.

This script demonstrates the same problem with VARCHAR:

SET NAMES latin1;
CREATE OR REPLACE TABLE t1 (a VARCHAR(8) CHARACTER SET latin1, KEY(a)) ENGINE=MyISAM;
delimiter $
FOR c0 IN 0x61..0x7A DO
FOR c1 IN 0x61..0x7A DO
FOR c2 IN 0x61..0x7A DO
FOR c3 IN 0x61..0x7A DO
  INSERT INTO t1 VALUES (CONCAT('aaaa',CHAR(c0), CHAR(c1), CHAR(c2), CHAR(c3)));
END FOR;
END FOR;
END FOR;
END FOR;
$
delimiter ;
SELECT INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|      3836928 |     9139520 |                   0.4198 |
+--------------+-------------+--------------------------+

SET NAMES latin1;
CREATE OR REPLACE TABLE t1 (a VARCHAR(8) CHARACTER SET latin1, KEY(a DESC)) ENGINE=MyISAM;
delimiter $
FOR c0 IN 0x61..0x7A DO
FOR c1 IN 0x61..0x7A DO
FOR c2 IN 0x61..0x7A DO
FOR c3 IN 0x61..0x7A DO
  INSERT INTO t1 VALUES (CONCAT('aaaa', CHAR(c0), CHAR(c1), CHAR(c2), CHAR(c3)));
END FOR;
END FOR;
END FOR;
END FOR;
$
delimiter ;
SELECT INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|      7424000 |     9139520 |                   0.8123 |
+--------------+-------------+--------------------------+

Notice:

  • Ascending index size is 0.4198 of the data size
  • Descending index size is 0.8123 of the data size
Comment by Alexander Barkov [ 2022-01-24 ]

The problem is also repeatable with Aria+VARCHAR:

SET NAMES latin1;
CREATE OR REPLACE TABLE t1 (a VARCHAR(6) CHARACTER SET latin1, KEY(a)) ENGINE=Aria;
delimiter $
FOR c0 IN 0x61..0x7A DO
FOR c1 IN 0x61..0x7A DO
FOR c2 IN 0x61..0x7A DO
  INSERT INTO t1 VALUES (CONCAT('aaa', CHAR(c0), CHAR(c1), CHAR(c2)));
END FOR;
END FOR;
END FOR;
$
delimiter ;
SELECT INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|       253952 |      770048 |                   0.3298 |
+--------------+-------------+--------------------------+

SET NAMES latin1;
CREATE OR REPLACE TABLE t1 (a VARCHAR(6) CHARACTER SET latin1, KEY(a DESC)) ENGINE=Aria;
delimiter $
FOR c0 IN 0x61..0x7A DO
FOR c1 IN 0x61..0x7A DO
FOR c2 IN 0x61..0x7A DO
  INSERT INTO t1 VALUES (CONCAT('aaa', CHAR(c0), CHAR(c1), CHAR(c2)));
END FOR;
END FOR;
END FOR;
$
delimiter ;
SELECT INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|       466944 |      770048 |                   0.6064 |
+--------------+-------------+--------------------------+

Comment by Sergei Golubchik [ 2022-01-24 ]

Key packing is completely symmetrical. You can get better or worse results depending on the order of insertion. Here's a complete test case, based on yours:

SET NAMES latin1;
CREATE OR REPLACE TABLE t1 (a VARCHAR(8) CHARACTER SET latin1, KEY(a)) ENGINE=MyISAM;
CREATE OR REPLACE TABLE t2 (a VARCHAR(8) CHARACTER SET latin1, KEY(a DESC)) ENGINE=MyISAM;
CREATE OR REPLACE TABLE t3 (a VARCHAR(8) CHARACTER SET latin1, KEY(a)) ENGINE=MyISAM;
CREATE OR REPLACE TABLE t4 (a VARCHAR(8) CHARACTER SET latin1, KEY(a DESC)) ENGINE=MyISAM;
delimiter $;
FOR c0 IN 0x61..0x7A DO
  FOR c1 IN 0x61..0x7A DO
    FOR c2 IN 0x61..0x7A DO
      FOR c3 IN 0x61..0x7A DO
        INSERT INTO t1 VALUES (CONCAT('aaaa',CHAR(c0), CHAR(c1), CHAR(c2), CHAR(c3)));
        INSERT INTO t2 VALUES (CONCAT('aaaa',CHAR(c0), CHAR(c1), CHAR(c2), CHAR(c3)));
        INSERT INTO t3 VALUES (CONCAT('aaaa',CHAR(219-c0), CHAR(219-c1), CHAR(219-c2), CHAR(219-c3)));
        INSERT INTO t4 VALUES (CONCAT('aaaa',CHAR(219-c0), CHAR(219-c1), CHAR(219-c2), CHAR(219-c3)));
      END FOR;
    END FOR;
  END FOR;
END FOR;
$
delimiter ;$
SELECT INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='test' AND TABLE_NAME LIKE 't_';
drop table t1, t2, t3, t4;

as you can see, t1 is like in your first test case, ascending indexes, insertions happen in the ascending order. t2 has descending indexes, t3 has ascending indexes but insertions happen in the descending order, t4 has descending indexes again.

The result is

t1 3836928 9139520 0.4198 ASC index ASC insert
t2 7424000 9139520 0.8123 DESC index ASC insert
t3 7424000 9139520 0.8123 ASC index DESC insert
t4 3836928 9139520 0.4198 DESC index DESC insert

as you can see, values are packed more tightly when you insert in the index order.

Comment by Michael Widenius [ 2022-01-25 ]

The reason for this is that when you insert a new key, we only consider packing that key compared to the previous key.
To keep the code simple and not increase insert time too much, we are not considering packing the next key.

This means that if you insert ANDER and then ANDERSON, this will be stored approximately as:
ANDER <ref_to_record> <5>SON <ref_to_record>

If we insert ANDERSON and then ANDER, we get:
ANDER <ref_to_record> ANDERSON <ref_to_record>

It is of course possible to (optionally?) add repacking of the next key, but I am not sure if the increased insert time is worth it.

Note that OPTIMIZE TABLE will sort the index before inserting, which will ensure maxium key packing.

Comment by Alexander Barkov [ 2022-01-26 ]

The problem is also repeatable with not compressed keys:

DELIMITER $
CREATE OR REPLACE PROCEDURE p1(insert_desc BOOL, dt TEXT)
BEGIN
  DECLARE cmd TEXT DEFAULT 'CREATE TABLE t1 '
                           '(a DATATYPE CHARACTER SET latin1, KEY(a)) '
                           'ENGINE=MyISAM PACK_KEYS=0';
  SET NAMES latin1;
  DROP TABLE IF EXISTS t1;
  EXECUTE IMMEDIATE REPLACE(cmd, 'DATATYPE', dt);
  SHOW CREATE TABLE t1;
  FOR c0 IN 0x61..0x7A DO
  FOR c1 IN 0x61..0x7A DO
  FOR c2 IN 0x61..0x7A DO
    IF NOT insert_desc
    THEN
      INSERT INTO t1 VALUES (CONCAT('aaaa',CHAR(c0), CHAR(c1), CHAR(c2)));
    ELSE
      INSERT INTO t1 VALUES (CONCAT('aaaa',CHAR(219-c0), CHAR(219-c1), CHAR(219-c2)));
    END IF;
  END FOR;
  END FOR;
  END FOR;
  SELECT INDEX_LENGTH, DATA_LENGTH, INDEX_LENGTH/DATA_LENGTH
  FROM INFORMATION_SCHEMA.TABLES
  WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';
END;
$
DELIMITER ;

with VARCHAR:

CALL p1(false, 'VARCHAR(8)');

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|       277504 |      351520 |                   0.7894 |
+--------------+-------------+--------------------------+

CALL p1(true, 'VARCHAR(8)');

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|       536576 |      351520 |                   1.5264 |
+--------------+-------------+--------------------------+

and with CHAR:

CALL p1(false, 'CHAR(8)');

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|       276480 |      158184 |                   1.7478 |
+--------------+-------------+--------------------------+

CALL p1(true, 'CHAR(8)');

+--------------+-------------+--------------------------+
| INDEX_LENGTH | DATA_LENGTH | INDEX_LENGTH/DATA_LENGTH |
+--------------+-------------+--------------------------+
|       535552 |      158184 |                   3.3856 |
+--------------+-------------+--------------------------+

Generated at Thu Feb 08 09:54:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.