[MDEV-19028] Addressing the contraction problem with Engine Independent statistics Created: 2019-03-23  Updated: 2021-03-19

Status: Open
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.2, 10.3, 10.4
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Varun Gupta (Inactive) Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None


 Description   

Filing the contraction problem mentioned in MDEV-18899 as a separate issue.

Contraction problem

Also, the underlying code should be checked for contraction compatibility. The code copying to column_statistics.min_value should make sure not to break contractions in the middle, otherwise max_value can be very far from the actual maximum value.

For example, consider this data in combination with Czech collation:

CONCAT(REPEAT('x',254), 'ch')

'ch' is a separate letter which is sorted between 'h' and 'i':
http://collation-charts.org/mysql60/mysql604.utf8_czech_ci.html

'ch' should not be broken into parts when copying to column_statistics.min_value:
'c' cannot be the last (255th) byte in column_statistics.min_value, because it was followed by 'h' in the original full-length data. The copying code should store only the REPEAT('x',254) part.

For column_statistics.max_value, the copying is even harder: the code should replace 'ch' with the character that immediately follows 'ch' in the collation, which is 'i'.
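
The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server code: the collation is reduced to a single hypothetical table (CONTRACTION_SUCCESSOR) holding the 'ch' contraction and the letter 'i' that follows it, and the helper names cut_bytes, broken_contraction, truncate_min and truncate_max are invented for this sketch. Only the case where the cut splits a contraction is handled.

```python
# Toy collation data (assumption): one contraction and the letter that sorts right after it.
CONTRACTION_SUCCESSOR = {"ch": "i"}


def cut_bytes(value, limit):
    """Multi-byte safe cut: keep whole characters while their UTF-8 bytes fit into `limit`."""
    used = 0
    for pos, ch in enumerate(value):
        used += len(ch.encode("utf-8"))
        if used > limit:
            return value[:pos]
    return value


def broken_contraction(value, prefix):
    """Return (contraction, start) if the cut after `prefix` splits a contraction, else None."""
    if len(prefix) == len(value):
        return None  # nothing was cut off
    for contraction in CONTRACTION_SUCCESSOR:
        for kept in range(1, len(contraction)):
            start = len(prefix) - kept
            if start >= 0 and value.startswith(contraction, start):
                return contraction, start
    return None


def truncate_min(value, limit):
    """Lower bound: drop the partial contraction instead of keeping its first letter."""
    prefix = cut_bytes(value, limit)
    broken = broken_contraction(value, prefix)
    return prefix if broken is None else prefix[:broken[1]]


def truncate_max(value, limit):
    """Upper bound: replace the split contraction with the letter sorting right after it."""
    prefix = cut_bytes(value, limit)
    broken = broken_contraction(value, prefix)
    if broken is None:
        return prefix
    contraction, start = broken
    return prefix[:start] + CONTRACTION_SUCCESSOR[contraction]


data = "x" * 254 + "ch"
print(len(truncate_min(data, 255)))   # length of the stored lower bound
print(truncate_max(data, 255)[-2:])   # tail of the stored upper bound
```

With the data from this report, truncate_min keeps only the REPEAT('x',254) part and truncate_max ends in 'i'. A real implementation would have to take contractions and their successors from the collation itself, and would also have to check that the replacement still fits into the byte limit.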

An example

CREATE OR REPLACE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8, comment TEXT);
INSERT INTO t1 VALUES ('aa','This is MIN'), ('aë','This is MAX');

SELECT a,comment FROM t1 ORDER BY a;
+------+-------------+
| a    | comment     |
+------+-------------+
| aa   | This is MIN |
| aë   | This is MAX |
+------+-------------+
 
SELECT CASE WHEN a='aë' THEN 'a' ELSE a END AS a_2_byte,comment FROM t1 ORDER BY 1;
+----------+-------------+
| a_2_byte | comment     |
+----------+-------------+
| a        | This is MAX |
| aa       | This is MIN |
+----------+-------------+

The limit used is 2 bytes here, instead of 255, for simplicity.

Notice that the original MIN and MAX values are 2 bytes ('aa') and 3 bytes ('aë') respectively.
If we cut them to 2 bytes in a multi-byte safe way, we get:

  • 'aa' is still 'aa'
  • 'aë' becomes 'a', which makes it smaller than 'aa'

So when the values are cut in a multi-byte safe way, the min and max can change.
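
The same effect can be reproduced outside the server with a small Python sketch. It is a simplification under stated assumptions: cut_bytes is a hypothetical helper written for this illustration, and plain code point comparison stands in for the utf8 collation (which is enough here, because a prefix always sorts before the longer string).

```python
def cut_bytes(value, limit):
    """Multi-byte safe cut: keep whole characters while their UTF-8 bytes fit into `limit`."""
    used = 0
    for pos, ch in enumerate(value):
        used += len(ch.encode("utf-8"))
        if used > limit:
            return value[:pos]
    return value


# 'ë' (U+00EB) takes 2 bytes in UTF-8, so 'aë' is 3 bytes long.
original = ["aa", "a\u00eb"]
truncated = [cut_bytes(v, 2) for v in original]

print(min(original), max(original))    # bounds before the cut
print(min(truncated), max(truncated))  # bounds after the cut
```

After the cut, the value that used to be the maximum ('aë', now 'a') sorts below the former minimum 'aa', so the truncated pair no longer brackets the original data.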
