MariaDB Server / MDEV-34417

Wrong result set with utf8mb4_danish_ci and BNLH join

Details

    Description

      CREATE OR REPLACE TABLE t1 (a VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_danish_ci);
      INSERT INTO t1 VALUES ('aaaa'),('åå');
      SELECT * FROM t1 WHERE a='aaaa';
      

      +------+
      | a    |
      +------+
      | aaaa |
      | åå   |
      +------+
      

Looks good so far. It returns two records, because in the Danish collation 'å' is equal to 'aa'.

      Now I join the table to itself:

      SET join_cache_level=1;
      SELECT * FROM t1 NATURAL JOIN t1 t2;
      

      +------+
      | a    |
      +------+
      | aaaa |
      | åå   |
      | aaaa |
      | åå   |
      +------+
      

Looks good so far. Each of the two rows matches both rows of the other table instance, so the self-join returns four rows.

Now I enable the BNLH (Block Nested Loop Hash) join:

      SET join_cache_level=3;
      SELECT * FROM t1 NATURAL JOIN t1 t2;
      

      +------+
      | a    |
      +------+
      | aaaa |
      | åå   |
      +------+
      

Half of the records disappeared. Looks wrong: the result should not depend on join_cache_level, so the same four rows should be returned.
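To make the failure mode concrete, here is a minimal standalone sketch (illustrative C, not MariaDB code; the hash function and names are hypothetical). BNLH buckets rows by a hash of the join key, so matching can only work if keys that are equal under the collation produce equal hash values. A hash computed over raw key bytes, or over a miscalculated prefix of them, breaks that invariant for 'aaaa' vs 'åå':

#include <stdio.h>
#include <stdint.h>

/* Toy FNV-1a hash over raw bytes; stands in for a broken join-key hash. */
static uint32_t fnv1a(const unsigned char *s, size_t len)
{
  uint32_t h= 2166136261u;
  for (size_t i= 0; i < len; i++)
  {
    h^= s[i];
    h*= 16777619u;
  }
  return h;
}

int main(void)
{
  /* Both keys are 4 bytes long in utf8mb4 and equal under
     utf8mb4_danish_ci, but a byte-wise hash sends them to different
     buckets, so the hash join never even compares them. */
  const unsigned char aaaa[]= "aaaa";
  const unsigned char aa[]= "\xC3\xA5\xC3\xA5";   /* 'åå' in UTF-8 */
  printf("hash('aaaa') = %08x\n", fnv1a(aaaa, 4));
  printf("hash('åå')   = %08x\n", fnv1a(aa, 4));
  return 0;
}

A correct implementation hashes the collation sort weights (via hash_sort()), under which both strings produce identical values.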

Activity

psergei Sergei Petrunia added a comment - edited

As far as I understand, the code assumed that the length argument of CHARSET_INFO::hash_sort() was a number of characters, not bytes. It would be nice to add a comment to this function documenting that.
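A hedged sketch of the distinction (the hash_sort() signature below follows MY_COLLATION_HANDLER in m_ctype.h as far as I know; the calling function is hypothetical):

#include <my_global.h>   /* assuming the MariaDB source tree */
#include <m_ctype.h>

/* Hypothetical caller, for illustration: hash_sort() expects the key
   length in bytes (octets), not characters. */
static void hash_join_key(CHARSET_INFO *cs, const uchar *key,
                          size_t octet_len, ulong *nr1, ulong *nr2)
{
  /* Wrong (the assumption the buggy code made): converting the length
     from bytes to "characters". For 'åå' (4 bytes, cs->mbmaxlen == 4)
     this passes 1, so only the first byte of the key is hashed. */
  /* cs->coll->hash_sort(cs, key, octet_len / cs->mbmaxlen, nr1, nr2); */

  /* Right: pass the octet length as-is. */
  cs->coll->hash_sort(cs, key, octet_len, nr1, nr2);
}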

psergei Sergei Petrunia added a comment

The patch looks good to me.

bar Alexander Barkov added a comment

psergei, thanks for the review.

I think the code was copied from storage/heap/hp_hash.c, but incorrectly:

• key_hashnr() from hp_rb_make_key()
• key_buf_cmp() from hp_rec_key_cmp()

The original code divides seg->length (the defined byte length of the key segment) by mbmaxlen, while the code behind BNLH divided the actual octet length of the key by mbmaxlen, which was meaningless and wrong.

Perhaps we could share the code between BNLH and hp_hash.c somehow.
The question is where to put it...
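To spell out why that division is meaningless, here is an illustrative sketch (the function names are hypothetical, not the actual code from either place):

#include <stdio.h>

/* For utf8mb4, mbmaxlen == 4. 'aaaa' is 4 bytes / 4 characters;
   'åå' is 4 bytes / 2 characters (each 'å' is the two bytes C3 A5). */

/* The BNLH copy: dividing the ACTUAL octet length by mbmaxlen.
   4 / 4 == 1 for both sample keys, so only their first character
   would be considered, losing data and breaking the collation. */
static size_t wrong_char_limit(size_t octet_len, unsigned mbmaxlen)
{
  return octet_len / mbmaxlen;
}

/* The hp_hash.c original: dividing the DEFINED segment length, which
   reserves mbmaxlen bytes per character, so the division recovers the
   defined character count, e.g. 2000 / 4 == 500 for VARCHAR(500). */
static size_t defined_char_limit(size_t seg_length, unsigned mbmaxlen)
{
  return seg_length / mbmaxlen;
}

int main(void)
{
  printf("wrong:   %zu characters\n", wrong_char_limit(4, 4));      /* 1   */
  printf("defined: %zu characters\n", defined_char_limit(2000, 4)); /* 500 */
  return 0;
}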

bar Alexander Barkov added a comment

Tried to further improve the performance of my_hash_sort_utf8mb4_nopad() using this loop:

#define my_hash_add_ascii4_fast(nr, m1, m2) \
do \
{ \
  MY_HASH_ADD(m1, m2, (uchar) (nr)); \
  MY_HASH_ADD(m1, m2, (uchar) (nr >> 8)); \
  MY_HASH_ADD(m1, m2, (uchar) (nr >> 16)); \
  MY_HASH_ADD(m1, m2, (uchar) (nr >> 24)); \
} while(0)

...

/* Fast path: consume 4 bytes per iteration as long as all of them are ASCII */
for ( ; s + 4 < e ; s+= 4)
{
  ulonglong nr= uint4korr(s);              /* load the next 4 bytes */
  if (nr & 0x80808080)                     /* a non-ASCII byte: leave the fast path */
    break;
  nr= my_ascii_to_upper_magic_uint64(nr);  /* branchless ASCII toupper */
  my_hash_add_ascii4_fast((uint32) nr, m1, m2);
}
            

This gives an improvement of about 5-10%, but introduces a hash incompatibility with older versions, because it does not hash the high byte (which is 0 in this case).
This improvement can be used only if we introduce a new function in MY_COLLATION_HANDLER, returning a hash that is used in memory only and does not affect on-disk data (e.g. it does not affect long uniques or partition distribution).

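For reference, a self-contained sketch of the bit trick that a function like my_ascii_to_upper_magic_uint64() performs (this standalone version is illustrative only, not the MariaDB implementation, and assumes the caller has already verified that no byte has the high bit set, as the 0x80808080 check above does):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Branchless SWAR upper-casing of 8 ASCII bytes: detect bytes in the
   range 'a'..'z' using the high bit of per-byte additions, then clear
   the 0x20 bit only for those bytes. */
static uint64_t ascii_to_upper8(uint64_t x)
{
  uint64_t ge_a= (x + 0x1f1f1f1f1f1f1f1fULL) & 0x8080808080808080ULL;   /* byte >= 'a' */
  uint64_t le_z= (~(x + 0x0505050505050505ULL)) & 0x8080808080808080ULL; /* byte <= 'z' */
  return x & ~((ge_a & le_z) >> 2);   /* 0x80 >> 2 == 0x20 per matching byte */
}

int main(void)
{
  char buf[9]= "AbCdXyZ!";
  uint64_t v;
  memcpy(&v, buf, 8);
  v= ascii_to_upper8(v);
  memcpy(buf, &v, 8);
  printf("%s\n", buf);   /* prints ABCDXYZ! */
  return 0;
}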

People

  Assignee: bar Alexander Barkov
  Reporter: bar Alexander Barkov
  Votes: 0
  Watchers: 3

