MariaDB Server / MDEV-34417

Wrong result set with utf8mb4_danish_ci and BNLH join

Details

    Description

      CREATE OR REPLACE TABLE t1 (a VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_danish_ci);
      INSERT INTO t1 VALUES ('aaaa'),('åå');
      SELECT * FROM t1 WHERE a='aaaa';
      

      +------+
      | a    |
      +------+
      | aaaa |
      | åå   |
      +------+
      

Looks good so far. It returns two records, because in the Danish collation 'å' is equal to 'aa'.

      Now I join the table to itself:

      SET join_cache_level=1;
      SELECT * FROM t1 NATURAL JOIN t1 t2;
      

      +------+
      | a    |
      +------+
      | aaaa |
      | åå   |
      | aaaa |
      | åå   |
      +------+
      

Looks good so far. Each of the two rows matches both rows of the other table instance, so the self-join returns four rows.

Now I enable the BNLH (Block Nested Loop Hash) join:

      SET join_cache_level=3;
      SELECT * FROM t1 NATURAL JOIN t1 t2;
      

      +------+
      | a    |
      +------+
      | aaaa |
      | åå   |
      +------+
      

Half of the records disappeared. Looks wrong: the result should not depend on join_cache_level, so the same four rows should be returned.
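To make the failure mode concrete, here is a minimal standalone sketch (illustrative C, not MariaDB code; the hash function and names are hypothetical). BNLH buckets rows by a hash of the join key, so matching can only work if keys that are equal under the collation produce equal hash values. A hash computed over raw key bytes, or over a miscalculated prefix of them, breaks that invariant for 'aaaa' vs 'åå':

#include <stdio.h>
#include <stdint.h>

/* Toy FNV-1a hash over raw bytes; stands in for a broken join-key hash. */
static uint32_t fnv1a(const unsigned char *s, size_t len)
{
  uint32_t h= 2166136261u;
  for (size_t i= 0; i < len; i++)
  {
    h^= s[i];
    h*= 16777619u;
  }
  return h;
}

int main(void)
{
  /* Both keys are 4 bytes long in utf8mb4 and equal under
     utf8mb4_danish_ci, but a byte-wise hash sends them to different
     buckets, so the hash join never even compares them. */
  const unsigned char aaaa[]= "aaaa";
  const unsigned char aa[]= "\xC3\xA5\xC3\xA5";   /* 'åå' in UTF-8 */
  printf("hash('aaaa') = %08x\n", fnv1a(aaaa, 4));
  printf("hash('åå')   = %08x\n", fnv1a(aa, 4));
  return 0;
}

A correct implementation hashes the collation sort weights (via hash_sort()), under which both strings produce identical values.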

Activity

psergei Sergei Petrunia added a comment - edited

As far as I understand, the code assumed that the length argument of CHARSET_INFO::hash_sort() was a number of characters, not bytes. It would be nice to add a comment to this function documenting that.
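A hedged sketch of the distinction (the hash_sort() signature below follows MY_COLLATION_HANDLER in m_ctype.h as far as I know; the calling function is hypothetical):

#include <my_global.h>   /* assuming the MariaDB source tree */
#include <m_ctype.h>

/* Hypothetical caller, for illustration: hash_sort() expects the key
   length in bytes (octets), not characters. */
static void hash_join_key(CHARSET_INFO *cs, const uchar *key,
                          size_t octet_len, ulong *nr1, ulong *nr2)
{
  /* Wrong (the assumption the buggy code made): converting the length
     from bytes to "characters". For 'åå' (4 bytes, cs->mbmaxlen == 4)
     this passes 1, so only the first byte of the key is hashed. */
  /* cs->coll->hash_sort(cs, key, octet_len / cs->mbmaxlen, nr1, nr2); */

  /* Right: pass the octet length as-is. */
  cs->coll->hash_sort(cs, key, octet_len, nr1, nr2);
}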

psergei Sergei Petrunia added a comment

The patch looks good to me.

bar Alexander Barkov added a comment

psergei, thanks for the review.

I think the code was copied from storage/heap/hp_hash.c, but incorrectly:

• key_hashnr() from hp_rb_make_key()
• key_buf_cmp() from hp_rec_key_cmp()

The original code divides seg->length (the defined byte length of the key segment) by mbmaxlen, while the code behind BNLH divided the actual octet length of the key by mbmaxlen, which was meaningless and wrong.

Perhaps we could share the code between BNLH and hp_hash.c somehow.
The question is where to put it...
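To spell out why that division is meaningless, here is an illustrative sketch (the function names are hypothetical, not the actual code from either place):

#include <stdio.h>

/* For utf8mb4, mbmaxlen == 4. 'aaaa' is 4 bytes / 4 characters;
   'åå' is 4 bytes / 2 characters (each 'å' is the two bytes C3 A5). */

/* The BNLH copy: dividing the ACTUAL octet length by mbmaxlen.
   4 / 4 == 1 for both sample keys, so only their first character
   would be considered, losing data and breaking the collation. */
static size_t wrong_char_limit(size_t octet_len, unsigned mbmaxlen)
{
  return octet_len / mbmaxlen;
}

/* The hp_hash.c original: dividing the DEFINED segment length, which
   reserves mbmaxlen bytes per character, so the division recovers the
   defined character count, e.g. 2000 / 4 == 500 for VARCHAR(500). */
static size_t defined_char_limit(size_t seg_length, unsigned mbmaxlen)
{
  return seg_length / mbmaxlen;
}

int main(void)
{
  printf("wrong:   %zu characters\n", wrong_char_limit(4, 4));      /* 1   */
  printf("defined: %zu characters\n", defined_char_limit(2000, 4)); /* 500 */
  return 0;
}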

bar Alexander Barkov added a comment

Tried to further improve the performance of my_hash_sort_utf8mb4_nopad() using this loop:

#define my_hash_add_ascii4_fast(nr, m1, m2) \
do \
{ \
  MY_HASH_ADD(m1, m2, (uchar) (nr)); \
  MY_HASH_ADD(m1, m2, (uchar) (nr >> 8)); \
  MY_HASH_ADD(m1, m2, (uchar) (nr >> 16)); \
  MY_HASH_ADD(m1, m2, (uchar) (nr >> 24)); \
} while(0)

...

/* Fast path: consume 4 bytes per iteration as long as all of them are ASCII */
for ( ; s + 4 < e ; s+= 4)
{
  ulonglong nr= uint4korr(s);              /* load the next 4 bytes */
  if (nr & 0x80808080)                     /* a non-ASCII byte: leave the fast path */
    break;
  nr= my_ascii_to_upper_magic_uint64(nr);  /* branchless ASCII toupper */
  my_hash_add_ascii4_fast((uint32) nr, m1, m2);
}
            

This gives an improvement of about 5-10%, but introduces a hash incompatibility with older versions, because it does not hash the high byte (which is 0 in this case).
This improvement can be used only if we introduce a new function in MY_COLLATION_HANDLER, returning a hash that is used in memory only and does not affect on-disk data (e.g. it does not affect long uniques or partition distribution).

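For reference, a self-contained sketch of the bit trick that a function like my_ascii_to_upper_magic_uint64() performs (this standalone version is illustrative only, not the MariaDB implementation, and assumes the caller has already verified that no byte has the high bit set, as the 0x80808080 check above does):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Branchless SWAR upper-casing of 8 ASCII bytes: detect bytes in the
   range 'a'..'z' using the high bit of per-byte additions, then clear
   the 0x20 bit only for those bytes. */
static uint64_t ascii_to_upper8(uint64_t x)
{
  uint64_t ge_a= (x + 0x1f1f1f1f1f1f1f1fULL) & 0x8080808080808080ULL;   /* byte >= 'a' */
  uint64_t le_z= (~(x + 0x0505050505050505ULL)) & 0x8080808080808080ULL; /* byte <= 'z' */
  return x & ~((ge_a & le_z) >> 2);   /* 0x80 >> 2 == 0x20 per matching byte */
}

int main(void)
{
  char buf[9]= "AbCdXyZ!";
  uint64_t v;
  memcpy(&v, buf, 8);
  v= ascii_to_upper8(v);
  memcpy(buf, &v, 8);
  printf("%s\n", buf);   /* prints ABCDXYZ! */
  return 0;
}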

People

  Assignee: bar Alexander Barkov
  Reporter: bar Alexander Barkov
  Votes: 0
  Watchers: 3

