  1. MariaDB Server
  2. MDEV-6353

my_ismbchar() and my_mbcharlen() refactoring

    Details

    • Sprint:
      10.2.0-6, 10.2.0-7, 10.2.0-9, 10.2.0-10, 10.2.0-11, 10.2.1-1, 10.2.1-2

      Description

      Currently MY_CHARSET_HANDLER has two functions to handle
      multi-byte character lengths:

        uint    (*ismbchar)(CHARSET_INFO *, const char *, const char *);
        uint    (*mbcharlen)(CHARSET_INFO *, uint c);
      

      This API is not flexible enough.

      1. Invalid byte sequences cannot be detected:

      mbcharlen() reports invalid bytes as characters with length=1.
      ismbchar() returns 0 both for single-byte characters and for invalid bytes.

      It's not possible to detect invalid byte sequences using this API,
      which makes it challenging to fix bugs like this:

      MDEV-6218 Wrong result of CHAR_LENGTH(non-BMP-character) with 3-byte utf8

      2. The first byte is not always enough to determine a character length.
      For example, in gb18030, 0xFEFE is a two-byte character, while
      0xFE308130 is a four-byte character. Note that both start with 0xFE.
      Calling mbcharlen(cs, first_byte) is useless with gb18030,
      because at least two bytes are needed to make a decision for 0xFE??.

      This API should be changed into a single function:

        int (*charlen)(CHARSET_INFO *cs, const char *str, const char *strend);
      

      For performance purposes, the caller must supply a string consisting of
      at least one byte. Non-zero length will be asserted:

         DBUG_ASSERT(str <  strend);
      

      The function will return the same codes that mb_wc() does:

      1. Positive numbers on success:

      1a. 1 in case of a one-byte character found starting at "str"
      1b. 2 in case of a two-byte character
      1c. 3 in case of a three-byte character
      1d. 4 in case of a four-byte character
      1e. 5 in case of a five-byte character

      2. Non-positive numbers on error:
      2a. MY_CS_ILSEQ (0) if an invalid byte sequence is encountered
      2b. MY_CS_TOOSMALL (-101) if the supplied string is too short
      to detect a character length. The caller must append one more byte to the
      end of the tested string to make detection of the leading character
      length possible.
      2c. MY_CS_TOOSMALL2...MY_CS_TOOSMALL4 (-102..-104).
      The same meaning as the previous one,
      but the caller must supply two..four more bytes.
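
      The return codes above can be consumed by a caller roughly as follows.
      This is only a sketch: DEMO_CS, demo_charlen() and demo_count_chars()
      are hypothetical names, and the toy charset is an assumption made for
      illustration, not a real MariaDB character set. assert() stands in for
      DBUG_ASSERT().

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as listed in the description. */
#define MY_CS_ILSEQ     0
#define MY_CS_TOOSMALL  (-101)

/* Hypothetical opaque stand-in for the server's CHARSET_INFO. */
typedef struct demo_charset DEMO_CS;

typedef int (*demo_charlen_fn)(const DEMO_CS *cs,
                               const char *str, const char *strend);

/*
  Toy charset (an assumption): 0x00..0x7F are single-byte characters;
  0x81..0xFE followed by 0x40..0xFE is a two-byte character.
*/
static int demo_charlen(const DEMO_CS *cs,
                        const char *str, const char *strend)
{
  const unsigned char *s= (const unsigned char *) str;
  (void) cs;
  assert(str < strend);            /* the caller supplies at least one byte */
  if (s[0] < 0x80)
    return 1;
  if (s[0] < 0x81 || s[0] > 0xFE)
    return MY_CS_ILSEQ;
  if (strend - str < 2)
    return MY_CS_TOOSMALL;         /* one more byte is needed */
  return (s[1] >= 0x40 && s[1] <= 0xFE) ? 2 : MY_CS_ILSEQ;
}

/*
  Count the complete characters at the start of [str, strend).
  *status is set to 1 if the whole buffer was consumed, otherwise to the
  non-positive code that stopped the scan.
*/
static size_t demo_count_chars(demo_charlen_fn charlen, const DEMO_CS *cs,
                               const char *str, const char *strend,
                               int *status)
{
  size_t n= 0;
  *status= 1;
  while (str < strend)
  {
    int rc= charlen(cs, str, strend);
    if (rc <= 0)                   /* MY_CS_ILSEQ or MY_CS_TOOSMALL* */
    {
      *status= rc;
      break;
    }
    str+= rc;
    n++;
  }
  return n;
}
```

      With this shape the caller can tell a truncated tail (negative status)
      from an invalid byte (status 0), which the old pair of functions could
      not express.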

      Note that the function will ask the caller for as few additional bytes as possible.
      For example, for the character set gb18030 (which is not in MariaDB yet):

      a. charlen(0xFE) will return MY_CS_TOOSMALL, asking for one more byte only.
      Although 0xFE can start both 2-byte and 4-byte sequences,
      charlen(0xFE) will ask for only one more byte;
      it will not ask for 3 more bytes immediately. This gives the caller
      a chance to read a 2-byte character from the incoming stream
      without having to read extra bytes when they are not really necessary.

      b. charlen(0xFEFE) will return 2, meaning a two-byte character.

      c. charlen(0xFE30) will return MY_CS_TOOSMALL2, asking for two more bytes.

      d. charlen(0xFE3081) will return MY_CS_TOOSMALL, asking for one more byte.

      e. charlen(0xFE308130) will return 4, meaning a four-byte character.
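
      Cases a..e can be reproduced with a small sketch of a gb18030-style
      length function. This is not the eventual MariaDB handler:
      gb18030_charlen_sketch() is a hypothetical name, the CHARSET_INFO
      argument is omitted for brevity, and the byte ranges are taken from the
      GB18030 encoding layout (two-byte: 0x81..0xFE, then 0x40..0x7E or
      0x80..0xFE; four-byte: 0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39).

```c
#include <assert.h>

/* Return codes as listed in the description. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL2  (-102)

/* Byte classes of the GB18030 layout. */
#define IS_GB_HEAD(c)   ((c) >= 0x81 && (c) <= 0xFE)
#define IS_GB_TAIL2(c)  (((c) >= 0x40 && (c) <= 0x7E) || \
                         ((c) >= 0x80 && (c) <= 0xFE))
#define IS_GB_DIGIT(c)  ((c) >= 0x30 && (c) <= 0x39)

/* Sketch only: the real handler would also take a CHARSET_INFO pointer. */
static int gb18030_charlen_sketch(const char *str, const char *strend)
{
  const unsigned char *s= (const unsigned char *) str;
  long avail= (long) (strend - str);
  assert(str < strend);            /* stands in for DBUG_ASSERT */

  if (s[0] < 0x80)
    return 1;                      /* single-byte character */
  if (!IS_GB_HEAD(s[0]))
    return MY_CS_ILSEQ;
  if (avail < 2)
    return MY_CS_TOOSMALL;         /* case a: one more byte decides 2 vs 4 */
  if (IS_GB_TAIL2(s[1]))
    return 2;                      /* case b */
  if (!IS_GB_DIGIT(s[1]))
    return MY_CS_ILSEQ;
  if (avail < 3)
    return MY_CS_TOOSMALL2;        /* case c: two more bytes are needed */
  if (!IS_GB_HEAD(s[2]))
    return MY_CS_ILSEQ;
  if (avail < 4)
    return MY_CS_TOOSMALL;         /* case d: one more byte is needed */
  return IS_GB_DIGIT(s[3]) ? 4 : MY_CS_ILSEQ;   /* case e */
}
```

      Each check looks only at the bytes already supplied, so the function
      never requests more bytes than the shortest character that could still
      match.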

      Note that the affected charset handler functions are currently rarely called directly:

        cs->cset->ismbchar(cs, str, strend);
        cs->cset->mbcharlen(cs, ch);
      

      Instead, they are used through the macros my_mbcharlen() and my_ismbchar():

      #define my_ismbchar(s, a, b)          ((s)->cset->ismbchar((s), (a), (b)))
      #define my_mbcharlen(s, a)            ((s)->cset->mbcharlen((s),(a)))
      

      This is very fortunate.
      To avoid major code changes, we'll rewrite these macros
      (either into new macros, or into functions) to mimic the old API.

      my_ismbchar(cs, a, b) will call cs->cset->charlen(), but then will
      change the return value as follows:

      uint my_ismbchar(CHARSET_INFO *cs, const char *str, const char *strend)
      {
        int rc= cs->cset->charlen(cs, str, strend);
        return (uint) (rc > 1 ? rc : 0);
      }
      

      This is to return 0 for all cases where the old my_ismbchar() macro
      returned 0:

      • single byte characters (return value 1)
      • invalid byte sequences (return value 0)
      • too short strings (negative return values MY_CS_TOOSMALL*)

      my_mbcharlen(cs, ch) will return 1 for single byte characters and for
      unknown multi-byte heads. It will also convert MY_CS_TOOSMALL to 2,
      MY_CS_TOOSMALL2 to 3 and so on.
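
      One possible shape for the my_mbcharlen() replacement is sketched
      below. It is an illustration under assumptions, not the final code:
      DEMO_CHARSET_INFO is a simplified stand-in that holds charlen directly
      (the server reaches it through cs->cset), and demo_charlen() is a toy
      handler that does not validate trailing bytes.

```c
#include <assert.h>

/* Return codes as listed in the description. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL2  (-102)

/* Simplified, hypothetical stand-in for CHARSET_INFO. */
typedef struct demo_charset_info
{
  int (*charlen)(const struct demo_charset_info *cs,
                 const char *str, const char *strend);
} DEMO_CHARSET_INFO;

/*
  Toy handler (an assumption): 0x8F starts a three-byte character,
  0xA1..0xFE start a two-byte character, 0x00..0x7F are single bytes.
*/
static int demo_charlen(const DEMO_CHARSET_INFO *cs,
                        const char *str, const char *strend)
{
  unsigned char c;
  long avail= (long) (strend - str);
  (void) cs;
  assert(str < strend);
  c= (unsigned char) str[0];
  if (c < 0x80)
    return 1;
  if (c == 0x8F)
    return avail >= 3 ? 3 : (avail == 2 ? MY_CS_TOOSMALL : MY_CS_TOOSMALL2);
  if (c >= 0xA1 && c <= 0xFE)
    return avail >= 2 ? 2 : MY_CS_TOOSMALL;
  return MY_CS_ILSEQ;
}

static const DEMO_CHARSET_INFO demo_cs= { demo_charlen };

/* Mimic the old my_mbcharlen(cs, ch) on top of charlen(). */
static unsigned int demo_my_mbcharlen(const DEMO_CHARSET_INFO *cs,
                                      unsigned int ch)
{
  char buf[1];
  int rc;
  buf[0]= (char) ch;
  rc= cs->charlen(cs, buf, buf + 1);
  if (rc > 0)
    return (unsigned int) rc;      /* a one-byte input can only yield 1 */
  if (rc == MY_CS_ILSEQ)
    return 1;                      /* unknown head: length 1, as before */
  /* MY_CS_TOOSMALL -> 2, MY_CS_TOOSMALL2 -> 3, and so on */
  return (unsigned int) (MY_CS_TOOSMALL - rc) + 2;
}
```

      For example, demo_my_mbcharlen(&demo_cs, 0x8F) yields 3, because the
      toy handler answers MY_CS_TOOSMALL2 for that single byte.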

      Later, when adding gb18030, we will replace some calls to my_ismbchar() and my_mbcharlen()
      with direct calls to cs->cset->charlen(), in the places where it is important.

              People

              • Assignee:
                bar Alexander Barkov
                Reporter:
                bar Alexander Barkov