Currently MY_CHARSET_HANDLER has two functions to handle
multi-byte character lengths, mbcharlen() and ismbchar().
This API is not flexible enough.
1. Problems detecting invalid byte sequences:
mbcharlen() reports invalid bytes as characters with length=1.
ismbchar() returns 0 both for single byte character and for invalid bytes.
It's not possible to detect invalid byte sequences using this API,
which makes it challenging to fix bugs like this:
MDEV-6218 Wrong result of CHAR_LENGTH(non-BMP-character) with 3-byte utf8
2. The first byte is not always enough to detect a character length.
For example, in gb18030, 0xFEFE is a two-byte character, while
0xFE308130 is a four-byte character. Notice, both start with 0xFE.
Using mbcharlen(cs, first_byte) is useless in combination with gb18030,
because at least two bytes are needed to make a decision for 0xFE??.
This API should be changed into a single function, charlen().
For performance purposes, the caller must supply a string consisting of
at least one byte. Non-zero length will be asserted.
The function will return the same codes that mb_wc() does:
1. Positive numbers on success:
1a. 1 in case of a one-byte character found starting at "str"
1b. 2 in case of a two-byte character
1c. 3 in case of a three-byte character
1d. 4 in case of a four-byte character
1e. 5 in case of a five-byte character
2. Non-positive number on error:
2a. MY_CS_ILSEQ (0) if an invalid byte sequence is encountered
2b. MY_CS_TOOSMALL (-101) if the supplied string is too short
to detect a character length. The caller must append one more byte to the
end of the tested string to make detection of the leading character possible.
2c. MY_CS_TOOSMALL2...MY_CS_TOOSMALL4 (-102..-104).
The same meaning as the previous one,
but the caller must supply two..four more bytes.
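As an illustration of how a caller might interpret these return codes, here is a minimal sketch. The helper name more_bytes_needed() is hypothetical and not part of the proposed API. The numeric values are the ones listed above. The real definitions live in the server headers.

```c
/* Return codes as listed in the text above. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL4  (-104)

/*
  Hypothetical helper: translate a charlen() return code into the number
  of extra bytes the caller should append before retrying.
  Returns 0 when a complete character was found (rc > 0),
  1..4 for MY_CS_TOOSMALL..MY_CS_TOOSMALL4,
  and -1 for MY_CS_ILSEQ or any other unexpected code.
*/
static int more_bytes_needed(int rc)
{
  if (rc > 0)
    return 0;
  if (rc <= MY_CS_TOOSMALL && rc >= MY_CS_TOOSMALL4)
    return MY_CS_TOOSMALL - rc + 1;
  return -1;
}
```

A stream reader could call charlen(), pass the result to such a helper, read that many bytes, and retry, treating -1 as a hard error.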
Note, the function will ask the caller for as few more bytes as possible.
For example, for the character set gb18030 (which is not in MariaDB yet):
a. charlen(0xFE) will return MY_CS_TOOSMALL, asking for one more byte only.
Note, although 0xFE can start both 2-byte and 4-byte sequences,
charlen(0xFE) will ask for only one more byte;
it will not ask for 3 more bytes immediately. This is to give a chance
to the caller to read a 2-byte character from the incoming stream
without having to read extra bytes when they are not really necessary.
b. charlen(0xFEFE) will return 2, meaning a two-byte character.
c. charlen(0xFE30) will return MY_CS_TOOSMALL2, asking for two more bytes.
d. charlen(0xFE3081) will return MY_CS_TOOSMALL, asking for one more byte.
e. charlen(0xFE308130) will return 4, meaning a four-byte character.
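The gb18030 cases above can be reproduced with a small stand-alone sketch. The function name gb18030_charlen_sketch() and its (str, end) signature are assumptions made for illustration, not the real handler. The byte ranges follow the commonly documented gb18030 layout (second byte 0x40..0x7E or 0x80..0xFE for two-byte characters, 0x30..0x39 for four-byte characters). Treating 0x80 and 0xFF as invalid lead bytes is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as listed in the text above. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL2  (-102)

/* Hypothetical sketch: length of the gb18030 character starting at str. */
static int gb18030_charlen_sketch(const unsigned char *str,
                                  const unsigned char *end)
{
  size_t len;
  assert(str < end);                  /* non-zero length is required */
  len= (size_t) (end - str);

  if (str[0] < 0x80)
    return 1;                         /* single-byte character */
  if (str[0] < 0x81 || str[0] > 0xFE)
    return MY_CS_ILSEQ;               /* assumed invalid lead byte */
  if (len < 2)
    return MY_CS_TOOSMALL;            /* ask for one more byte only */

  if ((str[1] >= 0x40 && str[1] <= 0x7E) ||
      (str[1] >= 0x80 && str[1] <= 0xFE))
    return 2;                         /* two-byte character */
  if (str[1] < 0x30 || str[1] > 0x39)
    return MY_CS_ILSEQ;               /* neither a 2-byte nor a 4-byte form */
  if (len < 3)
    return MY_CS_TOOSMALL2;           /* four-byte form: two more needed */

  if (str[2] < 0x81 || str[2] > 0xFE)
    return MY_CS_ILSEQ;
  if (len < 4)
    return MY_CS_TOOSMALL;            /* one more byte needed */

  if (str[3] < 0x30 || str[3] > 0x39)
    return MY_CS_ILSEQ;
  return 4;                           /* four-byte character */
}
```

Each check looks only at the bytes already supplied, so the function never requests more bytes than the shortest sequence that could still be valid.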
Note that the affected charset handler functions are currently almost never used directly.
Instead, they are used through the macros my_mbcharlen() and my_ismbchar().
This is very fortunate.
To avoid major code changes, we'll rewrite these macros
(either into new macros, or into functions) to mimic the old API.
my_ismbchar(cs, a, b) will call cs->cset->charlen(), but then will
change the return value as follows: values less than or equal to 1 are
converted to 0, while values of 2 or more are returned unchanged.
This is to return 0 for all cases where the old macro my_ismbchar()
returned 0:
- single byte characters (return value 1)
- invalid byte sequences (return value 0)
- too short strings (negative return values MY_CS_TOOSMALL*)
my_mbcharlen(cs, ch) will return 1 for single byte characters and for
unknown multi-byte heads. It will also convert MY_CS_TOOSMALL to 2,
MY_CS_TOOSMALL2 to 3 and so on.
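The two compatibility mappings could look roughly like the sketch below. The names ismbchar_compat() and mbcharlen_compat(), the charlen_fn pointer type, and the toy_charlen() detector (a simplified UTF-8-like charset limited to sequences of 1 to 3 bytes) are all hypothetical and exist only to make the example self-contained. The real macros would go through cs->cset->charlen().

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as listed in the text above. */
#define MY_CS_ILSEQ     0
#define MY_CS_TOOSMALL  (-101)

/* Assumed shape of the new handler function, for illustration only. */
typedef int (*charlen_fn)(const unsigned char *str, const unsigned char *end);

/* Toy UTF-8-like detector handling 1..3 byte sequences. */
static int toy_charlen(const unsigned char *str, const unsigned char *end)
{
  size_t have, want, i;
  assert(str < end);
  have= (size_t) (end - str);
  if (str[0] < 0x80)
    return 1;
  else if (str[0] >= 0xC2 && str[0] <= 0xDF)
    want= 2;
  else if (str[0] >= 0xE0 && str[0] <= 0xEF)
    want= 3;
  else
    return MY_CS_ILSEQ;
  /* Validate the continuation bytes that are already available. */
  for (i= 1; i < have && i < want; i++)
    if ((str[i] & 0xC0) != 0x80)
      return MY_CS_ILSEQ;
  if (have < want)                    /* -101 for one missing byte, -102 for two */
    return MY_CS_TOOSMALL - (int) (want - have - 1);
  return (int) want;
}

/* Old my_ismbchar() semantics: 0 unless a complete multi-byte character. */
static unsigned int ismbchar_compat(charlen_fn charlen,
                                    const unsigned char *a,
                                    const unsigned char *b)
{
  int rc= charlen(a, b);
  return rc > 1 ? (unsigned int) rc : 0;
}

/* Old my_mbcharlen() semantics: guess the length from the first byte. */
static unsigned int mbcharlen_compat(charlen_fn charlen, unsigned char ch)
{
  int rc= charlen(&ch, &ch + 1);
  if (rc <= MY_CS_TOOSMALL)           /* TOOSMALL -> 2, TOOSMALL2 -> 3, ... */
    return (unsigned int) (MY_CS_TOOSMALL - rc) + 2;
  return 1;                           /* single byte or unknown head */
}
```

With only one byte supplied, charlen() can return 1, MY_CS_ILSEQ, or one of the MY_CS_TOOSMALL* codes, which is why mbcharlen_compat() needs just two branches.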
Later, when adding gb18030, we will replace some calls to my_ismbchar() and my_mbcharlen()
with direct calls to cs->cset->charlen(), in the places where it is important.