[MDEV-6353] my_ismbchar() and my_mbcharlen() refactoring - Jira

Details

Type: Task
Status: Closed
Priority: Minor
Resolution: Fixed
Fix Version/s: 10.2.1
Component/s: Character Sets
Labels:
- refactoring

Sprint:
10.2.0-6, 10.2.0-7, 10.2.0-9, 10.2.0-10, 10.2.0-11, 10.2.1-1, 10.2.1-2

Description

Currently MY_CHARSET_HANDLER has two functions to handle
multi-byte character lengths:

  uint    (*ismbchar)(CHARSET_INFO *, const char *, const char *);

  uint    (*mbcharlen)(CHARSET_INFO *, uint c);

This API is not flexible enough.

1. Invalid byte sequences cannot be detected:

mbcharlen() reports invalid bytes as characters with length=1.
ismbchar() returns 0 both for single-byte characters and for invalid bytes.

It's not possible to detect invalid byte sequences using this API,
which makes it challenging to fix bugs like this:

MDEV-6218 Wrong result of CHAR_LENGTH(non-BMP-character) with 3-byte utf8

2. The first byte is not always enough to detect a character length.
For example, in gb18030, 0xFEFE is a two-byte character, while
0xFE308130 is a four-byte character. Notice, both start with 0xFE.
Using mbcharlen(cs, first_byte) is useless in combination with gb18030,
because at least two bytes are needed to make a decision for 0xFE??.

This API should be changed into a single function:

  int (*charlen)(CHARSET_INFO *cs, const char *str, const char *strend);

For performance purposes, the caller must supply a string consisting of
at least one byte. Non-zero length will be asserted:

   DBUG_ASSERT(str <  strend);

The function will return the same codes that mb_wc() does:

1. Positive numbers on success:

1a. 1 in case of a one-byte character found starting at "str"
1b. 2 in case of a two-byte character
1c. 3 in case of a three-byte character
1d. 4 in case of a four-byte character
1e. 5 in case of a five-byte character

2. Non-positive number on error:
2a. MY_CS_ILSEQ (0) if an invalid byte sequence is found
2b. MY_CS_TOOSMALL (-101) if the supplied string is too short
to detect a character length. The caller must append one more byte to the
end of the tested string to make detection of the leading character
length possible.
2c. MY_CS_TOOSMALL2...MY_CS_TOOSMALL4 (-102..-104).
The same meaning as the previous one,
but the caller must supply two..four more bytes.
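
The return codes above suggest a caller loop along the following lines. This is only a minimal sketch: charlen_fn, toy_charlen(), scan_result and scan_buffer() are hypothetical names invented for illustration, the CHARSET_INFO argument is omitted, and the toy two-byte charset is not a real MariaDB character set. The MY_CS_* values are the ones listed in this description.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as listed in the description. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL2  (-102)
#define MY_CS_TOOSMALL3  (-103)
#define MY_CS_TOOSMALL4  (-104)

/* Hypothetical stand-in for cs->cset->charlen(), without the cs argument. */
typedef int (*charlen_fn)(const char *str, const char *strend);

/*
  Toy charset, for illustration only: bytes below 0x80 are single-byte
  characters; a head byte 0x81..0xFE followed by a tail byte 0x40..0xFE
  is a two-byte character; everything else is invalid.
*/
static int toy_charlen(const char *str, const char *strend)
{
  const unsigned char *s= (const unsigned char *) str;
  assert(str < strend);               /* the caller supplies at least one byte */
  if (s[0] < 0x80)
    return 1;
  if (s[0] < 0x81 || s[0] > 0xFE)
    return MY_CS_ILSEQ;
  if (strend - str < 2)
    return MY_CS_TOOSMALL;            /* one more byte is needed */
  return (s[1] >= 0x40 && s[1] <= 0xFE) ? 2 : MY_CS_ILSEQ;
}

typedef struct
{
  size_t chars;          /* complete, valid characters seen */
  size_t bad_bytes;      /* bytes skipped after MY_CS_ILSEQ */
  size_t pending_bytes;  /* bytes of an incomplete trailing character */
  size_t need;           /* how many more bytes charlen asked for */
} scan_result;

/* Walk a buffer and handle all three kinds of return value. */
static scan_result scan_buffer(charlen_fn charlen,
                               const char *str, const char *strend)
{
  scan_result res= {0, 0, 0, 0};
  while (str < strend)
  {
    int rc= charlen(str, strend);
    if (rc > 0)
    {
      res.chars++;                    /* valid character of rc bytes */
      str+= rc;
    }
    else if (rc == MY_CS_ILSEQ)
    {
      res.bad_bytes++;                /* skip one invalid byte and resync */
      str++;
    }
    else
    {
      /* MY_CS_TOOSMALL..MY_CS_TOOSMALL4: -101 means 1 byte, -102 means 2, ... */
      res.pending_bytes= (size_t) (strend - str);
      res.need= (size_t) (MY_CS_TOOSMALL - rc) + 1;
      break;
    }
  }
  return res;
}
```

Skipping a single byte after MY_CS_ILSEQ is just one possible recovery policy; a real caller might instead raise an error or substitute a replacement character.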

Note, the function will ask the caller for as few more bytes as possible.
For example, for the character set gb18030 (which is not in MariaDB yet):

a. charlen(0xFE) will return MY_CS_TOOSMALL, asking for one more byte only.
Note, although 0xFE can start both 2-byte and 4-byte sequences,
charlen(0xFE) will ask for only one more byte,
it will not ask for 3 more bytes immediately. This is to give a chance
to the caller to read a 2-byte character from the incoming stream
without having to read extra bytes when they are not really necessary.

b. charlen(0xFEFE) will return 2, meaning a two-byte character.

c. charlen(0xFE30) will return MY_CS_TOOSMALL2, asking for two more bytes.

d. charlen(0xFE3081) will return MY_CS_TOOSMALL, asking for one more byte.

e. charlen(0xFE308130) will return 4, meaning a four-byte character.
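
The cases a..e can be reproduced with a small length detector. The sketch below is a simplified illustration, not the MariaDB implementation: gb18030_charlen_sketch() and its helpers are hypothetical names, the CHARSET_INFO argument is omitted, and only the general gb18030 byte ranges are checked (two-byte: head 0x81..0xFE, tail 0x40..0x7E or 0x80..0xFE; four-byte: head 0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39). It does not check whether a well-formed sequence maps to an assigned code point.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as listed in the description. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL2  (-102)

/* Byte classes of gb18030 (general structure only). */
static int is_head(unsigned char c)   { return c >= 0x81 && c <= 0xFE; }
static int is_digit(unsigned char c)  { return c >= 0x30 && c <= 0x39; }
static int is_tail2(unsigned char c)
{
  return (c >= 0x40 && c <= 0x7E) || (c >= 0x80 && c <= 0xFE);
}

/* Hypothetical charlen for gb18030, asking for as few bytes as possible. */
static int gb18030_charlen_sketch(const char *str, const char *strend)
{
  const unsigned char *s= (const unsigned char *) str;
  size_t len;
  assert(str < strend);               /* non-zero length is required */
  len= (size_t) (strend - str);

  if (s[0] < 0x80)
    return 1;                         /* single-byte character */
  if (!is_head(s[0]))
    return MY_CS_ILSEQ;
  if (len < 2)
    return MY_CS_TOOSMALL;            /* could still be 2 or 4 bytes */
  if (is_tail2(s[1]))
    return 2;                         /* two-byte character */
  if (!is_digit(s[1]))
    return MY_CS_ILSEQ;
  if (len < 3)
    return MY_CS_TOOSMALL2;           /* four-byte form: two bytes missing */
  if (!is_head(s[2]))
    return MY_CS_ILSEQ;
  if (len < 4)
    return MY_CS_TOOSMALL;            /* four-byte form: one byte missing */
  return is_digit(s[3]) ? 4 : MY_CS_ILSEQ;
}
```

Note that the second byte alone decides between the two-byte and the four-byte form, which is why a single 0xFE only ever produces MY_CS_TOOSMALL.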

Note, the affected charset handler functions are currently rarely called directly:

  cs->cset->ismbchar(cs, str, strend);

  cs->cset->mbcharlen(cs, ch);

Instead, they are used through the macros my_mbcharlen() and my_ismbchar():

#define my_ismbchar(s, a, b)          ((s)->cset->ismbchar((s), (a), (b)))

#define my_mbcharlen(s, a)            ((s)->cset->mbcharlen((s),(a)))

This is very fortunate.
To avoid major code changes, we'll rewrite these macros
(either into new macros, or into functions) to mimic the old API.

my_ismbchar(cs, a, b) will call cs->cset->charlen(), but then will
change the return value as follows:

uint my_ismbchar(CHARSET_INFO *cs, const char *str, const char *strend)
{
  int rc= cs->cset->charlen(cs, str, strend);
  return (uint) (rc > 1 ? rc : 0);
}

This returns 0 in all cases where the old my_ismbchar() macro
returned 0:

- single-byte characters (return value 1)
- invalid byte sequences (return value 0)
- too short strings (negative return values MY_CS_TOOSMALL*)

my_mbcharlen(cs, ch) will return 1 for single byte characters and for
unknown multi-byte heads. It will also convert MY_CS_TOOSMALL to 2,
MY_CS_TOOSMALL2 to 3 and so on.
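
The conversion described above might look roughly as follows. This is a sketch under stated assumptions, not the actual patch: mbcharlen_compat(), charlen_fn and utf8_charlen_sketch() are hypothetical names, the CHARSET_INFO argument is omitted, and the UTF-8 detector is simplified (it checks head and continuation bytes only, not overlong forms or surrogates). The MY_CS_TOOSMALL* codes follow the meaning given in this description, i.e. the number of bytes still missing.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as listed in the description. */
#define MY_CS_ILSEQ      0
#define MY_CS_TOOSMALL   (-101)
#define MY_CS_TOOSMALL4  (-104)

/* Hypothetical stand-in for cs->cset->charlen(), without the cs argument. */
typedef int (*charlen_fn)(const char *str, const char *strend);

/* Simplified UTF-8 length detector, used only to exercise the wrapper. */
static int utf8_charlen_sketch(const char *str, const char *strend)
{
  const unsigned char *s= (const unsigned char *) str;
  ptrdiff_t avail;
  int need, have, i;
  assert(str < strend);
  if (s[0] < 0x80)
    return 1;
  if (s[0] >= 0xC2 && s[0] <= 0xDF)
    need= 2;
  else if (s[0] >= 0xE0 && s[0] <= 0xEF)
    need= 3;
  else if (s[0] >= 0xF0 && s[0] <= 0xF4)
    need= 4;
  else
    return MY_CS_ILSEQ;
  avail= strend - str;
  have= avail < need ? (int) avail : need;
  for (i= 1; i < have; i++)
    if ((s[i] & 0xC0) != 0x80)
      return MY_CS_ILSEQ;             /* not a continuation byte */
  if (have < need)
    return MY_CS_TOOSMALL - (need - have - 1);  /* 1 missing: -101, 2: -102 */
  return need;
}

/* Old-style mbcharlen(cs, ch) emulated on top of charlen(). */
static unsigned int mbcharlen_compat(charlen_fn charlen, unsigned int ch)
{
  char buf[1];
  int rc;
  buf[0]= (char) ch;
  rc= charlen(buf, buf + 1);
  if (rc <= MY_CS_TOOSMALL && rc >= MY_CS_TOOSMALL4)
    return (unsigned int) (MY_CS_TOOSMALL - rc) + 2;  /* -101: 2, -102: 3, ... */
  return 1;   /* single-byte character, or an unknown head (MY_CS_ILSEQ) */
}
```

With this detector, mbcharlen_compat() yields 1 for 'a', 2 for 0xC3, 3 for 0xE4 and 4 for 0xF0, and also 1 for the invalid head 0xFF, which is the same ambiguity the old API had.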

Later, when adding gb18030, we will replace some calls to my_ismbchar() and my_mbcharlen()
with direct calls to cs->cset->charlen(), in the places where it is important.

Issue Links

blocks
- MDEV-7495 Support GB18030 character set (Open)
- MDEV-7769 MY_CHARSET_INFO refactoring (Closed)

is blocked by
- MDEV-9665 Remove cs->cset->ismbchar() (Closed)
- MDEV-9811 LOAD DATA INFILE does not work well with gbk in some cases (Closed)
- MDEV-9823 LOAD DATA INFILE silently truncates incomplete byte sequences (Closed)
- MDEV-9824 LOAD DATA does not work with multi-byte strings in LINES TERMINATED BY when IGNORE is specified (Closed)
- MDEV-9842 LOAD DATA INFILE does not work well with a TEXT column when using sjis (Closed)
- MDEV-9874 LOAD XML INFILE does not handle well broken multi-byte characters (Closed)

People

Assignee: Alexander Barkov
Reporter: Alexander Barkov
Votes: 0
Watchers: 2

Dates

Created: 2014-06-17 17:07
Updated: 2020-05-05 07:18
Resolved: 2016-05-17 11:28
