Details
-
Task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
10.2.0-6, 10.2.0-7, 10.2.1-1, 10.2.1-2
Description
Some functions in MY_CHARSET_HANDLER are too limited, and new, more powerful functions have been added as replacements. This task is to clean up MY_CHARSET_HANDLER by removing the functions that have replacements.
We'll try to preserve the API as much as possible, in case some plugins use the old functions (but the ABI will change!).
1. Remove ismbchar() from MY_CHARSET_HANDLER:
(This part was done under MDEV-6353 (task) and MDEV-9665 (subtask).)
  uint (*ismbchar)(CHARSET_INFO *, const char *, const char *);
and fix the code to use a new function added in 10.1 instead:
  int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);
charlen() is a more powerful replacement for ismbchar(), as it can additionally:
- distinguish a valid single-byte character (return value 1) from a broken byte (return value 0)
- report incomplete characters (premature end of input) with the MY_CS_TOOSMALLXXX return values
For API compatibility purposes, the macro my_ismbchar() can be restored as a wrapper
function around cs->cset->charlen() instead of cs->cset->ismbchar(), something like this:
  uint my_ismbchar(CHARSET_INFO *cs, const uchar *str, const uchar *end)
  {
    int rc= cs->cset->charlen(cs, str, end);
    /* Anything shorter than 2 bytes (single byte, broken byte, incomplete character) is not a multi-byte character */
    return rc < 2 ? 0 : rc;
  }
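The mapping from charlen() results to the old ismbchar() contract can be tried out in isolation. The following is only a sketch under stated assumptions: toy_charlen() is a hypothetical, simplified UTF-8 scanner (1 to 3 bytes, no overlong or surrogate checks), and TOY_ILSEQ and TOY_TOOSMALL() are illustrative stand-ins for the server's return codes, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;

/* Hypothetical stand-ins for the server's return codes (assumed values). */
#define TOY_ILSEQ        0             /* broken byte */
#define TOY_TOOSMALL(n)  (-100 - (n))  /* n bytes needed, fewer available */

/* Toy charlen() for UTF-8, limited to 1..3-byte sequences. */
static int toy_charlen(const uchar *str, const uchar *end)
{
  if (str >= end)
    return TOY_TOOSMALL(1);
  if (str[0] < 0x80)
    return 1;                                   /* valid single byte */
  if (str[0] >= 0xC2 && str[0] <= 0xDF)
  {
    if (str + 2 > end)
      return TOY_TOOSMALL(2);                   /* incomplete character */
    return (str[1] & 0xC0) == 0x80 ? 2 : TOY_ILSEQ;
  }
  if ((str[0] & 0xF0) == 0xE0)
  {
    if (str + 3 > end)
      return TOY_TOOSMALL(3);
    return ((str[1] & 0xC0) == 0x80 && (str[2] & 0xC0) == 0x80) ? 3 : TOY_ILSEQ;
  }
  return TOY_ILSEQ;                             /* broken byte */
}

/* Old-style contract: 0 unless a complete multi-byte character starts here. */
static unsigned toy_ismbchar(const char *str, const char *end)
{
  int rc = toy_charlen((const uchar *) str, (const uchar *) end);
  return rc < 2 ? 0 : (unsigned) rc;
}

static void toy_ismbchar_demo(void)
{
  const char euro[] = "\xE2\x82\xAC";           /* U+20AC, 3 bytes */
  const char a[] = "a";
  assert(toy_ismbchar(euro, euro + 3) == 3);
  assert(toy_ismbchar(euro, euro + 2) == 0);    /* truncated input */
  assert(toy_ismbchar(a, a + 1) == 0);          /* single byte */
}
```

Note that the wrapper deliberately loses information: a single byte, a broken byte and a truncated character all collapse to 0, which is exactly the distinction charlen() keeps.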
2. Remove mbcharlen() from MY_CHARSET_HANDLER:
(This part was done under MDEV-6353 (task) and MDEV-9665 (subtask), though without the byte_property function.)
  uint (*mbcharlen)(CHARSET_INFO *, uint c);
and add a new function instead:
  uint (*byte_property)(CHARSET_INFO *, uint c);
It will return a combination of flags, e.g.:
- the byte is a stand-alone valid character
- the byte is a MB2 head
- the byte is a MB3 head
- the byte is a MB4 head
- the byte is a MB5 head
- the byte is a MB2 tail
- the byte is a MB3 tail
- the byte is a MB4 tail
- the byte is a MB5 tail
- the byte is MB23 continuation (e.g. the second byte in a 3-byte character)
- the byte is MB24 continuation (e.g. the second byte in a 4-byte character)
- the byte is MB34 continuation (e.g. the third byte in a 4-byte character)
- the byte is MBxy continuation (for all possible x and y combinations)
- and maybe some other flags
For API compatibility purposes, the old macro my_mbcharlen() can be rewritten as a wrapper around cs->cset->byte_property().
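As a rough illustration of how such a wrapper might look, here is a sketch for UTF-8 only. Everything in it is hypothetical: the TOY_BP_* flag names and values are invented for the example, a single continuation flag stands in for the finer-grained tail and MBxy flags listed above, and returning 0 for a byte that cannot start a character is an assumption about the desired behaviour.

```c
#include <assert.h>

/* Hypothetical flag set; names and values are illustrative only. */
enum
{
  TOY_BP_SINGLE   = 1 << 0,  /* stand-alone valid character */
  TOY_BP_MB2_HEAD = 1 << 1,  /* first byte of a 2-byte character */
  TOY_BP_MB3_HEAD = 1 << 2,  /* first byte of a 3-byte character */
  TOY_BP_MB4_HEAD = 1 << 3,  /* first byte of a 4-byte character */
  TOY_BP_CONT     = 1 << 4   /* continuation (tail) byte */
};

/* Toy byte_property() for UTF-8. */
static unsigned toy_byte_property(unsigned c)
{
  if (c < 0x80) return TOY_BP_SINGLE;
  if (c < 0xC0) return TOY_BP_CONT;      /* 0x80..0xBF */
  if (c < 0xC2) return 0;                /* 0xC0, 0xC1: never valid */
  if (c < 0xE0) return TOY_BP_MB2_HEAD;
  if (c < 0xF0) return TOY_BP_MB3_HEAD;
  if (c < 0xF5) return TOY_BP_MB4_HEAD;
  return 0;                              /* 0xF5..0xFF: never valid */
}

/* Old-style mbcharlen() derived from the head flags. */
static unsigned toy_mbcharlen(unsigned c)
{
  unsigned p = toy_byte_property(c);
  if (p & TOY_BP_SINGLE)   return 1;
  if (p & TOY_BP_MB2_HEAD) return 2;
  if (p & TOY_BP_MB3_HEAD) return 3;
  if (p & TOY_BP_MB4_HEAD) return 4;
  return 0;                              /* not a valid leading byte */
}

static void toy_mbcharlen_demo(void)
{
  assert(toy_mbcharlen('a') == 1);
  assert(toy_mbcharlen(0xE2) == 3);      /* head of a 3-byte character */
  assert(toy_mbcharlen(0x82) == 0);      /* continuation byte */
}
```

In UTF-8 a byte has exactly one role, so a flat enumeration would do. A bit mask pays off in character sets where the same byte value can be a head in one context and a tail in another.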
3. Remove well_formed_len:
  size_t (*well_formed_len)(CHARSET_INFO *,
                            const char *b, const char *e,
                            size_t nchars, int *error);
and use a new function added in 10.1 instead:
  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
                                    const char *str, const char *end,
                                    size_t nchars,
                                    MY_STRCOPY_STATUS *status);
The new function replaces both well_formed_len() and numchars(). It can return:
a. the "number of characters", as the return value
b. the "number of bytes", which is calculated directly from status->m_source_end_pos (as its offset from str)
c. "there are bad bytes", in status->m_well_formed_error_pos, which is NULL if there are no bad bytes.
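The three results can be seen together in a small stand-alone sketch. It assumes a toy 7-bit character set in which every byte below 0x80 is one character and anything else is a bad byte. TOY_STRCOPY_STATUS mirrors only the two status fields named above, and both toy_* functions are hypothetical illustrations rather than the server's code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors only the two fields mentioned in the description. */
typedef struct
{
  const char *m_source_end_pos;         /* where scanning stopped */
  const char *m_well_formed_error_pos;  /* first bad byte, or NULL */
} TOY_STRCOPY_STATUS;

/* Toy well_formed_char_length() for a 7-bit character set. */
static size_t toy_well_formed_char_length(const char *str, const char *end,
                                          size_t nchars,
                                          TOY_STRCOPY_STATUS *status)
{
  const char *p = str;
  size_t n = 0;
  status->m_well_formed_error_pos = NULL;
  while (p < end && n < nchars)
  {
    if ((unsigned char) *p >= 0x80)
    {
      status->m_well_formed_error_pos = p;  /* c. bad byte found */
      break;
    }
    p++;
    n++;
  }
  status->m_source_end_pos = p;             /* b. byte length = p - str */
  return n;                                 /* a. number of characters */
}

/* The old well_formed_len() contract, expressed through the new function. */
static size_t toy_well_formed_len(const char *b, const char *e,
                                  size_t nchars, int *error)
{
  TOY_STRCOPY_STATUS st;
  (void) toy_well_formed_char_length(b, e, nchars, &st);
  *error = st.m_well_formed_error_pos != NULL;
  return (size_t) (st.m_source_end_pos - b);
}

static void toy_well_formed_demo(void)
{
  const char s[] = "hello";
  TOY_STRCOPY_STATUS st;
  assert(toy_well_formed_char_length(s, s + 5, 3, &st) == 3);
  assert(st.m_source_end_pos == s + 3);
  assert(st.m_well_formed_error_pos == NULL);
}
```

In this single-byte toy the character count and the byte count always coincide. In a multi-byte character set they differ, which is why the new interface reports them separately.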