[MDEV-7769] MY_CHARSET_INFO refactoring - Jira

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Fix Version/s: 10.2.3
Component/s: Character Sets
Labels:
- refactoring

Sprint:
10.2.0-6, 10.2.0-7, 10.2.1-1, 10.2.1-2

Description

Several functions in MY_CHARSET_HANDLER have been superseded by new, more powerful functions. This task cleans up MY_CHARSET_HANDLER by removing the functions that now have replacements.

We'll try to preserve the API as much as possible, in case some plugins use the old functions (but the ABI will change!).

1. Remove ismbchar() from MY_CHARSET_HANDLER:

(This part was done as part of ~~MDEV-6353~~ (task) and ~~MDEV-9665~~ (subtask))

uint    (*ismbchar)(CHARSET_INFO *, const char *, const char *);

and fix the code to use a new function added in 10.1 instead:

 int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);

charlen() is a more powerful replacement for ismbchar(), as it can additionally:

- distinguish a valid single-byte character (return value 1) from a broken byte (return value 0)
- report incomplete characters (premature end of input) with return values MY_CS_TOOSMALXXX

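For API compatibility purposes, the macro my_ismbchar() can be restored as a wrapper
function around cs->cset->charlen() instead of cs->cset->ismbchar(), something like this:

uint my_ismbchar(CHARSET_INFO *cs, const uchar *str, const uchar *end)
{
  /* charlen() returns 0 for a broken byte and a negative code for an incomplete character */
  int rc= cs->cset->charlen(cs, str, end);
  /* only a length of 2 or more means "multi-byte character" */
  return rc < 2 ? 0 : rc;
}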
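To make the return-value convention concrete, here is a minimal self-contained sketch. It does not use the real CHARSET_INFO structures: toy_charlen(), toy_ismbchar() and the TOY_CS_* codes are hypothetical stand-ins with placeholder values, and the encoding is a simplified UTF-8 subset (1 to 3 bytes, no overlong or surrogate checks).

```c
#include <assert.h>

typedef unsigned char uchar;

/* Hypothetical stand-ins for the real return codes; values are placeholders. */
#define TOY_CS_ILSEQ      0      /* broken byte */
#define TOY_CS_TOOSMALL  (-101)  /* incomplete character: input ended too early */

/* Toy charlen() for a UTF-8-like encoding limited to 1..3-byte characters. */
static int toy_charlen(const uchar *str, const uchar *end)
{
  int len, i;
  assert(str <= end);
  if (str >= end)
    return TOY_CS_TOOSMALL;
  if (str[0] < 0x80)
    return 1;                     /* valid single-byte character */
  if (str[0] >= 0xC2 && str[0] <= 0xDF)
    len = 2;
  else if (str[0] >= 0xE0 && str[0] <= 0xEF)
    len = 3;
  else
    return TOY_CS_ILSEQ;          /* stray tail or unsupported head */
  if (end - str < len)
    return TOY_CS_TOOSMALL;       /* head seen, tail bytes missing */
  for (i = 1; i < len; i++)
    if ((str[i] & 0xC0) != 0x80)
      return TOY_CS_ILSEQ;        /* tail byte is malformed */
  return len;
}

/* Same mapping as the wrapper above: only lengths >= 2 count as multi-byte. */
static unsigned toy_ismbchar(const uchar *str, const uchar *end)
{
  int rc = toy_charlen(str, end);
  return rc < 2 ? 0 : (unsigned) rc;
}
```

With this sketch, a truncated two-byte character and a broken byte both make toy_ismbchar() return 0, while toy_charlen() still tells the two cases apart. That extra information is what callers gain by switching to charlen().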

2. Remove mbcharlen() from MY_CHARSET_HANDLER:

(This part was done as part of ~~MDEV-6353~~ (task) and ~~MDEV-9665~~ (subtask), though without the byte_property function)

  uint    (*mbcharlen)(CHARSET_INFO *, uint c);

and add a new function added in 10.1 instead:

  uint    (*byte_property)(CHARSET_INFO *, uint c);

It will return a combination of flags, e.g.:

- the byte is a stand-alone valid character
- the byte is a MB2 head
- the byte is a MB3 head
- the byte is a MB4 head
- the byte is a MB5 head
- the byte is a MB2 tail
- the byte is a MB3 tail
- the byte is a MB4 tail
- the byte is a MB5 tail
- the byte is a MB23 continuation (e.g. the second byte in a 3-byte character)
- the byte is a MB24 continuation (e.g. the second byte in a 4-byte character)
- the byte is a MB34 continuation (e.g. the third byte in a 4-byte character)
- the byte is a MBxy continuation (for all possible x and y combinations)
- and maybe some other flags

For API compatibility purposes, the old macro my_mbcharlen() can be rewritten as a wrapper around cs->cset->byte_property().
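As an illustration only, the flag idea could look like the sketch below for UTF-8. The ticket does not define the actual constants, so every TOY_BP_* name, toy_byte_property() and toy_mbcharlen() are assumptions made for this example. The sketch also collapses the separate tail and continuation flags into a single TOY_BP_TAIL bit, and returning 0 for a byte that cannot start a character is a choice made here, not something the ticket specifies.

```c
#include <assert.h>

/* Hypothetical flag names; the ticket does not define the actual constants. */
#define TOY_BP_SINGLE    0x01u  /* stand-alone valid character */
#define TOY_BP_MB2_HEAD  0x02u  /* first byte of a 2-byte character */
#define TOY_BP_MB3_HEAD  0x04u  /* first byte of a 3-byte character */
#define TOY_BP_MB4_HEAD  0x08u  /* first byte of a 4-byte character */
#define TOY_BP_TAIL      0x10u  /* may appear as a non-leading byte */

/* Toy byte_property() for UTF-8: classify one byte by its value alone. */
static unsigned toy_byte_property(unsigned c)
{
  assert(c <= 0xFF);
  if (c < 0x80) return TOY_BP_SINGLE;
  if (c < 0xC0) return TOY_BP_TAIL;
  if (c < 0xC2) return 0;               /* 0xC0, 0xC1: never valid in UTF-8 */
  if (c < 0xE0) return TOY_BP_MB2_HEAD;
  if (c < 0xF0) return TOY_BP_MB3_HEAD;
  if (c < 0xF5) return TOY_BP_MB4_HEAD;
  return 0;                             /* 0xF5..0xFF: never valid in UTF-8 */
}

/* Possible my_mbcharlen()-style wrapper: the length implied by a leading
   byte, or 0 when the byte cannot start a character. */
static unsigned toy_mbcharlen(unsigned c)
{
  unsigned p = toy_byte_property(c);
  if (p & TOY_BP_SINGLE)   return 1;
  if (p & TOY_BP_MB2_HEAD) return 2;
  if (p & TOY_BP_MB3_HEAD) return 3;
  if (p & TOY_BP_MB4_HEAD) return 4;
  return 0;
}
```

In UTF-8 each byte value falls into exactly one class, so no flags are combined here. In encodings where the head and tail byte ranges overlap (GBK is one such case), a single byte could carry several flags OR'ed together, which is presumably why the description speaks of a combination of flags.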

3. Remove well_formed_len:

  size_t  (*well_formed_len)(CHARSET_INFO *,
                             const char *b,const char *e,
                             size_t nchars, int *error);

and use a new function added in 10.1 instead:

  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
                                    const char *str, const char *end,
                                    size_t nchars,
                                    MY_STRCOPY_STATUS *status);

The new function replaces both well_formed_len() and numchars(). It can return:
a. "number of characters" as the return value.
b. "number of bytes", calculated directly from status->m_source_end_pos (its offset from str).
c. "there are bad bytes" in status->m_well_formed_error_pos, which is NULL if there are no bad bytes.
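The three results can be shown with a small stand-alone sketch. It is not the server implementation: TOY_STRCOPY_STATUS, toy_len() and toy_well_formed_char_length() are hypothetical names, only the two status field names are taken from the description, and the encoding is a toy one (ASCII plus 2-byte UTF-8 sequences).

```c
#include <assert.h>
#include <stddef.h>

/* Toy mirror of the status fields named in the description. */
typedef struct
{
  const char *m_source_end_pos;         /* where scanning stopped */
  const char *m_well_formed_error_pos;  /* first bad byte, or NULL */
} TOY_STRCOPY_STATUS;

/* Length of the character at s: 1 or 2, or 0 if it is broken or cut off. */
static int toy_len(const char *s, const char *end)
{
  unsigned char c = (unsigned char) s[0];
  if (c < 0x80)
    return 1;
  if (c >= 0xC2 && c <= 0xDF && end - s >= 2 &&
      ((unsigned char) s[1] & 0xC0) == 0x80)
    return 2;
  return 0;
}

/* Scan at most nchars well-formed characters of [str, end). */
static size_t toy_well_formed_char_length(const char *str, const char *end,
                                          size_t nchars,
                                          TOY_STRCOPY_STATUS *status)
{
  size_t n = 0;
  assert(str <= end && status != NULL);
  status->m_well_formed_error_pos = NULL;
  while (n < nchars && str < end)
  {
    int len = toy_len(str, end);
    if (len == 0)
    {
      status->m_well_formed_error_pos = str;  /* (c) position of the bad byte */
      break;
    }
    str += len;
    n++;
  }
  status->m_source_end_pos = str;             /* (b) byte count = this - start */
  return n;                                   /* (a) number of characters */
}
```

A caller that previously needed well_formed_len() for the byte count and numchars() for the character count gets both from one scan: the return value is the character count, and status.m_source_end_pos minus the start pointer is the byte count.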

Issue Links

blocks: MDEV-7768 create a "charset" service (Open)

is blocked by: MDEV-6353 my_ismbchar() and my_mbcharlen() refactoring (Closed)

People

Assignee: Alexander Barkov

Reporter: Alexander Barkov

Votes: 0

Watchers: 2

Dates

Created: 2015-03-13 06:11

Updated: 2016-10-10 10:37

Resolved: 2016-10-10 10:37

Time Tracking

Estimated:

Remaining: 1.25d

Logged: 0.75d
