[MDEV-7769] MY_CHARSET_INFO refactoring Created: 2015-03-13  Updated: 2016-10-10  Resolved: 2016-10-10

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 10.2.3

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: refactoring

Issue Links:
Blocks
blocks MDEV-7768 create a "charset" service Open
is blocked by MDEV-6353 my_ismbchar() and my_mbcharlen() refa... Closed
Sprint: 10.2.0-6, 10.2.0-7, 10.2.1-1, 10.2.1-2

 Description   

Some functions in MY_CHARSET_HANDLER are too limited, and more powerful functions have been added as replacements. This task is to clean up MY_CHARSET_HANDLER by removing the functions that now have replacements.

We'll try to preserve the API as much as possible, in case some plugins use the old functions (but the ABI will change!).

1. Remove ismbchar() from MY_CHARSET_HANDLER:

(This part was done under terms of MDEV-6353 (task) and MDEV-9665 (subtask))

uint    (*ismbchar)(CHARSET_INFO *, const char *, const char *);

and fix the code to use a new function added in 10.1 instead:

 int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);

charlen() is a more powerful replacement for ismbchar(), as it can additionally:

  • distinguish between a valid single-byte character (return value 1) and a broken byte (return value 0)
  • report incomplete characters (premature end of input) with MY_CS_TOOSMALLxxx return values

For API compatibility purposes, the macro my_ismbchar() can be restored as a wrapper function around cs->cset->charlen() instead of cs->cset->ismbchar(), something like this:

uint my_ismbchar(CHARSET_INFO *cs, const uchar *str, const uchar *end)
{
  int rc= cs->cset->charlen(cs, str, end);
  return rc < 2 ? 0 : rc;
}
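
To make the return-value mapping concrete, here is a minimal self-contained sketch. Everything prefixed with toy_ or TOY_ is a hypothetical stand-in, not the real MariaDB definition: the charlen implementation is a simplified UTF-8 check (1 to 3 bytes, no overlong or surrogate checks), and a single TOY_CS_TOOSMALL code stands in for the whole MY_CS_TOOSMALLxxx family.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;
typedef unsigned int uint;

/* Hypothetical "input ended too early" code; the real server uses a family
   of MY_CS_TOOSMALLxxx values. */
#define TOY_CS_TOOSMALL (-101)

struct toy_charset;

/* Toy handler with only the one member this sketch needs. */
typedef struct toy_charset_handler
{
  int (*charlen)(const struct toy_charset *cs,
                 const uchar *str, const uchar *end);
} TOY_CHARSET_HANDLER;

typedef struct toy_charset
{
  const TOY_CHARSET_HANDLER *cset;
} TOY_CHARSET_INFO;

/* Simplified UTF-8 charlen:
   >0 : byte length of a valid character
    0 : broken byte sequence
   <0 : incomplete character (more bytes needed) */
static int toy_utf8_charlen(const TOY_CHARSET_INFO *cs,
                            const uchar *str, const uchar *end)
{
  (void) cs;
  assert(str <= end);
  if (str >= end)
    return TOY_CS_TOOSMALL;
  if (str[0] < 0x80)
    return 1;
  if (str[0] >= 0xC2 && str[0] <= 0xDF)
  {
    if (end - str < 2)
      return TOY_CS_TOOSMALL;
    return (str[1] & 0xC0) == 0x80 ? 2 : 0;
  }
  if ((str[0] & 0xF0) == 0xE0)
  {
    if (end - str < 3)
      return TOY_CS_TOOSMALL;
    return ((str[1] & 0xC0) == 0x80 && (str[2] & 0xC0) == 0x80) ? 3 : 0;
  }
  return 0;
}

/* Compatibility wrapper in the spirit of my_ismbchar(): single-byte,
   broken and incomplete input all collapse to 0. */
static uint toy_ismbchar(const TOY_CHARSET_INFO *cs,
                         const uchar *str, const uchar *end)
{
  int rc= cs->cset->charlen(cs, str, end);
  return rc < 2 ? 0 : (uint) rc;
}

static const TOY_CHARSET_HANDLER toy_utf8_handler= { toy_utf8_charlen };
static const TOY_CHARSET_INFO toy_utf8= { &toy_utf8_handler };
```

Note that the wrapper deliberately loses information: callers that need to tell a broken byte from a truncated character have to call charlen() directly.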

2. Remove mbcharlen() from MY_CHARSET_HANDLER:

(This part was done under terms of MDEV-6353 (task) and MDEV-9665 (subtask), without the function byte_property though)

  uint    (*mbcharlen)(CHARSET_INFO *, uint c);

and add a new function instead:

  uint    (*byte_property)(CHARSET_INFO *, uint c);

It will return a combination of flags, e.g.:

  • the byte is a stand-alone valid character
  • the byte is a MB2 head
  • the byte is a MB3 head
  • the byte is a MB4 head
  • the byte is a MB5 head
  • the byte is a MB2 tail
  • the byte is a MB3 tail
  • the byte is a MB4 tail
  • the byte is a MB5 tail
  • the byte is MB23 continuation (e.g. the second byte in a 3-byte character)
  • the byte is MB24 continuation (e.g. the second byte in a 4-byte character)
  • the byte is MB34 continuation (e.g. the third byte in a 4-byte character)
  • the byte is MBxy continuation (for all possible x and y combinations)
  • and maybe some other flags

For API compatibility purposes, the old macro my_mbcharlen() can be rewritten as a wrapper around cs->cset->byte_property().
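
One possible shape for such a wrapper is sketched below. The flag names (TOY_BP_*), their values, and the UTF-8 classification are all assumptions made for illustration; the issue does not define them. The CHARSET_INFO argument is omitted to keep the sketch self-contained, and a single TOY_BP_TAIL flag stands in for the finer-grained tail and continuation flags listed above.

```c
#include <assert.h>

typedef unsigned int uint;

/* Hypothetical flag bits; not taken from the MariaDB sources. */
enum
{
  TOY_BP_SINGLE=   1 << 0,  /* stand-alone valid character        */
  TOY_BP_MB2_HEAD= 1 << 1,  /* first byte of a 2-byte character   */
  TOY_BP_MB3_HEAD= 1 << 2,  /* first byte of a 3-byte character   */
  TOY_BP_MB4_HEAD= 1 << 3,  /* first byte of a 4-byte character   */
  TOY_BP_TAIL=     1 << 4   /* any non-leading byte (simplified)  */
};

/* Toy byte_property() for UTF-8: classify one byte value. */
static uint toy_utf8_byte_property(uint c)
{
  assert(c <= 0xFF);
  if (c < 0x80)
    return TOY_BP_SINGLE;
  if (c < 0xC0)
    return TOY_BP_TAIL;
  if (c >= 0xC2 && c <= 0xDF)
    return TOY_BP_MB2_HEAD;
  if (c >= 0xE0 && c <= 0xEF)
    return TOY_BP_MB3_HEAD;
  if (c >= 0xF0 && c <= 0xF4)
    return TOY_BP_MB4_HEAD;
  return 0;                 /* 0xC0, 0xC1, 0xF5..0xFF never occur in UTF-8 */
}

/* Compatibility wrapper in the spirit of my_mbcharlen(): derive the
   expected character length from the head flags. Returning 0 for bytes
   that cannot start a character is an assumption of this sketch. */
static uint toy_mbcharlen(uint c)
{
  uint prop= toy_utf8_byte_property(c);
  if (prop & TOY_BP_SINGLE)
    return 1;
  if (prop & TOY_BP_MB2_HEAD)
    return 2;
  if (prop & TOY_BP_MB3_HEAD)
    return 3;
  if (prop & TOY_BP_MB4_HEAD)
    return 4;
  return 0;
}
```

In a character set where the same byte value can be both a head and a tail, byte_property() would return several flags at once, which is something the single return value of mbcharlen() cannot express.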

3. Remove well_formed_len:

  size_t  (*well_formed_len)(CHARSET_INFO *,
                             const char *b,const char *e,
                             size_t nchars, int *error); 

and use a new function added in 10.1 instead:

  size_t (*well_formed_char_length)(CHARSET_INFO *cs,
                                    const char *str, const char *end,
                                    size_t nchars,
                                    MY_STRCOPY_STATUS *status);

The new function replaces both well_formed_len() and numchars(). It can report:
a. the number of characters, as the return value;
b. the number of bytes, calculated directly from status->m_source_end_pos;
c. the position of the first bad byte in status->m_well_formed_error_pos, or NULL if there are no bad bytes.
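
The following sketch shows how a caller might read all three results from one call. It is not the server implementation: TOY_STRCOPY_STATUS mirrors only the two status members named above, the CHARSET_INFO argument is dropped, and the toy character set is plain 7-bit ASCII, where any byte with the high bit set counts as bad.

```c
#include <assert.h>
#include <stddef.h>

/* Toy status structure; only the two members mentioned in this issue. */
typedef struct toy_strcopy_status
{
  const char *m_source_end_pos;         /* where scanning stopped        */
  const char *m_well_formed_error_pos;  /* first bad byte, or NULL       */
} TOY_STRCOPY_STATUS;

/* Scan at most nchars characters of [str, end) in a toy 7-bit ASCII
   character set. Returns the number of well-formed characters seen. */
static size_t toy_ascii_well_formed_char_length(const char *str,
                                                const char *end,
                                                size_t nchars,
                                                TOY_STRCOPY_STATUS *status)
{
  const char *p= str;
  size_t n= 0;
  assert(str <= end && status != NULL);
  status->m_well_formed_error_pos= NULL;
  while (p < end && n < nchars)
  {
    if ((unsigned char) *p >= 0x80)
    {
      status->m_well_formed_error_pos= p;  /* stop at the first bad byte */
      break;
    }
    p++;
    n++;
  }
  status->m_source_end_pos= p;
  return n;
}

/* (b) from the list above: the well-formed length in bytes. */
static size_t toy_well_formed_bytes(const char *str,
                                    const TOY_STRCOPY_STATUS *status)
{
  return (size_t) (status->m_source_end_pos - str);
}
```

A caller that previously combined well_formed_len() with numchars() would, under these assumptions, make a single call and then inspect the return value and the status structure.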

