  1. MariaDB Server
  2. MDEV-7769

MY_CHARSET_INFO refactoring


    Details

    • Sprint:
      10.2.0-6, 10.2.0-7, 10.2.1-1, 10.2.1-2

      Description

      Some functions in MY_CHARSET_HANDLER are too limited, and more powerful replacements have already been added. This task is to clean up MY_CHARSET_HANDLER by removing the functions that have replacements.

      We'll try to preserve the API as much as possible, in case some plugins use the old functions (but the ABI will change!).

      1. Remove ismbchar() from MY_CHARSET_HANDLER:

      (This part was done as part of MDEV-6353 (task) and MDEV-9665 (subtask).)

      uint    (*ismbchar)(CHARSET_INFO *, const char *, const char *);
      

      and fix the code to use a new function added in 10.1 instead:

       int (*charlen)(CHARSET_INFO *cs, const uchar *str, const uchar *end);
      

      charlen() is a more powerful replacement for ismbchar(), as it can additionally:

      • distinguish a valid single-byte character (return value 1) from a broken byte (return value 0)
      • report incomplete characters (premature end of input) with the MY_CS_TOOSMALLxxx return values
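
      The difference between the two contracts can be sketched with a toy charset. This is only an illustration under stated assumptions: toy_charlen(), toy_ismbchar() and the TOY_* constants are hypothetical stand-ins invented for this sketch (not the server's definitions), and the charset is a simplified layout where bytes 0x00..0x7F are single-byte characters and 0xC2..0xDF start a 2-byte character.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;

/* Stand-in return codes for this sketch only (assumptions, not the
   server's definitions). */
#define TOY_ILSEQ      0       /* broken byte */
#define TOY_TOOSMALL   (-101)  /* no input left at all */
#define TOY_TOOSMALL2  (-102)  /* 2-byte character cut off by end of input */

/* Toy charlen(): 0x00..0x7F is a single-byte character; 0xC2..0xDF starts
   a 2-byte character whose second byte must be 0x80..0xBF. */
static int toy_charlen(const uchar *str, const uchar *end)
{
  assert(str <= end);
  if (str == end)
    return TOY_TOOSMALL;
  if (str[0] < 0x80)
    return 1;                         /* valid single-byte character */
  if (str[0] >= 0xC2 && str[0] <= 0xDF)
  {
    if (str + 1 == end)
      return TOY_TOOSMALL2;           /* incomplete character */
    return (str[1] >= 0x80 && str[1] <= 0xBF) ? 2 : TOY_ILSEQ;
  }
  return TOY_ILSEQ;                   /* broken byte */
}

/* The old ismbchar()-style contract: 0, or the length of a complete
   multi-byte character. */
static unsigned toy_ismbchar(const uchar *str, const uchar *end)
{
  int rc= toy_charlen(str, end);
  return rc < 2 ? 0 : (unsigned) rc;
}
```

      In this sketch toy_ismbchar() returns 0 for a valid single-byte character, for a broken byte and for a truncated character alike, while toy_charlen() gives a different result for each of the three cases.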

      For API compatibility, the macro my_ismbchar() can be reimplemented as a wrapper
      function around cs->cset->charlen() instead of cs->cset->ismbchar(), something like this:

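      uint my_ismbchar(CHARSET_INFO *cs, const uchar *str, const uchar *end)
      {
        /* charlen() returns 1 for a single-byte character, 0 for a broken byte,
           and a negative MY_CS_TOOSMALLxxx value for an incomplete character */
        int rc= cs->cset->charlen(cs, str, end);
        /* Old contract: 0 unless a complete multi-byte character starts at str */
        return rc < 2 ? 0 : (uint) rc;
      }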
      

      2. Remove mbcharlen() from MY_CHARSET_HANDLER:

      (This part was done as part of MDEV-6353 (task) and MDEV-9665 (subtask), but without the byte_property() function.)

        uint    (*mbcharlen)(CHARSET_INFO *, uint c);
      

      and add a new function instead:

        uint    (*byte_property)(CHARSET_INFO *, uint c);
      

      It will return a combination of flags, e.g.:

      • the byte is a stand-alone valid character
      • the byte is a MB2 head
      • the byte is a MB3 head
      • the byte is a MB4 head
      • the byte is a MB5 head
      • the byte is a MB2 tail
      • the byte is a MB3 tail
      • the byte is a MB4 tail
      • the byte is a MB5 tail
      • the byte is an MB23 continuation (e.g. the second byte in a 3-byte character)
      • the byte is an MB24 continuation (e.g. the second byte in a 4-byte character)
      • the byte is an MB34 continuation (e.g. the third byte in a 4-byte character)
      • the byte is an MBxy continuation (for all possible x and y combinations)
      • and maybe some other flags

      For API compatibility, the old macro my_mbcharlen() can be rewritten as a wrapper around cs->cset->byte_property().
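
      A minimal sketch of how such flags and the compatibility wrapper could fit together is given here. Everything in it is an assumption made for illustration: the TOY_BYTE_* names and values, toy_byte_property() and toy_mbcharlen() are hypothetical, the byte ranges follow a simplified UTF-8-like layout, and returning 0 for a byte that cannot start a character is a choice of this sketch.

```c
#include <assert.h>

/* Hypothetical flag names and values, invented for this sketch. */
enum toy_byte_flags
{
  TOY_BYTE_STANDALONE= 1 << 0,  /* stand-alone valid character */
  TOY_BYTE_MB2_HEAD=   1 << 1,
  TOY_BYTE_MB3_HEAD=   1 << 2,
  TOY_BYTE_MB4_HEAD=   1 << 3,
  TOY_BYTE_MB2_TAIL=   1 << 4,
  TOY_BYTE_MB3_TAIL=   1 << 5,
  TOY_BYTE_MB4_TAIL=   1 << 6,
  TOY_BYTE_MB23=       1 << 7,  /* second byte of a 3-byte character */
  TOY_BYTE_MB24=       1 << 8,  /* second byte of a 4-byte character */
  TOY_BYTE_MB34=       1 << 9   /* third byte of a 4-byte character */
};

/* Toy byte_property() for a simplified UTF-8-like layout. */
static unsigned toy_byte_property(unsigned c)
{
  assert(c <= 0xFF);
  if (c < 0x80)
    return TOY_BYTE_STANDALONE;
  if (c <= 0xBF)                     /* 0x80..0xBF: any non-head position */
    return TOY_BYTE_MB2_TAIL | TOY_BYTE_MB3_TAIL | TOY_BYTE_MB4_TAIL |
           TOY_BYTE_MB23 | TOY_BYTE_MB24 | TOY_BYTE_MB34;
  if (c >= 0xC2 && c <= 0xDF)
    return TOY_BYTE_MB2_HEAD;
  if (c >= 0xE0 && c <= 0xEF)
    return TOY_BYTE_MB3_HEAD;
  if (c >= 0xF0 && c <= 0xF4)
    return TOY_BYTE_MB4_HEAD;
  return 0;                          /* never valid in this toy charset */
}

/* mbcharlen()-style wrapper: character length implied by the first byte,
   or 0 if the byte cannot start a character. */
static unsigned toy_mbcharlen(unsigned c)
{
  unsigned flags= toy_byte_property(c);
  if (flags & TOY_BYTE_STANDALONE) return 1;
  if (flags & TOY_BYTE_MB2_HEAD)   return 2;
  if (flags & TOY_BYTE_MB3_HEAD)   return 3;
  if (flags & TOY_BYTE_MB4_HEAD)   return 4;
  return 0;
}
```

      The wrapper only looks at the head flags, so callers of the old interface see the same kind of answer as before, while new code can also ask about tail and continuation positions.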

      3. Remove well_formed_len:

        size_t  (*well_formed_len)(CHARSET_INFO *,
                                   const char *b,const char *e,
                                   size_t nchars, int *error); 
      

      and use a new function added in 10.1 instead:

        size_t (*well_formed_char_length)(CHARSET_INFO *cs,
                                          const char *str, const char *end,
                                          size_t nchars,
                                          MY_STRCOPY_STATUS *status);
      

      The new function replaces both well_formed_len() and numchars(). It reports:
      a. the number of characters, as its return value
      b. the number of bytes, which can be calculated directly from status->m_source_end_pos
      c. whether there are bad bytes: status->m_well_formed_error_pos points to the bad byte, or is NULL if there are none.
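
      The three results can be illustrated with a small self-contained sketch. It is not the server implementation: TOY_STRCOPY_STATUS and toy_well_formed_char_length() are hypothetical names, only the two status members mentioned above are modelled, the CHARSET_INFO argument is dropped, and the toy charset accepts only single-byte characters 0x00..0x7F and 2-byte characters with a head in 0xC2..0xDF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with only the two members discussed above. */
typedef struct
{
  const char *m_source_end_pos;         /* where scanning stopped */
  const char *m_well_formed_error_pos;  /* bad byte, or NULL if none */
} TOY_STRCOPY_STATUS;

/* Scans at most nchars well-formed characters in [str, end) and returns
   how many were found. A truncated 2-byte character is treated as a bad
   byte here, for simplicity. */
static size_t toy_well_formed_char_length(const char *str, const char *end,
                                          size_t nchars,
                                          TOY_STRCOPY_STATUS *status)
{
  size_t count= 0;
  assert(str <= end);
  status->m_well_formed_error_pos= NULL;
  while (count < nchars && str < end)
  {
    unsigned char c= (unsigned char) str[0];
    size_t len;
    if (c < 0x80)
      len= 1;
    else if (c >= 0xC2 && c <= 0xDF && str + 1 < end &&
             ((unsigned char) str[1] & 0xC0) == 0x80)
      len= 2;
    else
    {
      status->m_well_formed_error_pos= str;  /* remember the bad byte */
      break;
    }
    str+= len;
    count++;
  }
  status->m_source_end_pos= str;
  return count;
}
```

      With this sketch, the byte length of the well-formed prefix is status.m_source_end_pos minus the start pointer, so a single call yields the character count, the byte count and the error position together.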

              People

              Assignee:
              bar Alexander Barkov
              Reporter:
              bar Alexander Barkov
