[MDEV-21581] Helper functions and methods for CHARSET_INFO Created: 2020-01-28  Updated: 2020-01-28  Resolved: 2020-01-28

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 10.5.1

Type: Task
Priority: Major
Reporter: Alexander Barkov
Assignee: Alexander Barkov
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
blocks MDEV-8334: Rename utf8 to utf8mb3 (Closed)
blocks MDEV-21504: Collation: Create shared library for ... (Closed)

Description

The current call notation for CHARSET_INFO routines has two disadvantages:

  • it is cumbersome
  • it exposes the internal structure of CHARSET_INFO to the caller

Examples:

  // An example from storage/innodb/
  tmp_length = charset->coll->strnxfrm(charset, str, str_length,
                                       str_length, tmp_str,
                                       tmp_length, 0);

  // An example from storage/innodb/
  mbl = cs->cset->ctype(cs, &ctype, (uchar*) doc, (uchar*) end);

  // An example from storage/myisam/
  mbl= cs->cset->ctype(cs, &ctype, (uchar*)doc, (uchar*)end);

  // An example from storage/myisam/
  keyseg->charset->cset->fill(keyseg->charset,
                              (char*) pos + length,
                              keyseg->length - length,
                              ' ');

To make the call notation simple and robust against changes in CHARSET_INFO, let's do the following:

  • Add pure C wrappers for all virtual functions in MY_CHARSET_HANDLER and MY_COLLATION_HANDLER, e.g.

    static inline void
    my_ci_fill(CHARSET_INFO *cs, char *to, size_t len, int ch)
    {
      (cs->cset->fill)(cs, to, len, ch);
    }
    

    Let's name all new functions with the my_ci_ prefix, to make it clear that the first argument is a CHARSET_INFO.

  • Add C++ methods into struct charset_info_st, like this:

    struct charset_info_st
    {
    #ifdef __cplusplus
    ...
      void fill(char *to, size_t len, int ch) const
      {
        (cset->fill)(this, to, len, ch);
      }
    ...
      size_t strnxfrm(uchar *dst, size_t dstlen, uint nweights,
                      const uchar *src, size_t srclen, uint flags) const
      {
        return (coll->strnxfrm)(this,
                                dst, dstlen, nweights,
                                src, srclen, flags);
      }
    ...
    #endif
    };
    

With these wrappers and methods, the code in the examples above becomes:

  // C++ code
  tmp_length = charset->strnxfrm(str, str_length,
                                 str_length, tmp_str,
                                 tmp_length, 0);

  // C++ code
  mbl = cs->ctype(&ctype, (uchar*) doc, (uchar*) end);

  /* Pure C code */
  mbl= my_ci_ctype(cs, &ctype, (uchar*)doc, (uchar*)end);

  /* Pure C code */
  my_ci_fill(keyseg->charset, (char*) pos + length,
             keyseg->length - length,
             ' ');
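
The wrapper pattern described above can be tried in isolation. The following is a minimal sketch, not the MariaDB source: the struct layouts are cut down to a single fill member, and mock_fill_8bit, mock_handler, mock_charset, pad_tail_c_style and pad_tail_cxx_style are hypothetical names invented for the illustration. It is written in C++ so that the my_ci_ function and the method can appear side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Mock types for illustration only. The real definitions have many more
// members; only the parts needed for fill() are reproduced here.
struct charset_info_st;
typedef const struct charset_info_st CHARSET_INFO;

struct my_charset_handler_st
{
  void (*fill)(CHARSET_INFO *cs, char *to, size_t len, int ch);
};
typedef struct my_charset_handler_st MY_CHARSET_HANDLER;

struct charset_info_st
{
  const MY_CHARSET_HANDLER *cset;

  // C++ method: the object passes itself as the CHARSET_INFO argument.
  void fill(char *to, size_t len, int ch) const
  {
    (cset->fill)(this, to, len, ch);
  }
};

// C-style wrapper: the caller names the CHARSET_INFO exactly once.
static inline void
my_ci_fill(CHARSET_INFO *cs, char *to, size_t len, int ch)
{
  (cs->cset->fill)(cs, to, len, ch);
}

// Hypothetical single-byte implementation behind the function pointer.
static void mock_fill_8bit(CHARSET_INFO *, char *to, size_t len, int ch)
{
  std::memset(to, ch, len);
}

static const MY_CHARSET_HANDLER mock_handler= { mock_fill_8bit };
static const charset_info_st mock_charset= { &mock_handler };

// Caller in the C style: pads buf[used..cap) with spaces.
static void pad_tail_c_style(char *buf, size_t used, size_t cap)
{
  assert(used <= cap);
  my_ci_fill(&mock_charset, buf + used, cap - used, ' ');
}

// Caller in the C++ style: same effect through the method.
static void pad_tail_cxx_style(char *buf, size_t used, size_t cap)
{
  assert(used <= cap);
  mock_charset.fill(buf + used, cap - used, ' ');
}
```

Neither caller mentions cset, so both would keep compiling if the handler pointer were renamed or moved, as long as the wrapper and the method are updated.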

The new notation is better: it contains no sequences like cs->cset-> or cs->coll->, and the CHARSET_INFO parameter is mentioned once instead of twice. As a result, the new style of caller code:

  • is shorter
  • is less bug-prone: the handler and the CHARSET_INFO argument can no longer be taken from two different objects by mistake
  • is future-proof: it won't change if we change the structure of CHARSET_INFO, e.g. decompose CHARSET_INFO into smaller pieces responsible for character set and collation properties. Only the wrapper functions and methods will change; the caller code will remain the same.
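
The future-proofing argument can be illustrated with a small sketch. Everything in it is an assumption made for the example: layout_now mimics the current shape with a single fill member, layout_split is one hypothetical way of moving the character set properties into a separate piece, and pad_key stands in for arbitrary caller code. The template is used only so that the identical caller text can be compiled against both layouts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Layout 1: mock of the current shape, with the handler pointer stored
// directly in the charset object.
namespace layout_now
{
  struct charset_info_st;
  struct handler_st
  {
    void (*fill)(const charset_info_st *cs, char *to, size_t len, int ch);
  };
  struct charset_info_st
  {
    const handler_st *cset;
    void fill(char *to, size_t len, int ch) const
    {
      (cset->fill)(this, to, len, ch);
    }
  };
  static void fill_8bit(const charset_info_st *, char *to, size_t len, int ch)
  {
    std::memset(to, ch, len);
  }
  static const handler_st mock_handler= { fill_8bit };
  static const charset_info_st latin1_like= { &mock_handler };
}

// Layout 2: hypothetical decomposition. The character set properties live
// in their own piece; a collation piece could sit next to it.
namespace layout_split
{
  struct charset_part_st;
  struct handler_st
  {
    void (*fill)(const charset_part_st *cs, char *to, size_t len, int ch);
  };
  struct charset_part_st
  {
    const handler_st *handler;
  };
  struct charset_info_st
  {
    const charset_part_st *charset;
    // Only this method body differs from layout 1.
    void fill(char *to, size_t len, int ch) const
    {
      (charset->handler->fill)(charset, to, len, ch);
    }
  };
  static void fill_8bit(const charset_part_st *, char *to, size_t len, int ch)
  {
    std::memset(to, ch, len);
  }
  static const handler_st mock_handler= { fill_8bit };
  static const charset_part_st mock_part= { &mock_handler };
  static const charset_info_st latin1_like= { &mock_part };
}

// Caller code: pads key with spaces up to width. It only uses cs->fill(),
// so the same text works with either layout.
template <class CharsetInfo>
std::string pad_key(const CharsetInfo *cs, const std::string &key, size_t width)
{
  std::string out(key);
  if (out.size() < width)
  {
    size_t old_size= out.size();
    out.resize(width);
    cs->fill(&out[old_size], width - old_size, ' ');
  }
  return out;
}
```

A call such as pad_key(&layout_now::latin1_like, "ab", 5) and the same call with layout_split::latin1_like go through different internal structures, yet the body of pad_key is untouched. With the old cs->cset->fill(cs, ...) notation, the caller would have needed editing for the second layout.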
