MariaDB Server / MDEV-26764

JSON_HB Histograms: handle BINARY and unassigned characters

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 10.7
    • Component/s: Optimizer
    • Labels: None

      Description

      This is a follow-up to the discussion with Alexander Barkov.

      Part #1: unassigned characters

      The utf8mb4 charset can represent every Unicode character. Other charsets, however, may contain so-called unassigned characters: byte combinations that are not mapped to any Unicode character. (These holes leave room for introducing new characters later. For example, the Euro sign was initially absent from some charsets and was added in newer versions of them.)

      The histogram collection code will try to convert unassigned characters to UTF-8. The conversion will fail, and the result will not be what we need. This needs to be fixed.
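
To illustrate what an unassigned character looks like in practice, here is a minimal sketch in Python. It uses Python's own `cp1252` codec as a stand-in for a server charset; this is an assumption made for illustration and is not MariaDB's conversion code. The helper name `unassigned_bytes` is hypothetical.

```python
def unassigned_bytes(charset):
    """Return the single-byte values that `charset` maps to no Unicode character."""
    holes = []
    for value in range(256):
        try:
            bytes([value]).decode(charset)
        except UnicodeDecodeError:
            # No Unicode character is assigned to this byte, so the
            # conversion (and therefore any re-encoding to UTF-8) fails.
            holes.append(value)
    return holes


if __name__ == "__main__":
    holes = unassigned_bytes("cp1252")
    print("unassigned in cp1252:", [hex(b) for b in holes])

    # 0x80 is assigned in cp1252: it is the Euro sign.
    print(repr(b"\x80".decode("cp1252")))

    # 0x81 is a hole: converting it raises an error.
    try:
        b"\x81".decode("cp1252")
    except UnicodeDecodeError as exc:
        print("conversion failed:", exc.reason)
```

A column value containing such a byte cannot be turned into UTF-8 text by a plain conversion, which is the situation this part of the issue describes.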

      Part #2: [VAR]BINARY

      Conceptually, [VAR]BINARY data does not represent UTF-8 characters.

      Technically, one can store it as UTF-8, because UTF-8 has a character for (my_wc_t)NUM for every NUM in the [0x00, 0xFF] range.
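
The byte-to-code-point mapping described above can be sketched as follows. This is an illustration in Python under the assumption that each byte NUM is treated as the code point U+00NUM; the function names `binary_to_utf8` and `utf8_to_binary` are hypothetical and do not come from the server source.

```python
def binary_to_utf8(raw):
    """Encode arbitrary bytes as UTF-8 by treating each byte as a code point."""
    # Each byte value 0x00..0xFF becomes the Unicode code point with the
    # same number, which UTF-8 can always encode.
    return "".join(chr(b) for b in raw).encode("utf-8")


def utf8_to_binary(encoded):
    """Reverse binary_to_utf8: recover the original bytes."""
    # bytes() raises ValueError if a code point is above 0xFF, i.e. if the
    # input was not produced by binary_to_utf8.
    return bytes(ord(ch) for ch in encoded.decode("utf-8"))


if __name__ == "__main__":
    every_byte = bytes(range(256))
    # The mapping is lossless for all 256 byte values.
    assert utf8_to_binary(binary_to_utf8(every_byte)) == every_byte

    # Bytes below 0x80 stay one byte long; bytes from 0x80 up take two.
    print(binary_to_utf8(b"A").hex())     # 41
    print(binary_to_utf8(b"\xff").hex())  # c3bf
```

The round trip works, but the stored text is a sequence of code points rather than meaningful characters, which matches the point that [VAR]BINARY data does not conceptually represent UTF-8 characters.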

              People

              Assignee:
              psergei Sergei Petrunia
              Reporter:
              psergei Sergei Petrunia