Details
Bug
Status: Closed
Major
Resolution: Fixed
Description
This is a follow-up to the discussion with bar.
Part #1: unassigned characters
The UTF8MB4 charset can represent all known characters. However, other charsets may have so-called unassigned characters: byte combinations that are not mapped to any Unicode character. (These holes are used, for example, to introduce new characters: the EURO sign was initially absent from charsets, and newer versions of those charsets later added it.)
As for histogram collection: the histogram collection code will try to convert unassigned characters to UTF-8. This conversion will fail, and the result will not be what we need. This needs to be fixed.
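The failure mode can be reproduced outside the server. The sketch below is only an illustration: it uses Python's built-in cp1252 codec as a stand-in for a charset with holes, and the helper names to_utf8 and unassigned_bytes are hypothetical. It is not the histogram collection code, and the server's own charset tables may treat these bytes differently.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Convert bytes stored in `charset` to UTF-8.

    Raises UnicodeDecodeError when `raw` contains a byte combination
    that the charset does not map to any Unicode character.
    """
    return raw.decode(charset).encode("utf-8")


def unassigned_bytes(charset: str) -> list[int]:
    """List the single-byte values that `charset` leaves unassigned."""
    holes = []
    for value in range(256):
        try:
            bytes([value]).decode(charset)
        except UnicodeDecodeError:
            holes.append(value)
    return holes


if __name__ == "__main__":
    # Assumption: cp1252 as implemented by Python's codec, chosen only
    # because it has both the EURO sign and a few unassigned bytes.
    print([hex(v) for v in unassigned_bytes("cp1252")])
    print(to_utf8(b"\x80", "cp1252"))  # 0x80 is the EURO sign in cp1252
    try:
        to_utf8(b"\x81", "cp1252")     # 0x81 is a hole in cp1252
    except UnicodeDecodeError as exc:
        print("conversion failed:", exc.reason)
```

Any code that collects values this way has to decide what to do with the failing bytes (skip, substitute, or escape them); the sketch only shows that the conversion cannot succeed as-is.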
Part #2: [VAR]BINARY
Conceptually, [VAR]BINARY data does not represent UTF-8 characters.
Technically, one can store it in UTF-8, since UTF-8 has a character for (my_wc_t)NUM for every NUM in the [0x00, 0xFF] range.
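To make the byte-to-character mapping concrete, here is a minimal sketch, assuming each byte NUM is simply treated as the Unicode code point U+00NUM and then encoded as UTF-8. The function names binary_to_utf8 and utf8_to_binary are hypothetical and are not taken from the server code.

```python
def binary_to_utf8(raw: bytes) -> bytes:
    """Map each byte NUM to code point U+00NUM and encode as UTF-8."""
    return "".join(chr(value) for value in raw).encode("utf-8")


def utf8_to_binary(encoded: bytes) -> bytes:
    """Reverse the mapping: decode UTF-8, turn each code point into a byte.

    Raises ValueError if a code point is above U+00FF, i.e. if the input
    was not produced by binary_to_utf8.
    """
    return bytes(ord(ch) for ch in encoded.decode("utf-8"))


if __name__ == "__main__":
    every_byte = bytes(range(256))
    stored = binary_to_utf8(every_byte)
    # Bytes 0x00-0x7F take one UTF-8 byte, 0x80-0xFF take two.
    print(len(stored))
    print(utf8_to_binary(stored) == every_byte)
```

The mapping is lossless, but the stored text is not the original byte string: values of 0x80 and above double in size, and the result carries no meaning as characters.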
Issue Links
relates to
- MDEV-26519 JSON Histograms: improve histogram collection (Closed)
- MDEV-26724 Endless loop in json_escape_to_string upon collecting JSON histograms with empty string in a column (Closed)