Part #1: unassigned characters

The UTF8MB4 charset can represent all known characters. Other charsets, however, may have so-called unassigned characters: byte combinations that are not mapped to any Unicode character. (These holes are used, for example, to introduce new characters: the euro sign was initially absent from charsets, and newer versions of those charsets added it.)

As for histogram collection: the histogram collection code tries to convert column values to UTF-8, including values that contain unassigned characters. For those characters the conversion fails, and the result is not what we need. This needs to be fixed.
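The failure mode can be illustrated with a minimal Python sketch. It uses Python's own `cp1252` codec table as a stand-in for a charset with holes; this is an assumption for illustration only, and the server's charset tables may map these bytes differently. The helper name `to_utf8` is hypothetical.

```python
from typing import Optional


def to_utf8(raw: bytes, charset: str) -> Optional[bytes]:
    """Convert raw bytes in `charset` to UTF-8.

    Returns None when the input contains a byte combination that the
    charset does not map to any Unicode character.
    """
    try:
        # Decode to Unicode code points first, then encode as UTF-8.
        return raw.decode(charset).encode("utf-8")
    except UnicodeDecodeError:
        # The byte is unassigned in this charset: there is nothing to convert to.
        return None


# 0x80 is the euro sign in Python's cp1252 table, so it converts.
euro = to_utf8(b"\x80", "cp1252")

# 0x81 is unassigned in Python's cp1252 table, so the conversion fails.
hole = to_utf8(b"\x81", "cp1252")
```

In this sketch the caller has to decide what to do with `None`; silently storing a partial or substituted value is the kind of result the text above describes as not what we need.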

Part #2: [VAR]BINARY

Conceptually, [VAR]BINARY data does not represent UTF-8 characters.

Technically, one can still store it as UTF-8: for each NUM in the [0x00, 0xFF] range, UTF-8 has a character with code (my_wc_t)NUM, so every byte value can be mapped to a character.
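As a rough sketch of that mapping, the Python snippet below treats each byte NUM as the code point NUM and encodes the result as UTF-8, then reverses the process. The function names are hypothetical and this is not the server's implementation; it only shows that the mapping is lossless for every byte value.

```python
def binary_to_utf8(raw: bytes) -> bytes:
    """Map each byte NUM to the code point U+00NUM and encode as UTF-8."""
    return "".join(chr(b) for b in raw).encode("utf-8")


def utf8_to_binary(encoded: bytes) -> bytes:
    """Reverse the mapping: decode UTF-8, then turn each code point back into a byte."""
    return bytes(ord(ch) for ch in encoded.decode("utf-8"))


# Bytes below 0x80 stay one byte long; bytes from 0x80 to 0xFF become two bytes.
sample = bytes([0x00, 0x7F, 0x80, 0xFF])
stored = binary_to_utf8(sample)

# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
round_trip = utf8_to_binary(binary_to_utf8(all_bytes))
```

One consequence visible here is size: values in the upper half of the byte range take two bytes each once stored this way.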

Issue Links

relates to:

MDEV-26519 JSON Histograms: improve histogram collection (Closed)

MDEV-26724 Endless loop in json_escape_to_string upon collecting JSON histograms with empty string in a column (Closed)

People

Assignee: Sergei Petrunia

Reporter: Sergei Petrunia

Dates

Created: 2021-10-04 20:00

Updated: 2022-01-19 15:10

Resolved: 2021-12-03 17:14

Time Tracking

Estimated: Not Specified

Logged: 0.25d