[MDEV-26764] JSON_HB Histograms: handle BINARY and unassigned characters Created: 2021-10-04  Updated: 2022-01-19  Resolved: 2021-12-03

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: None
Fix Version/s: 10.8.0

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-26519 JSON Histograms: improve histogram co... Closed
relates to MDEV-26724 Endless loop in json_escape_to_string... Closed

 Description   

This is a follow-up to the discussion with bar.

Part #1: unassigned characters

The UTF8MB4 charset can represent all known characters. However, other charsets may have so-called unassigned characters: byte combinations that are not mapped to any Unicode character. (These holes are used, e.g., for introducing new characters. For example, the EURO SIGN was initially not present in charsets; newer versions of the charsets then introduced it.)

As for histogram collection: the histogram collection code will try to convert unassigned characters to UTF-8. The conversion will fail, and the result will not be what we need. This needs to be fixed.
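
To illustrate what an unassigned character looks like, here is a minimal Python sketch. It does not use the server's charset code; it relies on Python's own cp1252 codec, chosen here only as an assumed example of a charset with holes (byte 0x80 is the euro sign, byte 0x81 is unassigned). The helper name to_utf8_or_error is hypothetical.

```python
def to_utf8_or_error(raw: bytes, charset: str) -> bytes:
    """Convert bytes in `charset` to UTF-8; fail loudly on unassigned bytes.

    Hypothetical helper for illustration only, not MariaDB code.
    """
    try:
        # Strict decoding: a byte with no Unicode mapping raises an error
        # instead of being silently replaced.
        text = raw.decode(charset)
    except UnicodeDecodeError as exc:
        bad = raw[exc.start]
        raise ValueError(
            f"unassigned byte 0x{bad:02X} at offset {exc.start} in {charset}"
        ) from exc
    return text.encode("utf-8")


# 0x80 is assigned in cp1252 (euro sign) and converts cleanly.
euro_utf8 = to_utf8_or_error(b"\x80", "cp1252")

# 0x81 is a hole in cp1252, so the conversion is rejected.
try:
    to_utf8_or_error(b"a\x81", "cp1252")
except ValueError as err:
    message = str(err)
```

Raising an error instead of substituting a replacement character keeps a bad value from silently ending up as a histogram bucket endpoint.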

Part #2 [VAR]BINARY

Conceptually, [VAR]BINARY data does not represent UTF-8 characters.

Technically, one can still store it as UTF-8, because UTF-8 can encode the character (my_wc_t)NUM for every NUM in the [0x00, 0xFF] range.
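
The byte-to-character mapping described above can be sketched in Python as follows. This is an illustration under assumptions, not the server implementation: it uses Python's latin-1 codec, which maps byte N to code point U+00NN, to stand in for the (my_wc_t)NUM conversion, and the function names are hypothetical.

```python
def binary_to_utf8(raw: bytes) -> bytes:
    """Treat each byte N as the character with code point N, then encode as UTF-8.

    Hypothetical helper for illustration only, not MariaDB code.
    """
    # latin-1 decoding maps every byte 0x00..0xFF to U+0000..U+00FF.
    return raw.decode("latin-1").encode("utf-8")


def utf8_to_binary(encoded: bytes) -> bytes:
    """Inverse mapping: recover the original bytes from the UTF-8 form."""
    return encoded.decode("utf-8").encode("latin-1")


sample = bytes([0x00, 0x7F, 0x80, 0xFF])
as_utf8 = binary_to_utf8(sample)

# Bytes below 0x80 stay one byte long; bytes 0x80..0xFF become two bytes each.
assert len(as_utf8) == 6
assert utf8_to_binary(as_utf8) == sample
```

The mapping is lossless, so any binary value survives the round trip, but values containing bytes at or above 0x80 take more space in their UTF-8 form.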



 Comments   
Comment by Sergei Petrunia [ 2021-10-18 ]

Note:

  • I've made the code produce an error when it encounters an unassigned character while collecting a JSON histogram. I don't think handling these characters is a priority.
  • BINARY is handled by presenting it through UTF-8. This does work, although it might not be what is intended from a purist's point of view.
Generated at Thu Feb 08 09:47:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.