[MDEV-27497] DOUBLE_PREC_HB histogram produces wrong estimates for COMPRESSED columns Created: 2022-01-13  Updated: 2023-12-05

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6, 10.7
Fix Version/s: 10.4, 10.5, 10.6

Type: Bug Priority: Minor
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-26519 JSON Histograms: improve histogram co... Closed

 Description   

--source include/have_sequence.inc
 
CREATE TABLE t (a VARCHAR(16) COMPRESSED COLLATE ucs2_sinhala_ci);
INSERT INTO t SELECT REPEAT(CHR(seq%26+97),16) FROM seq_1_to_100;
 
set histogram_type= DOUBLE_PREC_HB;
analyze table t persistent for all;
ANALYZE FORMAT=JSON SELECT a FROM t WHERE 'c' BETWEEN 'l' AND a ;
 
set histogram_type= JSON_HB;
analyze table t persistent for all;
ANALYZE FORMAT=JSON SELECT a FROM t WHERE 'c' BETWEEN 'l' AND a ;
 
# Cleanup
DROP TABLE t;

DOUBLE_PREC_HB preview-10.8-MDEV-26519-json-histograms 21a81025

        "table": {
          "table_name": "t",
          "access_type": "ALL",
          "r_loops": 1,
          "rows": 100,
          "r_rows": 100,
          "r_table_time_ms": 0.006661524,
          "r_other_time_ms": 0.006350519,
          "filtered": 3.846199989,
          "r_filtered": 0,
          "attached_condition": "'c' between 'l' and t.a"
        }

JSON_HB

        "table": {
          "table_name": "t",
          "access_type": "ALL",
          "r_loops": 1,
          "rows": 100,
          "r_rows": 100,
          "r_table_time_ms": 0.005778671,
          "r_other_time_ms": 0.005598088,
          "filtered": 93,
          "r_filtered": 0,
          "attached_condition": "'c' between 'l' and t.a"
        }

The actual result set is naturally empty. So, DOUBLE_PREC is close, JSON is far away.

Reproducible with at least MyISAM, InnoDB, Aria.



 Comments   
Comment by Sergei Petrunia [ 2022-01-14 ]

Summarizing the report data:

DOUBLE_PREC_HB
          "filtered": 3.846199989,
          "r_filtered": 0,
 
JSON
          "filtered": 93,
          "r_filtered": 0,

If I try to remove the COMPRESSED attribute, I get this:

DOUBLE_PREC_HB
          "filtered": 92.96875,
          "r_filtered": 0,
 
JSON_HB
          "filtered": 93,
          "r_filtered": 0,

DOUBLE_PREC_HB becomes just as bad as JSON_HB. It looks like the good estimate in the original report has been an accident.

Comment by Sergei Petrunia [ 2022-01-14 ]

The column is actually too short to be compressed.
Status counters confirm this: no [de]compressions happen during the test run:

MariaDB [test]> show status like '%compressions';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Column_compressions   | 0     |
| Column_decompressions | 0     |
+-----------------------+-------+
2 rows in set (0.003 sec)

Comment by Sergei Petrunia [ 2022-01-14 ]

I compare the histograms that were produced when

  • the testcase used COMPRESSED column
  • the testcase didn't use COMPRESSED column.

The JSON_HB histograms are the same. The DOUBLE_PREC_HB histograms are different.
Compressed:
(I've done some manual reformatting to make the differences more apparent)

MariaDB [test]> select decode_histogram(hist_type,histogram) from save10;
0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.00000,0.04001,
0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.04001,
0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.04001,
0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.04001,
0.00000,0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.03999,
0.00000,0.00000,0.00000,0.00000,0.04001,
0.00000,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.00000,0.04001,0.00000,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.00000,0.04001,0.00000,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.00000,0.00000,0.04001,0.00000,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.00000,0.04001,0.00000,0.00000,0.00000,0.03999,0.00000,0.00000,0.00000,0.04001,0.00000,0.00000,0.00000 |

Non-compressed:

MariaDB [test]> select decode_histogram(hist_type,histogram) from save12;
0.00000,0.00000,0.00000,0.04056,
0.00000,0.00000,0.00000,0.00000,0.03880,
0.00000,0.00000,0.00000,0.00000,0.00000,0.02292,
0.00000,0.00000,0.00000,0.00000,0.05292,
0.00000,0.00000,0.00000,0.00000,0.08112,
0.00000,0.00000,0.00000,0.00000,0.01411,
0.00000,0.00000,0.00000,0.00000,0.05644,
0.00000,0.00000,0.00000,0.00000,0.04585,
0.00000,0.00000,0.00000,0.00000,0.03703,
0.00000,0.00000,0.00000,0.00000,0.02998,
0.00000,0.00000,0.00000,0.00000,0.00000, 0.02293,
0.00000,0.00000,0.00000,0.00000,0.07936,
0.00000,0.00000,0.00000,0.00000,0.01587,
0.00000,0.00000,0.00000,0.00000,0.05290,
0.00000,0.00000,0.00000,0.00000,0.06526,0.00000,0.00000,0.00000,0.00000,0.02293,0.00000,0.00000,0.00000,0.00000,0.02116,0.00000,0.00000,0.00000,0.00000,0.07407,0.00000,0.00000,0.00000,0.00000,0.04233,0.00000,0.00000,0.00000,0.00000,0.00000,0.05115,0.00000,0.00000,0.00000,0.00000,0.06525,0.00000,0.00000,0.00000,0.00000,0.02293,0.00000,0.00000,0.00000,0.00000,0.01587,0.00000,0.00000,0.00000,0.00706,0.00000,0.00000,0.00000,0.02116,0.00000,0.00000,0.00000 |

Comment by Sergei Petrunia [ 2022-01-14 ]

The condition is impossible to satisfy but the optimizer doesn't recognize that.

The ranges it is estimating and the estimates:

DOUBLE_PREC_HB, non-COMPRESSED:

            "rows_estimation": [
              {
                "selectivity_for_indexes": [],
                "selectivity_for_columns": [
                  {
                    "column_name": "a",
                    "ranges": ["c <= a"],
                    "selectivity_from_histogram": 0.9296875
                  }
                ],
                "cond_selectivity": 0.9296875
              },

DOUBLE_PREC_HB, COMPRESSED:

            "rows_estimation": [
              {
                "selectivity_for_indexes": [],
                "selectivity_for_columns": [
                  {
                    "column_name": "a",
                    "ranges": ["c <= a"],
                    "selectivity_from_histogram": 0.038462
                  }
                ],
                "cond_selectivity": 0.038462
              },

JSON:

                "selectivity_for_columns": [
                  {
                    "column_name": "a",
                    "ranges": ["c <= a"],
                    "selectivity_from_histogram": 0.93
                  }
                ],
                "cond_selectivity": 0.93
              },

what would be the right selectivity for t.a >= 'c'? The two values that are less are aaaaa and bbbbb.
sel = (26 - 2)/26.= 0.92, which agrees with the JSON_HB selectivity.

Comment by Julien Fritsch [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Comment by JiraAutomate [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Generated at Thu Feb 08 09:53:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.