[MDEV-19474] Histogram statistics are used even with optimizer_use_condition_selectivity=3 Created: 2019-05-14  Updated: 2021-06-10  Resolved: 2021-02-16

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.1, 10.2, 10.3, 10.4
Fix Version/s: 10.2.38, 10.3.29, 10.4.19, 10.5.10

Type: Bug Priority: Major
Reporter: Varun Gupta (Inactive) Assignee: Varun Gupta (Inactive)
Resolution: Fixed Votes: 0
Labels: None


 Description   

Dataset

create table t1(a int);
insert into t1 values (1),(2),(2),(3),(4);
set use_stat_tables='preferably';
set optimizer_use_condition_selectivity=4;
set histogram_size= 255;
analyze table t1;

MariaDB [test]> analyze select * from t1 where a=2;
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 5    | 5.00   |    39.84 |      40.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
1 row in set (0.002 sec)

So with optimizer_use_condition_selectivity=4 , we get filtered as 39.84

MariaDB [test]> set optimizer_use_condition_selectivity=3;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [test]> analyze select * from t1 where a=2;
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 5    | 5.00   |    39.84 |      40.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
1 row in set (0.002 sec)

So with optimizer_use_condition_selectivity=3 , we get * filtered as 39.84*

MariaDB [test]> flush tables;  // Statistics are flushed, need to read again to use them
Query OK, 0 rows affected (0.002 sec)
 
MariaDB [test]> analyze select * from t1 where a=2;
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL | 5    | 5.00   |    25.00 |      40.00 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+--------+----------+------------+-------------+
1 row in set (0.008 sec)

So the problem here is we don't stop using histogram statistics with optimizer_use_condition_selectivity=3.
When the statistics are flushed, we don't re-read the histogram statistics and that is why we get the different value for filtered after flushing the EITS



 Comments   
Comment by Varun Gupta (Inactive) [ 2020-05-31 ]

So here the case is when optimizer_use_condition_selectivity=4 , we read the histogram statistics and then we try to use these statistics even when the optimizer_use_condition_selectivity=3, I think we need to add a check of not using the available histogram statistics for optimizer_use_condition_selectivity=3.

Comment by Varun Gupta (Inactive) [ 2020-05-31 ]

Patch
https://github.com/MariaDB/server/commit/1774f852c61cb83b50028dea0170aac1761b7fa3

Comment by Varun Gupta (Inactive) [ 2020-06-01 ]

for the first patch the argument was to fix it in a way such that we cache the result if the histogram can be used or not.
I think we can use the TABLE::histograms_are_read.

Made a small patch for this:
https://github.com/MariaDB/server/commit/e485ac2b6014b2b6b521870e04d9f16a18630e31

Comment by Michael Widenius [ 2021-01-29 ]

Patch ok
The one suggestion I have is renaming 'histograms_are_read' to 'histograms_needed' or 'histograms_should_be_used'

Comment by Michael Widenius [ 2021-01-29 ]

ok to push after review comments are handled

Comment by Varun Gupta (Inactive) [ 2021-02-02 ]

Latest patch addressing the review
http://lists.askmonty.org/pipermail/commits/2021-January/014466.html

Comment by Varun Gupta (Inactive) [ 2021-02-02 ]

Analysis

ANALYZE TABLE t1 PERSISTENT FOR ALL; 

this is run and so we have EITS for table t1 in the statistical tables.

So lets try to explain the cases:

set optimizer_use_condition_selectivity=4;
SELECT * from t1 where a=2;   // execution #1

So for the execution #1 the statistics in the TABLE structure are not read from
TABLE_SHARE*, so when we call the function read_statistics_for_tables
we come to the snippet

if (table_share->stats_cb.stats_are_ready())

which return false

(gdb) p tl->table->alias.Ptr
$5 = 0x60c000017b70 "t1"
(gdb) p table_share->stats_cb.stats_are_ready()
$6 = false
(gdb) p tl->table
$7 = (TABLE *) 0x61f000032cf0

and we set statistics_for_tables_is_needed= true;
this means that we need to read statistics for table t1

Now comes to the part where we start to read the statistics from table share
into the TABLE structure

We have this snippet:

    if (!tl->is_view_or_derived() && tl->table && (table_share= tl->table->s) &&
        table_share->tmp_table == NO_TMP_TABLE &&
        table_share->table_category == TABLE_CATEGORY_USER)
    {
      if (!tl->table->stats_is_read)
      {
        if (!read_statistics_for_table(thd, tl->table, stat_tables))
          dump_stats_from_share_to_table(tl->table);
        else
          continue;
      }
      if (thd->variables.optimizer_use_condition_selectivity > 3)
        (void) read_histograms_for_table(thd, tl->table, stat_tables);
    }

So as optimizer_use_condition_selectivity > 3 here we would also read
histograms for the fields of the table t1.

Now coming up with the second case:

set optimizer_use_condition_selectivity=3;
SELECT * from t1 where a=2;   // execution #2

So for the execution #2 we read the TABLE structure from the internal memory
cache and when we come to the function read_statistics_for_tables
we see that

if (table_share->stats_cb.stats_are_ready())

which returns true as the stats have been read

(gdb) p table_share->stats_cb.stats_are_ready()
$10 = true
(gdb) p tl->table->alias.Ptr
$8 = 0x60c000017b70 "t1"
(gdb) p tl->table
$9 = (TABLE *) 0x61f000032cf0

and then we execute the branch inside this if condition

        if (table_share->stats_cb.stats_are_ready())
        {
          if (!tl->table->stats_is_read)
            dump_stats_from_share_to_table(tl->table);
          tl->table->histograms_are_read=
            table_share->stats_cb.histograms_are_ready();
          if (table_share->stats_cb.histograms_are_ready() ||
              thd->variables.optimizer_use_condition_selectivity <= 3)
            continue;
        }

(gdb) p tl->table->histograms_are_read
$11 = true
(gdb) p tl->table->stats_is_read
$12 = true

so tl->table->stats_is_read is set to TRUE
and we set tl->table->histograms_are_read to true as
table_share->stats_cb.histograms_are_ready() returns true;
So here we see that both histograms_are_read and stats_are_read are true for table t1.

Now when we call *get_column_range_cardinality * which is called when we calculate selectivity for the predicate a= 2 then we check

if (hist->is_available())

which returns true and we use the histograms to evaluate the selectivity for a=2
Though optimizer_use_condition_selectivity does not allow us to use the
histograms for the execution #2.

Comment by Igor Babaev [ 2021-02-03 ]

I think that the following straightforward patch is the most appropriate for resolution of this bug:

diff --git a/sql/sql_statistics.cc b/sql/sql_statistics.cc
index b631720..71dca62 100644
--- a/sql/sql_statistics.cc
+++ b/sql/sql_statistics.cc
@@ -3703,6 +3703,7 @@ double get_column_range_cardinality(Field *field,
 {
   double res;
   TABLE *table= field->table;
+  THD *thd= table->in_use;
   Column_statistics *col_stats= field->read_stats;
   double tab_records= (double)table->stat_records();
 
@@ -3748,7 +3749,8 @@ double get_column_range_cardinality(Field *field,
           col_stats->min_max_values_are_provided())
       {
         Histogram *hist= &col_stats->histogram;
-        if (hist->is_available())
+        if (thd->variables.optimizer_use_condition_selectivity > 3  &&
+            hist->is_available())
         {
           store_key_image_to_rec(field, (uchar *) min_endp->key,
                                  field->key_length());
@@ -3792,7 +3794,8 @@ double get_column_range_cardinality(Field *field,
         max_mp_pos= 1.0;
 
       Histogram *hist= &col_stats->histogram;
-      if (!hist->is_available())
+      if (!(thd->variables.optimizer_use_condition_selectivity > 3 &&
+            hist->is_available()))
         sel= (max_mp_pos - min_mp_pos);
       else
         sel= hist->range_selectivity(min_mp_pos, max_mp_pos);

Generated at Thu Feb 08 08:51:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.