[MDEV-19314] Join performance on composite secondary index lookup with inequality Created: 2019-04-23  Updated: 2019-05-15  Resolved: 2019-05-15

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.1, 10.1.38, 10.3.13, 10.2, 10.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Arnaud Adant Assignee: Igor Babaev
Resolution: Not a Bug Votes: 1
Labels: None

Attachments: PNG File DOUBLE_PREC.png     PNG File DOUBLE_PREC_AFTER_9999.png     PNG File SINGLE_PREC.png     PNG File SINGLE_PREC_AFTER_9999.png    

 Description   

Joins with ids and dates are very common especially with temporal tables.
Looks like MariaDB optimizer only uses the id and not the date for the lookup leading to poor join performance when the date is discriminant.
Here is the test case.

create table t(
id int not null auto_increment primary key,
id2 int not null, 
valid_to datetime(6) not null,
index id2_valid_to (id2, valid_to)
) engine=InnoDB;
 
truncate table t;
 
insert into t(id2, valid_to) values(rand()*1000, date_add(now(), interval - rand()*1000 day));
insert into t(id2, valid_to) values(rand()*1000, date_add(now(), interval - rand()*1000 day));
insert into t(id2, valid_to) values(rand()*1000, date_add(now(), interval - rand()*1000 day));
insert into t(id2, valid_to) values(rand()*1000, date_add(now(), interval - rand()*1000 day));
replace into t(id2, valid_to) select rand()*1000, date_add(now(), interval - rand()*1000 day) from t t1, t t2, t t3, t t4, t t5, t t6, t t7, t t8, t t9;
drop temporary table if exists t_max;
create temporary table t_max as select max(id) id from t group by id2;
update t, t_max set t.valid_to='9999-12-31' where t.id = t_max.id;
commit;
analyze table t;
select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';

MySQL 8.0 is fast, MariaDB 10.1.29, 10.1.38, 10.3.13 are slow.

select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to ='9999-12-31'; 

is fast



 Comments   
Comment by Arnaud Adant [ 2019-04-23 ]

Workaround :

 select count(*) from t join (select distinct id2 from t t2 where t2.valid_to >='9999-12-31')  t2 on t.id2 = t2.id2;

Comment by Alice Sherepa [ 2019-04-29 ]

Mysql 8.0.15

mysql> select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+----------+
| count(*) |
+----------+
|   262740 |
+----------+
1 row in set (0.12 sec)
 
mysql> explain  select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+----+-------------+-------+------------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref         | rows   | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
|  1 | SIMPLE      | t2    | NULL       | index | id2_valid_to  | id2_valid_to | 12      | NULL        | 261800 |    33.33 | Using where; Using index |
|  1 | SIMPLE      | t     | NULL       | ref   | id2_valid_to  | id2_valid_to | 4       | test.t2.id2 |    268 |   100.00 | Using index              |
+----+-------------+-------+------------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
 
 
| select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31' | {
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          {
            "expanded_query": "/* select#1 */ select count(0) AS `count(*)` from (`t` join `t` `t2` on(((`t`.`id2` = `t2`.`id2`) and (`t2`.`valid_to` >= '9999-12-31'))))"
          },
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "JOIN_condition_to_WHERE",
                "parenthesis_removal"
              ],
              "expanded_query": "/* select#1 */ select count(0) AS `count(*)` from `t` join `t` `t2` where ((`t`.`id2` = `t2`.`id2`) and (`t2`.`valid_to` >= '9999-12-31'))"
            }
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
            "condition_processing": {
              "condition": "WHERE",
              "original_condition": "((`t`.`id2` = `t2`.`id2`) and (`t2`.`valid_to` >= '9999-12-31'))",
              "steps": [
                {
                  "transformation": "equality_propagation",
                  "resulting_condition": "((`t2`.`valid_to` >= '9999-12-31') and multiple equal(`t`.`id2`, `t2`.`id2`))"
                },
                {
                  "transformation": "constant_propagation",
                  "resulting_condition": "((`t2`.`valid_to` >= '9999-12-31') and multiple equal(`t`.`id2`, `t2`.`id2`))"
                },
                {
                  "transformation": "trivial_condition_removal",
                  "resulting_condition": "((`t2`.`valid_to` >= '9999-12-31') and multiple equal(`t`.`id2`, `t2`.`id2`))"
                }
              ]
            }
          },
          {
            "substitute_generated_columns": {
            }
          },
          {
            "table_dependencies": [
              {
                "table": "`t`",
                "row_may_be_null": false,
                "map_bit": 0,
                "depends_on_map_bits": [
                ]
              },
              {
                "table": "`t` `t2`",
                "row_may_be_null": false,
                "map_bit": 1,
                "depends_on_map_bits": [
                ]
              }
            ]
          },
          {
            "ref_optimizer_key_uses": [
              {
                "table": "`t`",
                "field": "id2",
                "equals": "`t2`.`id2`",
                "null_rejecting": false
              },
              {
                "table": "`t` `t2`",
                "field": "id2",
                "equals": "`t`.`id2`",
                "null_rejecting": false
              }
            ]
          },
          {
            "rows_estimation": [
              {
                "table": "`t`",
                "table_scan": {
                  "rows": 261800,
                  "cost": 152.25
                }
              },
              {
                "table": "`t` `t2`",
                "table_scan": {
                  "rows": 261800,
                  "cost": 152.25
                }
              }
            ]
          },
          {
            "considered_execution_plans": [
              {
                "plan_prefix": [
                ],
                "table": "`t`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "id2_valid_to",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 261800,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 261800,
                      "cost": 26332,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 261800,
                "cost_for_plan": 26332,
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`t`"
                    ],
                    "table": "`t` `t2`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "id2_valid_to",
                          "rows": 268.79,
                          "cost": 7.14e6,
                          "chosen": true
                        },
                        {
                          "access_type": "scan",
                          "chosen": false,
                          "cause": "covering_index_better_than_full_scan"
                        }
                      ]
                    },
                    "condition_filtering_pct": 100,
                    "rows_for_plan": 7.04e7,
                    "cost_for_plan": 7.16e6,
                    "chosen": true
                  }
                ]
              },
              {
                "plan_prefix": [
                ],
                "table": "`t` `t2`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "id2_valid_to",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 261800,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 0.3333,
                      "access_type": "scan",
                      "resulting_rows": 87258,
                      "cost": 26332,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 87258,
                "cost_for_plan": 26332,
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`t` `t2`"
                    ],
                    "table": "`t`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "id2_valid_to",
                          "rows": 268.79,
                          "cost": 2.38e6,
                          "chosen": true
                        },
                        {
                          "access_type": "scan",
                          "chosen": false,
                          "cause": "covering_index_better_than_full_scan"
                        }
                      ]
                    },
                    "condition_filtering_pct": 100,
                    "rows_for_plan": 2.35e7,
                    "cost_for_plan": 2.4e6,
                    "chosen": true
                  }
                ]
              }
            ]
          },
          {
            "attaching_conditions_to_tables": {
              "original_condition": "((`t`.`id2` = `t2`.`id2`) and (`t2`.`valid_to` >= '9999-12-31'))",
              "attached_conditions_computation": [
              ],
              "attached_conditions_summary": [
                {
                  "table": "`t` `t2`",
                  "attached": "(`t2`.`valid_to` >= '9999-12-31')"
                },
                {
                  "table": "`t`",
                  "attached": "(`t`.`id2` = `t2`.`id2`)"
                }
              ]
            }
          },
          {
            "optimizing_distinct_group_by_order_by": {
            }
          },
          {
            "finalizing_table_conditions": [
              {
                "table": "`t` `t2`",
                "original_table_condition": "(`t2`.`valid_to` >= '9999-12-31')",
                "final_table_condition   ": "(`t2`.`valid_to` >= '9999-12-31')"
              },
              {
                "table": "`t`",
                "original_table_condition": "(`t`.`id2` = `t2`.`id2`)",
                "final_table_condition   ": null
              }
            ]
          },
          {
            "refine_plan": [
              {
                "table": "`t` `t2`"
              },
              {
                "table": "`t`"
              }
            ]
          },
          {
            "considering_tmp_tables": [
            ]
          }
        ]
      }
    },
    {
      "join_execution": {
        "select#": 1,
        "steps": [
        ]
      }
    }
  ]
} |                                 0 |                       0 |

MariaDB 10.4.3

 
MariaDB [test]> select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+----------+
| count(*) |
+----------+
|   262148 |
+----------+
1 row in set (15.226 sec)
 
MariaDB [test]> analyze select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+------+-------------+-------+-------+---------------+--------------+---------+------------+--------+-----------+----------+------------+--------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref        | rows   | r_rows    | filtered | r_filtered | Extra                    |
+------+-------------+-------+-------+---------------+--------------+---------+------------+--------+-----------+----------+------------+--------------------------+
|    1 | SIMPLE      | t     | index | id2_valid_to  | id2_valid_to | 12      | NULL       | 261800 | 262148.00 |   100.00 |     100.00 | Using index              |
|    1 | SIMPLE      | t2    | ref   | id2_valid_to  | id2_valid_to | 4       | test.t.id2 | 138    | 263.07    |   100.00 |       0.38 | Using where; Using index |
+------+-------------+-------+-------+---------------+--------------+---------+------------+--------+-----------+----------+------------+--------------------------+
2 rows in set (14.841 sec)
 
| {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 15141,
    "table": {
      "table_name": "t",
      "access_type": "index",
      "possible_keys": ["id2_valid_to"],
      "key": "id2_valid_to",
      "key_length": "12",
      "used_key_parts": ["id2", "valid_to"],
      "r_loops": 1,
      "rows": 261800,
      "r_rows": 262148,
      "r_total_time_ms": 46.387,
      "filtered": 100,
      "r_filtered": 100,
      "using_index": true
    },
    "table": {
      "table_name": "t2",
      "access_type": "ref",
      "possible_keys": ["id2_valid_to"],
      "key": "id2_valid_to",
      "key_length": "4",
      "used_key_parts": ["id2"],
      "ref": ["test.t.id2"],
      "r_loops": 262148,
      "rows": 138,
      "r_rows": 263.07,
      "r_total_time_ms": 10883,
      "filtered": 100,
      "r_filtered": 0.3801,
      "attached_condition": "t2.valid_to >= '9999-12-31'",
      "using_index": true
    }
  }
} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (15.139 sec)
 
| select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31' | {
  "steps": [
    {
      "join_preparation": {
        "select_id": 1,
        "steps": [
          {
            "expanded_query": "select count(0) AS `count(*)` from (t join t t2 on(t.id2 = t2.id2 and t2.valid_to >= '9999-12-31'))"
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select_id": 1,
        "steps": [
          {
            "condition_processing": {
              "condition": "WHERE",
              "original_condition": "t.id2 = t2.id2 and t2.valid_to >= '9999-12-31'",
              "steps": [
                {
                  "transformation": "equality_propagation",
                  "resulting_condition": "t2.valid_to >= '9999-12-31' and multiple equal(t.id2, t2.id2)"
                },
                {
                  "transformation": "constant_propagation",
                  "resulting_condition": "t2.valid_to >= '9999-12-31' and multiple equal(t.id2, t2.id2)"
                },
                {
                  "transformation": "trivial_condition_removal",
                  "resulting_condition": "t2.valid_to >= '9999-12-31' and multiple equal(t.id2, t2.id2)"
                }
              ]
            }
          },
          {
            "table_dependencies": [
              {
                "table": "t",
                "row_may_be_null": false,
                "map_bit": 0,
                "depends_on_map_bits": []
              },
              {
                "table": "t2",
                "row_may_be_null": false,
                "map_bit": 1,
                "depends_on_map_bits": []
              }
            ]
          },
          {
            "ref_optimizer_key_uses": [
              {
                "table": "t",
                "field": "id2",
                "equals": "t2.id2",
                "null_rejecting": false
              },
              {
                "table": "t2",
                "field": "id2",
                "equals": "t.id2",
                "null_rejecting": false
              }
            ]
          },
          {
            "rows_estimation": [
              {
                "table": "t",
                "table_scan": {
                  "rows": 261800,
                  "cost": 609
                }
              },
              {
                "selectivity_for_indexes": [],
                "selectivity_for_columns": [],
                "cond_selectivity": 1
              },
              {
                "table": "t2",
                "table_scan": {
                  "rows": 261800,
                  "cost": 609
                }
              }
            ]
          },
          {
            "considered_execution_plans": [
              {
                "plan_prefix": [],
                "table": "t",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "scan",
                      "resulting_rows": 261800,
                      "cost": 609,
                      "chosen": true
                    }
                  ]
                },
                "rest_of_plan": [
                  {
                    "plan_prefix": ["t"],
                    "table": "t2",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "id2_valid_to",
                          "rows": 138,
                          "cost": 268856,
                          "chosen": true
                        },
                        {
                          "type": "scan",
                          "chosen": false,
                          "cause": "cost"
                        }
                      ]
                    }
                  }
                ]
              },
              {
                "plan_prefix": [],
                "table": "t2",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "scan",
                      "resulting_rows": 261800,
                      "cost": 609,
                      "chosen": true
                    }
                  ]
                },
                "rest_of_plan": [
                  {
                    "plan_prefix": ["t2"],
                    "table": "t",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "id2_valid_to",
                          "rows": 138,
                          "cost": 268856,
                          "chosen": true
                        },
                        {
                          "type": "scan",
                          "chosen": false,
                          "cause": "cost"
                        }
                      ]
                    },
                    "pruned_by_cost": true
                  }
                ]
              }
            ]
          },
          {
            "attaching_conditions_to_tables": {
              "original_condition": "t2.id2 = t.id2 and t2.valid_to >= '9999-12-31'",
              "attached_conditions_computation": [],
              "attached_conditions_summary": [
                {
                  "table": "t",
                  "attached": null
                },
                {
                  "table": "t2",
                  "attached": "t2.valid_to >= '9999-12-31'"
                }
              ]
            }
          }
        ]
      }
    },
    {
      "join_execution": {
        "select_id": 1,
        "steps": []
      }
    }
  ]
} |                                 0 |                       0 |

Comment by Arnaud Adant [ 2019-04-29 ]

The join order is different in MariaDB : Forcing it shows the issue :

 
MariaDB [test]> select count(*) from t straight_join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+----------+
| count(*) |
+----------+
|   262148 |
+----------+
1 row in set (18.00 sec)
 
MariaDB [test]> show session status like 'Handler%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Handler_commit             | 1        |
| Handler_delete             | 0        |
| Handler_discover           | 0        |
| Handler_external_lock      | 0        |
| Handler_icp_attempts       | 0        |
| Handler_icp_match          | 0        |
| Handler_mrr_init           | 0        |
| Handler_mrr_key_refills    | 0        |
| Handler_mrr_rowid_refills  | 0        |
| Handler_prepare            | 0        |
| Handler_read_first         | 1        |
| Handler_read_key           | 262148   |
| Handler_read_last          | 0        |
| Handler_read_next          | 69208298 |
| Handler_read_prev          | 0        |
| Handler_read_retry         | 0        |
| Handler_read_rnd           | 0        |
| Handler_read_rnd_deleted   | 0        |
| Handler_read_rnd_next      | 0        |
| Handler_rollback           | 0        |
| Handler_savepoint          | 0        |
| Handler_savepoint_rollback | 0        |
| Handler_tmp_update         | 0        |
| Handler_tmp_write          | 0        |
| Handler_update             | 0        |
| Handler_write              | 0        |
+----------------------------+----------+
26 rows in set (0.07 sec)
 
If the same join order is forced in MySQL 8.0, 
 
MySQL [test]> select count(*) from t straight_join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+----------+
| count(*) |
+----------+
|   262148 |
+----------+
1 row in set (12.92 sec)
 
MySQL [test]> show session status like 'Handler%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Handler_commit             | 3        |
| Handler_delete             | 0        |
| Handler_discover           | 0        |
| Handler_external_lock      | 10       |
| Handler_mrr_init           | 0        |
| Handler_prepare            | 0        |
| Handler_read_first         | 1        |
| Handler_read_key           | 262150   |
| Handler_read_last          | 0        |
| Handler_read_next          | 69212496 |
| Handler_read_prev          | 0        |
| Handler_read_rnd           | 0        |
| Handler_read_rnd_next      | 0        |
| Handler_rollback           | 0        |
| Handler_savepoint          | 0        |
| Handler_savepoint_rollback | 0        |
| Handler_update             | 0        |
| Handler_write              | 0        |
+----------------------------+----------+
18 rows in set (0.00 sec)

So it is either an upstream bug or something that a join with a b-tree can not do ...

Comment by Igor Babaev [ 2019-04-29 ]

Hi Arnaud!
You query requires the join order t2,t for a fast execution. To choose this join order the optimizer need statistics on non-index columns and non-major index columns. Starting from 10.0 MariaDB can use this statistics and stores it in special system tables (mysql.*_stats).
To collect statistical data on table t you have to execute the following command:

analyze table t persistent for all;

or alternatevely:

set @@use_stat_tables=preferably;
analyze table t;

To use data from statistical tables you have to use the following settings:

set @@use_stat_tables=preferably;
set @@optimizer_use_condition_selectivity=3;

With this settings you'll have the following execution plan for you query:

MariaDB [test]> explain extended select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref         | rows   | filtered | Extra                    |
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
|    1 | SIMPLE      | t2    | index | id2_valid_to  | id2_valid_to | 12      | NULL        | 261800 |     0.39 | Using where; Using index |
|    1 | SIMPLE      | t     | ref   | id2_valid_to  | id2_valid_to | 4       | test.t2.id2 |    127 |   100.00 | Using index              |
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+

This plan is as good as of MySQL 8.0. I compared the execution times: they are practically the same.
If you use histograms by adding the settings:

set @@histogram_size=255;
set @@optimizer_use_condition_selectivity=4;

you'll have even a better explain:

MariaDB [test]> explain extended select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref         | rows   | filtered | Extra                    |
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+
|    1 | SIMPLE      | t2    | index | id2_valid_to  | id2_valid_to | 12      | NULL        | 261800 |     0.39 | Using where; Using index |
|    1 | SIMPLE      | t     | ref   | id2_valid_to  | id2_valid_to | 4       | test.t2.id2 |    127 |   100.00 | Using index              |
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+----------+--------------------------+

(Compare 0.39 in the column filtered with

MariaDB [test]> select (select count(*) from t where valid_to >='9999-12-31') / (select count(*) from t) * 100;
+-----------------------------------------------------------------------------------------+
| (select count(*) from t where valid_to >='9999-12-31') / (select count(*) from t) * 100 |
+-----------------------------------------------------------------------------------------+
|                                                                                  0.3818 |
+-----------------------------------------------------------------------------------------+
)

FYI: 10.4 has the following default settings:
@@use_stat_tables=preferably_for_queries;
set @@optimizer_use_condition_selectivity=4;
set @@histogram_size=255;

If you are satisfied with the answer I will close this ticket.

Comment by Arnaud Adant [ 2019-04-29 ]

Indeed this solves the problem even in MariaDB 10.1 :

analyze table t persistent for all;

In 10.4, you still need to run the above statement. It would be nice if analyze table also computes engine independent stats by default.

Regarding my other question, it looks like a documented feature of the ref method :

https://dev.mysql.com/doc/refman/8.0/en/explain-output.html#jointype_ref

when joining with this condition :

t.id2 = t2.id2 and t2.valid_to >='9999-12-31'

it can only use ref with the first key part (id2) due to the inequality

while

t.id2 = t2.id2 and t2.valid_to ='9999-12-31'

can use the first and second key parts thanks to the 2 = comparisons.

Comment by Arnaud Adant [ 2019-04-29 ]

Regarding the use of = or <=> with ref, would not it be possible to use a B-tree to evaluate this for each row of t ?

(t2.id2 = const1 and t2.valid_to >= const2) , as this is perfectly feasible with a range ...

Comment by Igor Babaev [ 2019-04-30 ]

Arnaud,
Yes it's possible and we have a task for it (see MDEV-19153: Support ref+range access method).

Comment by Arnaud Adant [ 2019-04-30 ]

Thanks Igor. I think we can close this bug.

Comment by Arnaud Adant [ 2019-04-30 ]

See also https://bugs.mysql.com/bug.php?id=95200

Comment by Vicențiu Ciorbaru [ 2019-04-30 ]

For this particular test case is would seem that SINGLE precision histograms provide a better estimate for the filtered clause than DOUBLE precision histograms, however it is a simple coincidence because:
The total number of unique values in id2 column (1001) is 0.38% of the total number of rows inserted which matches closely with a histogram with 254 buckets, where each bucket holds roughly 0.39% of the total rows in the set.

A histogram with 127 buckets will have each bucket hold double the number of rows.

Next I will describe how we arrive to the selectivity values:

Histogram::range_selectivity ends up being called with the parameters min_pos=1 and max_pos=1

Thread 36 "mysqld" hit Breakpoint 2, Histogram::range_selectivity (this=0x7ffe2007f9b8, min_pos=1, max_pos=1)
 
// Relevant code executed below:
 
   │270         double bucket_sel= 1.0/(get_width() + 1);
   │271         uint min= find_bucket(min_pos, TRUE);
   │272         uint max= find_bucket(max_pos, FALSE);
   │273         sel= bucket_sel * (max - min + 1);
   │274         return sel; 

With these parameters, and the values in the the "found" buckets (min and max) are always the last bucket in the set, which means that the selectivity is always bucket_sel.

bucket_sel is based off the width of a bucket. The width of a bucket means the percentage of rows from the table that fit into each bucket.

It is very easy to "trick" this piece of code into outputting a bad selectivity value (compared to what is the true selectivity of the condition). Just double the number of rows with valid_to='9999-12-31', by running this set of statements before ANALYZE TABLE t PERSISTENT FOR ALL in the test case and you'll get:

create temporary table t_min as select min(id) id from t group by id2;
update t, t_min set t.valid_to='9999-12-31' where t.id = t_min.id;
commit;

Now we have both the "minimum ids" and the "maximum ids" with valid_to set to '9999-12-31'.

Compute SIMPLE_PREC_HB histograms and run the Select query again:

analyze select count(*) from t join t t2 on t.id2 = t2.id2 and t2.valid_to >='9999-12-31';
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+-----------+----------+------------+--------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref         | rows   | r_rows    | filtered | r_filtered | Extra                    |
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+-----------+----------+------------+--------------------------+
|    1 | SIMPLE      | t2    | index | id2_valid_to  | id2_valid_to | 12      | NULL        | 262152 | 262152.00 |     0.39 |       0.76 | Using where; Using index |
|    1 | SIMPLE      | t     | ref   | id2_valid_to  | id2_valid_to | 4       | test.t2.id2 | 261    | 261.89    |   100.00 |     100.00 | Using index              |
+------+-------------+-------+-------+---------------+--------------+---------+-------------+--------+-----------+----------+------------+--------------------------+

Another way to look at a SINGLE_PREC histogram:

The interval between min_value and max_value for the column we have collected a histogram is split into 255 segments. We can then compute how many rows fall into each segment. With the current implementation the smallest value we can get when it comes to selectivity estimates for a range condition is equal to 0.39% (The selectivity corresponding to 1 bucket out of 255 buckets).

For DOUBLE_PREC histograms, the smallest selectivity value we can get for a range condition is double that of SINGLE_PREC histograms as we can only hold half the number of buckets.

However, the advantage of DOUBLE_PREC histograms is that the min_value, max_value interval is split into 255 * 255 segments. This means that for range conditions spanning less than 1/255th of the range [min_value, max_value], the histogram will better convey the true percentage of rows that match said range.

igor let me know if this Analysis makes sense to you and if it helps.

Comment by Vicențiu Ciorbaru [ 2019-04-30 ]

To show the difference between single precision and double precision histograms I've added the following 4 graphs that plot the distribution based on the data. For all pictures the X axis represents the range min-value, max value. The Y axis absolute values are not relevant, the height difference shown by the line is.



The first 2 pictures are before updating the initial t table with 9999 date values (basically no long tail).
Immediately we can notice that double prec histograms show variations in density where we have more buckets. This is not the case for single precision histograms due to their limited range.

Next we add the long tail of 1001 elements with 9999 date values.


Here we can see an even worse result! We are missing a lot of information in the SINGLE PREC version with values between 2016 and 2019, where most values are. Again, this is because of limited range for SINGLE PREC histograms.

Comment by Igor Babaev [ 2019-05-01 ]

This shows me that we badly need linear approximation when calculating selectivity. What's the point to have a good histogram and a poor calculation of selectivity? Optimizer uses selectivity after all.

Comment by Igor Babaev [ 2019-05-15 ]

This is not a bug: see my explanation in comments

Generated at Thu Feb 08 08:50:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.