[MDEV-30898] Optimizer ref access regression Created: 2023-03-22  Updated: 2023-11-28

Status: Open
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.9.5
Fix Version/s: 10.11

Type: Bug Priority: Minor
Reporter: VAROQUI Stephane Assignee: Rex Johnston
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Linux version 4.9.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02)
ldd (Debian GLIBC 2.31-13+deb11u5) 2.31
Jemalloc



 Description   

Following migration from 10.6.x to 10.9.5
At some point different type of queries start taking hours to execute while they should be fast what ever theorical execution plan was chosen.

We have been observing massive amount of innodb row reads

One query is more relevant to dig into the issue because it involved only 2 tables one small (100K) and one large (300M).

HERE IS THE BAD PLAN that run for 3 hours before being killed

explain  SELECT     count(*) FROM     `data_entity` AS `e` INNER JOIN `data_value` `sort` ON `e`.`uri` = `sort`.`entity_uri`                         AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01" WHERE     `e`.`type_uri` = "Town"     AND `e`.`status` = 1;
+------+-------------+-------+------+----------------------------------+--------------------------+---------+-----------------------------------------------+------+------------------------------------+
| id   | select_type | table | type | possible_keys                    | key                      | key_len | ref                                           | rows | Extra                              |
+------+-------------+-------+------+----------------------------------+--------------------------+---------+-----------------------------------------------+------+------------------------------------+
|    1 | SIMPLE      | e     | ref  | PRIMARY,data_entity_type_uri_idx | data_entity_type_uri_idx | 767     | const                                         | 1    | Using index condition; Using where |
|    1 | SIMPLE      | sort  | ref  | entity_uri,dpne,pne,pve,ped      | ped                      | 1538    | const,directory_figaro_immobilier.e.uri,const | 1    | Using where; Using index           |
+------+-------------+-------+------+----------------------------------+--------------------------+---------+-----------------------------------------------+------+------------------------------------+

Note there is a ref access on small table using data_entity_type_uri_idx and join the big table using ped index with an other ref .

As one can see there is issue with statistics reporting one row for every part of the plan

Now we are fixing the statistics issue

ANALYZE TABLE data_entity PERSISTENT FOR ALL;

explain  SELECT     count(*) FROM     `data_entity` AS `e` INNER JOIN `data_value` `sort` ON `e`.`uri` = `sort`.`entity_uri`                         AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01" WHERE     `e`.`type_uri` = "Town"     AND `e`.`status` = 1;
+------+-------------+-------+------+----------------------------------+------+---------+-----------------------------------------------+-------+--------------------------+
| id   | select_type | table | type | possible_keys                    | key  | key_len | ref                                           | rows  | Extra                    |
+------+-------------+-------+------+----------------------------------+------+---------+-----------------------------------------------+-------+--------------------------+
|    1 | SIMPLE      | e     | ALL  | PRIMARY,data_entity_type_uri_idx | NULL | NULL    | NULL                                          | 98035 | Using where              |
|    1 | SIMPLE      | sort  | ref  | entity_uri,dpne,pne,pve,ped      | ped  | 1538    | const,directory_figaro_immobilier.e.uri,const | 1     | Using where; Using index |
+------+-------------+-------+------+----------------------------------+------+---------+-----------------------------------------------+-------+--------------------------+
2 rows in set (0.001 sec)
 
SELECT
    ->     `sort`.`value` AS `sortingValue`,
    ->     uri,
    ->     type_uri,
    ->     longitude,
    ->     latitude,
    ->     status
    -> FROM
    ->     `data_entity` AS `e`
    -> INNER JOIN `data_value` `sort` ON `e`.`uri` = `sort`.`entity_uri`
    ->                         AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01"
    -> WHERE
    ->     `e`.`type_uri` = "Town"
    ->     AND `e`.`status` = 1
    -> ORDER BY
    ->     `sort`.`numeric_value` DESC
    -> LIMIT 5;
+--------------+-------------+----------+-----------+----------+--------+
| sortingValue | uri         | type_uri | longitude | latitude | status |
+--------------+-------------+----------+-----------+----------+--------+
| 2243739      | ville-75056 | Town     |   2.33828 |  48.8589 |      1 |
| 866644       | ville-13055 | Town     |   5.37479 |  43.2977 |      1 |
| 514707       | ville-69123 | Town     |   4.83287 |  45.7612 |      1 |
| 474246       | ville-31555 | Town     |   1.43167 |  43.5964 |      1 |
| 347636       | ville-06088 | Town     |   7.23827 |   43.712 |      1 |
+--------------+-------------+----------+-----------+----------+--------+
5 rows in set (0.370 sec)

From here we know we have a serious bug as the difference to FULL SCAN vs REF on 100K records can not lead to 3 hours difference in execution time.

But more problematic is that now that EITS statistics tables are in place i could not reproduce the issue

explain SELECT     `sort`.`value` AS `sortingValue`,     uri,     type_uri,     longitude,     latitude,     status FROM     `data_entity` AS `e` force index(data_entity_type_uri_idx) STRAIGHT_JOIN `data_value` `sort` ON `e`.`uri` = `sort`.`entity_uri`                         AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01" WHERE     `e`.`type_uri` = "Town"     AND `e`.`status` = 1 ORDER BY     `sort`.`numeric_value` DESC LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: ref
possible_keys: data_entity_type_uri_idx
          key: data_entity_type_uri_idx
      key_len: 767
          ref: const
         rows: 49463
        Extra: Using index condition; Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: sort
         type: ref
possible_keys: entity_uri,dpne,pne,pve,ped
          key: ped
      key_len: 1538
          ref: const,directory_figaro_immobilier.e.uri,const
         rows: 1
        Extra: Using index condition
2 rows in set (0.001 sec)
 
SELECT     `sort`.`value` AS `sortingValue`,     uri,     type_uri,     longitude,     latitude,     status FROM     `data_entity` AS `e` force index(data_entity_type_uri_idx) STRAIGHT_JOIN `data_value` `sort` ON `e`.`uri` = `sort`.`entity_uri`                         AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01" WHERE     `e`.`type_uri`
= "Town"     AND `e`.`status` = 1 ORDER BY     `sort`.`numeric_value` DESC LIMIT 5\G
*************************** 1. row ***************************
sortingValue: 2243739
         uri: ville-75056
    type_uri: Town
   longitude: 2.33828
    latitude: 48.8589
      status: 1
*************************** 2. row ***************************
sortingValue: 866644
         uri: ville-13055
    type_uri: Town
   longitude: 5.37479
    latitude: 43.2977
      status: 1
*************************** 3. row ***************************
sortingValue: 514707
         uri: ville-69123
    type_uri: Town
   longitude: 4.83287
    latitude: 45.7612
      status: 1
*************************** 4. row ***************************
sortingValue: 474246
         uri: ville-31555
    type_uri: Town
   longitude: 1.43167
    latitude: 43.5964
      status: 1
*************************** 5. row ***************************
sortingValue: 347636
         uri: ville-06088
    type_uri: Town
   longitude: 7.23827
    latitude: 43.712
      status: 1
5 rows in set (0.348 sec)
 
delete from index_stats where table_name='data_entity;
delete from column_stats where table_name='data_entity';
delete from table_stats where table_name='data_entity';
Flush tables;
 
 
ANALYZE SELECT     `sort`.`value` AS `sortingValue`,     uri,     type_uri,     longitude,     latitude,     status FROM     `data_entity` AS `e` force index(data_entity_type_uri_idx) STRAIGHT_JOIN `data_value` `sort` ON `e`.`uri` = `sort`.`entity_uri`                         AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01" WHERE     `e`.`type_uri` = "Town"     AND `e`.`status` = 1 ORDER BY     `sort`.`numeric_value` DESC LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: ref
possible_keys: data_entity_type_uri_idx
          key: data_entity_type_uri_idx
      key_len: 767
          ref: const
         rows: 48621
       r_rows: 34955.00
     filtered: 100.00
   r_filtered: 100.00
        Extra: Using index condition; Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: sort
         type: ref
possible_keys: entity_uri,dpne,pne,pve,ped
          key: ped
      key_len: 1538
          ref: const,directory_figaro_immobilier.e.uri,const
         rows: 1
       r_rows: 1.00
     filtered: 4.48
   r_filtered: 100.00
        Extra: Using index condition
2 rows in set (0.795 sec)

rows estimate on table e never get back to 1 as in the hurting plan and the query is alway fast now a RANGE is prefered over the REF without forcing and with no stats tables informations.

Glad to give more info if needed or spend. some time to try to reproduce if you get more ideas on what can lead to such issue .



 Comments   
Comment by VAROQUI Stephane [ 2023-03-22 ]

The only thing we tried before ANALYZE TABLE was to change the SESSION optimizer_use_condition_selectivity to 1 but this did not help with the speed

Comment by VAROQUI Stephane [ 2023-03-22 ]

One second typical status during trouble shooting with single buggy query and replication

| Handler_icp_attempts                                   | 56                                               |
| Handler_icp_match                                      | 56                                               |
| Handler_read_key                                       | 53                                               |
| Handler_read_next                                      | 1886848                                          |
| Handler_read_rnd_next                                  | 4485                                             |
| Handler_tmp_write                                      | 4338                                             |
| Innodb_background_log_sync                             | 1                                                |
| Innodb_buffer_pool_read_requests                       | 1896098                                          |
| Innodb_master_thread_idle_loops                        | 1                                                |
| Innodb_rows_read                                       | 1886787                                          |
| Innodb_secondary_index_triggered_cluster_reads         | 56                                               |

Comment by VAROQUI Stephane [ 2023-03-22 ]

show explain on never ending query

MariaDB [mysql]> show explain for 6750;
+------+-------------+-------+------+----------------------------------+--------------------------+---------+-------------+------+------------------------------------+
| id   | select_type | table | type | possible_keys                    | key                      | key_len | ref         | rows | Extra                              |
+------+-------------+-------+------+----------------------------------+--------------------------+---------+-------------+------+------------------------------------+
|    1 | SIMPLE      | e     | ref  | PRIMARY,data_entity_type_uri_idx | data_entity_type_uri_idx | 767     | const       |    1 | Using index condition; Using where |
|    1 | SIMPLE      | sort  | ref  | entity_uri,dpne,pne,pve,ped      | dpne                     | 771     | const,const |    1 | Using where; Using index           |
+------+-------------+-------+------+----------------------------------+--------------------------+---------+-------------+------+------------------------------------+

Comment by VAROQUI Stephane [ 2023-03-22 ]

Good new i'm able to reproduce the slow query by forcing index dpne on the big table
SELECT `sort`.`value` AS `sortingValue`, uri, type_uri, longitude, latitude, status FROM `data_entity` AS `e` force index(data_entity_type_uri_idx) STRAIGHT_JOIN `data_value` `sort` force index(dpne) ON `e`.`uri` = `sort`.`entity_uri` AND `sort`.`predicate_uri` = "population" and `sort`.`date` = "2014-01-01" WHERE
`e`.`type_uri` = "Town" AND `e`.`status` = 1 ORDER BY `sort`.`numeric_value` DESC LIMIT 5\G

Comment by VAROQUI Stephane [ 2023-03-22 ]

Now what is very suspect is i'm sure i was able to fix the slowness of the query by only taking stats on the small table that means access type ref on first part is influencing the choice of second part index

Comment by VAROQUI Stephane [ 2023-03-22 ]

From term history i can not prove the ped index plan was slow only the dpne

| data_value |          1 | dpne       |            1 | date          | A         |         136 |     NULL | NULL   | YES  | BTREE      |         |               | NO      |
| data_value |          1 | dpne       |            2 | predicate_uri | A         |       30832 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| data_value |          1 | dpne       |            3 | numeric_value | A         |   183900529 |     NULL | NULL   | YES  | BTREE      |         |               | NO      |
| data_value |          1 | dpne       |            4 | entity_uri    | A         |   296098242 |     NULL | NULL   |      | BTREE      |         |               | NO      |

vs supposed fast

  
| data_value |          1 | ped        |            1 | predicate_uri | A         |         494 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| data_value |          1 | ped        |            2 | entity_uri    | A         |    12730480 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| data_value |          1 | ped        |            3 | date          | A         |   296068635 |     NULL | NULL   | YES  | BTREE      |         |               | NO      |

I can understand now why it was slow, bug report is less worth than expected, only that without accurate statistics the optimizer was not able to flavor ped (predicate_uri,entity_uri,date) over dpne (date,predicate_uri) this is stange as ped fully cover dpne with one extra const

Comment by VAROQUI Stephane [ 2023-03-22 ]

From upper view the position of covering numeric_value just before entity_uri on dpme is breaking the join predicate and extra icp should not be trusted so that CONST, CONST, *, CONST should always lose over CONST,CONST ,CONST without stats

Generated at Thu Feb 08 10:19:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.