Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-16374

filtered shows 0 for materilization scan for a semi join, which makes optimizer always pick materialization scan over materialization lookup

Details

    Description

      The dataset:

      create table t0(a int);
      insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
      create table t1 (a int, b int);
      insert into t1 select A.a + B.a* 10 + C.a * 100,  A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;
      

      MariaDB [test]> set @@optimizer_use_condition_selectivity=2;
      Query OK, 0 rows affected (0.000 sec)
       
      MariaDB [test]> explain format=json select * from t1 where a in (select max(a) from t1 group by b);
       
      EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | {
        "query_block": {
          "select_id": 1,
          "table": {
            "table_name": "<subquery2>",
            "access_type": "ALL",
            "possible_keys": ["distinct_key"],
            "rows": 1000,
            "filtered": 0,
            "materialized": {
              "unique": 1,
              "query_block": {
                "select_id": 2,
                "temporary_table": {
                  "table": {
                    "table_name": "t1",
                    "access_type": "ALL",
                    "rows": 1000,
                    "filtered": 100
                  }
                }
              }
            }
          },
          "block-nl-join": {
            "table": {
              "table_name": "t1",
              "access_type": "ALL",
              "rows": 1000,
              "filtered": 100
            },
            "buffer_type": "flat",
            "buffer_size": "256Kb",
            "join_type": "BNL",
            "attached_condition": "t1.a = `<subquery2>`.`max(a)`"
          }
        }
      } |
      
      

      If you look closely filtered=0 for the <subquery2> which means the optimizer predicted that no rows would be read. Due to this , the cost for such a plan would always be less than any other plan.

      Attachments

        Issue Links

          Activity

            Introduced by this commit

            commit 06ef0bdad254fbb2e826fabcd050134df2d11e75
            Author:    Igor Babaev <igor@askmonty.org>  Tue Mar 18 22:30:50 2014
            Committer:    Igor Babaev <igor@askmonty.org>  Tue Mar 18 22:30:50 2014
             
            Fixed bug mdev-5191.
            Corrected cost estimates when a join buffer is used and the optimizer is requested
            to use condition selectivities.
            

            varun Varun Gupta (Inactive) added a comment - Introduced by this commit commit 06ef0bdad254fbb2e826fabcd050134df2d11e75 Author: Igor Babaev <igor@askmonty.org> Tue Mar 18 22:30:50 2014 Committer: Igor Babaev <igor@askmonty.org> Tue Mar 18 22:30:50 2014   Fixed bug mdev-5191. Corrected cost estimates when a join buffer is used and the optimizer is requested to use condition selectivities.

            Running the same query by using a derived table, we get the correct plan

            explain format=json
            select * from t1 ,(select max(a) as p from t1 group by b)q where t1.a=q.p ;
            EXPLAIN
            {
              "query_block": {
                "select_id": 1,
                "table": {
                  "table_name": "t1",
                  "access_type": "ALL",
                  "rows": 1000,
                  "filtered": 100,
                  "attached_condition": "t1.a is not null"
                },
                "table": {
                  "table_name": "<derived2>",
                  "access_type": "ref",
                  "possible_keys": ["key0"],
                  "key": "key0",
                  "key_length": "5",
                  "used_key_parts": ["p"],
                  "ref": ["test.t1.a"],
                  "rows": 10,
                  "filtered": 100,
                  "materialized": {
                    "query_block": {
                      "select_id": 2,
                      "filesort": {
                        "sort_key": "t1.b",
                        "temporary_table": {
                          "table": {
                            "table_name": "t1",
                            "access_type": "ALL",
                            "rows": 1000,
                            "filtered": 100
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
            

            varun Varun Gupta (Inactive) added a comment - Running the same query by using a derived table, we get the correct plan explain format=json select * from t1 ,(select max(a) as p from t1 group by b)q where t1.a=q.p ; EXPLAIN { "query_block": { "select_id": 1, "table": { "table_name": "t1", "access_type": "ALL", "rows": 1000, "filtered": 100, "attached_condition": "t1.a is not null" }, "table": { "table_name": "<derived2>", "access_type": "ref", "possible_keys": ["key0"], "key": "key0", "key_length": "5", "used_key_parts": ["p"], "ref": ["test.t1.a"], "rows": 10, "filtered": 100, "materialized": { "query_block": { "select_id": 2, "filesort": { "sort_key": "t1.b", "temporary_table": { "table": { "table_name": "t1", "access_type": "ALL", "rows": 1000, "filtered": 100 } } } } } } } }
            varun Varun Gupta (Inactive) added a comment - Patch: http://lists.askmonty.org/pipermail/commits/2018-June/012591.html

            There is this code in make_join_statistics() which fetches the number of rows in the materialized table:

                /* Approximate found rows and time to read them */
                if (s->table->is_filled_at_execution())
                {
                  get_delayed_table_estimates(s->table, &s->records, &s->read_time,
                                              &s->startup_cost);
                  s->found_records= s->records;
                  table->quick_condition_rows=s->records;
                }
                else
                {
                   s->scan_time();
                }
            

            But then, the code in matching_candidates_in_table() uses table->stat_records():

              if (use_cond_selectivity > 1)
              {
                TABLE *table= s->table;
                double sel= table->cond_selectivity;
                double table_records= (double)table->stat_records();
                dbl_records= table_records * sel;
                return dbl_records;
            

            Maybe it should it use s->records or s->found_records instead?

            psergei Sergei Petrunia added a comment - There is this code in make_join_statistics() which fetches the number of rows in the materialized table: /* Approximate found rows and time to read them */ if (s->table->is_filled_at_execution()) { get_delayed_table_estimates(s->table, &s->records, &s->read_time, &s->startup_cost); s->found_records= s->records; table->quick_condition_rows=s->records; } else { s->scan_time(); } But then, the code in matching_candidates_in_table() uses table->stat_records(): if (use_cond_selectivity > 1) { TABLE *table= s->table; double sel= table->cond_selectivity; double table_records= ( double )table->stat_records(); dbl_records= table_records * sel; return dbl_records; Maybe it should it use s->records or s->found_records instead?

            cond_selectivity is selectivity of condition w.r.t. the total number of rows in the table (that is, if the table has a range access, the cond_selectivity "includes" the selectivity of the range access). That is, the code in matching_candidates_in_table needs the number of rows in the table.

            Looking at members in JOIN_TAB:

              /* Either #rows in the table or 1 for const table.  */
              ha_rows	records;
            

            (Also one can see that the code in best_access_path and elsewhere assumes s->records is the total number of rows in the table)

            psergei Sergei Petrunia added a comment - cond_selectivity is selectivity of condition w.r.t. the total number of rows in the table (that is, if the table has a range access, the cond_selectivity "includes" the selectivity of the range access). That is, the code in matching_candidates_in_table needs the number of rows in the table. Looking at members in JOIN_TAB: /* Either #rows in the table or 1 for const table. */ ha_rows records; (Also one can see that the code in best_access_path and elsewhere assumes s->records is the total number of rows in the table)

            Yesterday on the optimizer call agreed that

            • The patch should go into 10.0
            • On the question of whether the rows estimate should be stored just in JOIN_TAB::records or in table->stats.records too, decided that It should be put into table->file.stats.records (and the existing code will copy the value from there into JOIN_TAB::records). That is, Varun's patch is ok.
            psergei Sergei Petrunia added a comment - Yesterday on the optimizer call agreed that The patch should go into 10.0 On the question of whether the rows estimate should be stored just in JOIN_TAB::records or in table->stats.records too, decided that It should be put into table->file.stats.records (and the existing code will copy the value from there into JOIN_TAB::records). That is, Varun's patch is ok.

            Pushed to 10.0

            varun Varun Gupta (Inactive) added a comment - Pushed to 10.0

            People

              varun Varun Gupta (Inactive)
              varun Varun Gupta (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.