Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23704

Optimizer does not use semijoin optimization for some WHERE (pk1, pk2, pk3) IN ((1,2,3), ...) queries

    XMLWordPrintable

Details

    Description

      For query like this:

      select * from t where (PK1, PK2, PK3) in ((v1,v2,v3), (v4,v5,v6), ...)
      

      with long enough list of tuples for multi-column PK values there are 3 possible plans presented by quotes from ANALYZE FORMAT=JSON outputs):

      1. Range scan, can be forced in some cases with FORCE INDEX(PRIMARY):

      {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 19.184,
          "table": {
            "table_name": "t",
            "access_type": "range",
            "possible_keys": ["PRIMARY"],
            "key": "PRIMARY",
            "key_length": "74",
            "used_key_parts": [
              "PK1",
              "PK2",
              "PK3"
            ],
            "r_loops": 1,
            "rows": 1000,
            "r_rows": 1000,
            "r_total_time_ms": 15.165,
            "filtered": 100,
            "r_filtered": 100,
            "attached_condition": "(t.PK1,t.PK2,t.PK3) in (<cache>((349,'*********','01')),<cache>((349,'*********','01')),)"
          }
        }
      }
      

      2. For the same IN list in other environment we end up with full table scan that is very slow:

      {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 51060,
          "table": {
            "table_name": "t",
            "access_type": "ALL",
            "possible_keys": [
              "PRIMARY",
               ...
            ],
            "r_loops": 1,
            "rows": 7957643,
            "r_rows": 7.96e6,
            "r_total_time_ms": 48324,
            "filtered": 100,
            "r_filtered": 0.0126,
            "attached_condition": "(t.PK1,t.PK2,t.PK3) in (<cache>((***,'*******','01')),<cache>((***,'*******','01')),.......)"
          }
        }
      }
      

      3. Finally in other environment with very similar data for the same IN list we end up with semijoin optimization applied:

      {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 11.565,
          "duplicates_removal": {
            "table": {
              "table_name": "<derived3>",
              "access_type": "ALL",
              "r_loops": 1,
              "rows": 1000,
              "r_rows": 1000,
              "r_total_time_ms": 0.1065,
              "filtered": 100,
              "r_filtered": 100,
              "materialized": {
                "query_block": {
                  "union_result": {
                    "table_name": "<unit3>",
                    "access_type": "ALL",
                    "r_loops": 0,
                    "r_rows": null,
                    "query_specifications": [
                      {
                        "query_block": {
                          "select_id": 3,
                          "table": {
                            "message": "No tables used"
                          }
                        }
                      }
                    ]
                  }
                }
              }
            },
            "table": {
              "table_name": "t",
              "access_type": "eq_ref",
              "possible_keys": [
                "PRIMARY",
              ...
              ],
              "key": "PRIMARY",
              "key_length": "74",
              "used_key_parts": [
                "PK1",
                "PK2",
                "PK3"
              ],
              "ref": ["tvc_0._col_1", "func", "func"],
              "r_loops": 1000,
              "rows": 1,
              "r_rows": 0.998,
              "r_total_time_ms": 7.7333,
              "filtered": 100,
              "r_filtered": 100,
              "attached_condition": "t.PK2 = convert(tvc_0._col_2 using utf8) and t.PK3 = convert(tvc_0._col_3 using utf8)"
            }
          }
        }
      }
      

      This plan is the fastest even comparing to range join, but there is no way to force it.

      So, I think there is a bug when semijoin optimization is not used even when "range" is extended to full table scan as an alternative. I'd also like to have a way to force the best plan.

      Attachments

        1. a2.test
          21 kB
          Sergei Petrunia
        2. screenshot-1.png
          29 kB
          Sergei Petrunia

        Issue Links

          Activity

            People

              psergei Sergei Petrunia
              valerii Valerii Kravchuk
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.