[MDEV-19134] EXISTS() slower if ORDER BY is defined Created: 2019-04-02  Updated: 2019-05-19  Resolved: 2019-05-16

Status: Closed
Project: MariaDB Server
Component/s: Data Manipulation - Subquery, Optimizer
Affects Version/s: 10.1.38, 10.3.13
Fix Version/s: 10.4.5

Type: Bug Priority: Major
Reporter: David Rodrigues Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: performance


 Description   

In some specific cases, EXISTS() will slowdown if ORDER BY is defined in the subquery. Why is the reason to apply an ORDER BY to EXISTS()? None. But Laravel does that and I can't control it.

The question is: ORDER BY inside of an EXISTS() will slowdown in a specific case. I think that the optimizer is not ignoring that.

Setup:

CREATE TABLE IF NOT EXISTS `table_a` (
	`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
	PRIMARY KEY (`id`)
)
ENGINE=InnoDB;
 
CREATE TABLE IF NOT EXISTS `table_b` (
	`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
	`index_a` INT(10) UNSIGNED NULL DEFAULT NULL,
	`index_b` INT(10) UNSIGNED NULL DEFAULT NULL,
	`index_c` INT(10) UNSIGNED NULL DEFAULT NULL,
	`parameter` INT(10) UNSIGNED NULL DEFAULT NULL,
	PRIMARY KEY (`id`),
	INDEX `index_a_index_b` (`index_a`, `index_b`),
	INDEX `index_c` (`index_c`),
	INDEX `parameter` (`parameter`)
)
ENGINE=InnoDB;

Populate:

DELIMITER ;;
FOR i IN 1 .. 1250
DO
  INSERT INTO table_a VALUES ();
  INSERT INTO table_b (index_a, index_b, index_c, parameter) VALUES (123, i, NULL, i);
END FOR;;
DELIMITER ;

Slow query:

SELECT table_a.id
FROM   table_a 
 
WHERE
	EXISTS (
		SELECT *
		FROM table_b
		
		WHERE table_b.index_a = 123 AND
			  table_b.index_b = table_a.id AND
			  table_b.parameter = 123
 
		ORDER BY table_b.index_c
	);

Cleanup:

DROP TABLE IF EXISTS table_a;
DROP TABLE IF EXISTS table_b;

What I realized:

  • Dropping ORDER BY will solve (which is not an option for me, unfortunatelly);
  • Dropping index index_a_index_b or index_c will solve;
  • Moving index index_a_index_b to end will solve;
  • Split index index_a_index_b into two separated indexes will solve;

And about the parameter column:

I created this column just to make sure that it will returns only the `table_a.id = 123` as an additional parameter. In this case, I could run a different query with the same behaviour using `IN()` that will not will slowdown (0.000s vs. 1.516s for 1250 x 1250 rows).

SELECT table_a.id
FROM   table_a
 
WHERE
	table_a.id IN (
		SELECT table_b.parameter
		FROM table_b
		
		WHERE table_b.index_a = 123 AND
			   table_b.index_b = table_a.id AND
			   table_b.parameter = 123
		
		ORDER BY table_b.index_c
	)

Explain:

The EXPLAIN is a bit different.

  • The PRIMARY is the same: table_a, type index, possible keys NULL, key PRIMARY, key length 4, ref NULL, rows 1445, extra using where; using index;
  • The DEPENDENT SUBQUERY will have some equal values, like table table_b, possible keys index_a_index_b,parameter, rows 1 and extra using where;
  • The DEPENDENT SUBQUERY of EXISTS() query will be type index, key index_c, key length 5 and ref NULL;
  • The DEPENDENT SUBQUERY of IN() query will be type ref, key index_a_index_b, key length 10 and ref const,test.table_a.id;

About ORDER BY insde EXISTS()

This query is generated by Laravel framework when we use the whereHas() method. Because of a Scope it add the ORDER BY in this place. As workaround I have modified the code to use IN() instead of EXISTS(), but I think that it could be optimized to work fine in all the cases (even with ORDER BY).



 Comments   
Comment by Sergei Petrunia [ 2019-05-04 ]

ANALYZE FORMAT=JSON output on the current 10.1.40:

  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 5.708,
    "table": {
      "table_name": "table_a",
      "access_type": "index",
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["id"],
      "r_loops": 1,
      "rows": 1250,
      "r_rows": 1250,
      "r_total_time_ms": 0.3471,
      "filtered": 100,
      "r_filtered": 0.08,
      "attached_condition": "<in_optimizer>(table_a.`id`,<exists>(subquery#2))",
      "using_index": true
    },
    "subqueries": [
      {
        "query_block": {
          "select_id": 2,
          "r_loops": 1250,
          "r_total_time_ms": 4.6903,
          "table": {
            "table_name": "table_b",
            "access_type": "ref",
            "possible_keys": ["index_a_index_b", "parameter"],
            "key": "index_a_index_b",
            "key_length": "10",
            "used_key_parts": ["index_a", "index_b"],
            "ref": ["const", "test.table_a.id"],
            "r_loops": 1250,
            "rows": 1,
            "r_rows": 1,
            "r_total_time_ms": 3.5381,
            "filtered": 100,
            "r_filtered": 0.08,
            "attached_condition": "((1250 = table_b.parameter) and (table_b.parameter = 123))"
          }
        }
      }
    ]
  }
} 

Comment by Sergei Petrunia [ 2019-05-04 ]

If I remove the ORDER BY, the subquery is merged into a semi-join:

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.0467,
    "table": {
      "table_name": "table_a",
      "access_type": "const",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["id"],
      "ref": ["const"],
      "r_loops": 0,
      "rows": 1,
      "r_rows": null,
      "filtered": 100,
      "r_filtered": null,
      "using_index": true
    },
    "table": {
      "table_name": "table_b",
      "access_type": "ref",
      "possible_keys": ["index_a_index_b", "parameter"],
      "key": "index_a_index_b",
      "key_length": "10",
      "used_key_parts": ["index_a", "index_b"],
      "ref": ["const", "const"],
      "r_loops": 1,
      "rows": 1,
      "r_rows": 1,
      "r_total_time_ms": 0.0218,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "(table_b.parameter = 123)",
      "first_match": "table_a"
    }
  }
}

Comment by Sergei Petrunia [ 2019-05-04 ]

10.4 (debug build)

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 1.9693,
    "table": {
      "table_name": "table_a",
      "access_type": "index",
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["id"],
      "r_loops": 1,
      "rows": 1250,
      "r_rows": 1250,
      "r_total_time_ms": 0.2257,
      "filtered": 100,
      "r_filtered": 0.08,
      "attached_condition": "<in_optimizer>(table_a.`id`,<exists>(subquery#2))",
      "using_index": true
    },
    "subqueries": [
      {
        "query_block": {
          "select_id": 2,
          "r_loops": 1250,
          "r_total_time_ms": 1.4948,
          "table": {
            "table_name": "table_b",
            "access_type": "ref",
            "possible_keys": ["index_a_index_b", "parameter"],
            "key": "parameter",
            "key_length": "5",
            "used_key_parts": ["parameter"],
            "ref": ["const"],
            "r_loops": 1250,
            "rows": 1,
            "r_rows": 0.0008,
            "r_total_time_ms": 0.8395,
            "filtered": 100,
            "r_filtered": 100,
            "index_condition": "1250 = table_b.parameter",
            "attached_condition": "table_b.index_b = table_a.`id` and table_b.index_a = 123"
          }
        }
      }
    ]
  }
}

note that I don't get the sorting step here (but the bug report says it's done?)

Comment by Sergei Petrunia [ 2019-05-04 ]

Ok, If I run the ANALYZE FORMAT=JSON immediately after the INSERT statements (without a server restart in the middle), I get the sort step, too:

| {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 17.856,
    "table": {
      "table_name": "table_a",
      "access_type": "index",
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["id"],
      "r_loops": 1,
      "rows": 1250,
      "r_rows": 1250,
      "r_total_time_ms": 0.844,
      "filtered": 100,
      "r_filtered": 0.08,
      "attached_condition": "<in_optimizer>(1,exists(subquery#2))",
      "using_index": true
    },
    "subqueries": [
      {
        "expression_cache": {
          "state": "disabled",
          "r_loops": 200,
          "r_hit_ratio": 0,
          "query_block": {
            "select_id": 2,
            "r_loops": 1250,
            "r_total_time_ms": 15.395,
            "read_sorted_file": {
              "r_rows": 0.0008,
              "filesort": {
                "sort_key": "table_b.index_c",
                "r_loops": 1250,
                "r_total_time_ms": 10.515,
                "r_limit": 1,
                "r_used_priority_queue": true,
                "r_output_rows": 0,
                "table": {
                  "table_name": "table_b",
                  "access_type": "ref",
                  "possible_keys": ["index_a_index_b", "parameter"],
                  "key": "parameter",
                  "key_length": "5",
                  "used_key_parts": ["parameter"],
                  "ref": ["const"],
                  "r_loops": 1250,
                  "rows": 1,
                  "r_rows": 1,
                  "r_total_time_ms": 6.5982,
                  "filtered": 100,
                  "r_filtered": 0.08,
                  "attached_condition": "table_b.parameter <=> 123 and table_b.index_a = 123 and table_b.index_b = table_a.`id`"
                }
              }
            }
          }
        }
      }
    ]
  }
}

Comment by Sergei Petrunia [ 2019-05-04 ]

ORDER BY without LIMIT can be removed for EXISTS/IN subqueries.

For IN subqueries, LIMIT is not a concern, because MariaDB (and MySQL) doesn't support them. The queries will be rejected with this error:

ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

EXISTS subqueries allow LIMIT. "ORDER BY ... LIMIT n" should be removable. "ORDER BY ... LIMIT x OFFSET y" is not removable.

Comment by Sergei Petrunia [ 2019-05-04 ]

Found this piece in item_subselect.cc, Item_in_subselect::select_in_like_transformer

  /*
    IN/SOME/ALL/ANY subqueries aren't support LIMIT clause. Without it
    ORDER BY clause becomes meaningless thus we drop it here.
  */
  for (SELECT_LEX *sl= current->master_unit()->first_select();
       sl; sl= sl->next_select())
  {
    if (sl->join)
    {
      sl->join->order= 0;
      sl->join->skip_sort_order= 1;
    }
  }

Comment by Sergei Petrunia [ 2019-05-04 ]

Investigated.

The optimizer does these query rewrites in this order:

1. EXISTS->IN transformation.
2. IN->semi-join conversion
3. IN->EXISTS rewrite, MIN/MAX rewrite
3.1 Removal of redundant ORDER BY in subquery.

For the EXISTS subquery from this bug report, the following happens:

1. EXISTS->IN doesn't fire because it doesn't allow ORDER BY clause
2. IN->semi-join conversion doesn't happen (because #1 didn't)
3.1 Removal of redundant ORDER BY in subquery doesn't cover EXISTS queries.

In explain, we see sorting, and no merging.

For the IN-subquery from this bug report, the following happens:

2. IN->semi-join conversion doesn't happen because it doesn't allow ORDER BY clause
3.1. Redundant ORDER BY is removed. 

In explain, we see that sorting is not done, but merging is done.

Comment by Sergei Petrunia [ 2019-05-04 ]

Committed a patch that makes the order of transformations to be:

1. EXISTS->IN transformation.
2. Remove redundant ORDER BY [LIMIT] in subqueries
3. IN->semi-join conversion

Now, for the IN-subquery from this bug report:

2. Remove redundant ORDER BY [LIMIT] in subqueries works
2. IN->semi-join conversion works

and we should get a good query plan.

For the EXISTS subquery from this bug report:

1. EXISTS->IN transformation doesn't fire (because ORDER BY is present)
2. Remove redundant ORDER BY [LIMIT] in subqueries - works and removes it
3. IN->semi-join conversion - doesn't work because EXIST->IN didn't

The sorting is gone but we are not able to merge it to semi-join yet.

Comment by Sergei Petrunia [ 2019-05-04 ]

The Step #1 patch: http://lists.askmonty.org/pipermail/commits/2019-May/013738.html . sanja could you please review?

Comment by Sergei Petrunia [ 2019-05-04 ]

I've also tried to make EXISTS->IN transformation work when ORDER BY is present but didn't succeed. I'm hitting this assert in Item_exists_subselect::exists2in_processor

  DBUG_ASSERT(first_select->join->all_fields.elements ==
              first_select->item_list.elements);

Comment by Sergei Petrunia [ 2019-05-07 ]

Discussed with Sanja. The takeaways are:

  • Check again if ORDER BY ... LIMIT is removable for EXISTS subqueries (I initially thought that no, then he said yes during the conversation. Now I'm trying to type down the reasoning and find it to be faulty)
  • Item_exists_subselect::exists2in_processor should be able to ignore the ORDER BY clause. There are some adjustments to be made (e.g. asserts removed) but this should be easily doable.
Comment by Sergei Petrunia [ 2019-05-10 ]

Patch for Step #2: http://lists.askmonty.org/pipermail/commits/2019-May/013758.html

Comment by Sergei Petrunia [ 2019-05-10 ]

Sanja, please review both patches.

Comment by Oleksandr Byelkin [ 2019-05-14 ]

OK to push (both)

Comment by Sergei Petrunia [ 2019-05-16 ]

rentalhost, thanks for reporting this!

The fix is pushed into MariaDB 10.4 tree.

It should be trivial to apply the patch to previous MariaDB versions as well. The reason we are fixing this only in 10.4 is that it changes the query plans which may potentially cause a slowdown for somebody.

Generated at Thu Feb 08 08:49:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.