[MDEV-19620] Changing join_buffer_size causes different results Created: 2019-05-28  Updated: 2023-11-28

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.4, 10.5, 10.6

Type: Bug Priority: Critical
Reporter: Michael Widenius Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: 11.0-sel


 Description   

This test case is originally subselect_sj.test

--echo # 
--echo # MDEV-5059: Wrong result (missing row) wih semijoin, join_cache_level > 2, LEFT JOIN, ORDER BY
--echo # 
 
set join_buffer_size= 1024*1024*2;
SET join_cache_level = 3;
set optimizer_search_depth=62;
set join_buffer_space_limit=2097152;
 
set optimizer_switch="index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on";

CREATE TABLE t1 (pk1 INT PRIMARY KEY, i1 INT) ENGINE=MyISAM;
INSERT INTO t1 VALUES (1,4),(2,5);
 
CREATE TABLE t2 (i2 INT, c2 CHAR(1)) ENGINE=MyISAM;
INSERT INTO t2 VALUES (1,'v'),(7,'s');
 
CREATE TABLE t3 (pk3 INT PRIMARY KEY, i3 INT, c3 CHAR(1), INDEX(i3), INDEX(c3,i3)) ENGINE=MyISAM;
INSERT INTO t3 VALUES (1,7,'g'),(2,4,'p'),(3,1,'q');
 
CREATE TABLE t4 (i4 INT) ENGINE=MyISAM;
INSERT INTO t4 VALUES (1);

SELECT * FROM t1, t2 LEFT JOIN t3 ON (c3 = c2) 
WHERE ( i2, pk1 )  IN ( SELECT i3, 1 FROM t3, t4 ) ORDER BY i1, c2;
 
set join_buffer_size=262144;
 
SELECT * FROM t1, t2 LEFT JOIN t3 ON (c3 = c2) 
WHERE ( i2, pk1 )  IN ( SELECT i3, 1 FROM t3, t4 ) ORDER BY i1, c2;
 
DROP TABLE t1,t2,t3,t4;

-----------
The last SELECT produces two different results:

SELECT * FROM t1, t2 LEFT JOIN t3 ON (c3 = c2) 
WHERE ( i2, pk1 )  IN ( SELECT i3, 1 FROM t3, t4 ) ORDER BY i1, c2;
pk1     i1      i2      c2      pk3     i3      c3
1       4       7       s       1       7       g
1       4       7       s       2       4       p
1       4       7       s       3       1       q
1       4       1       v       1       7       g
1       4       1       v       2       4       p
1       4       1       v       3       1       q

set join_buffer_size=262144;
SELECT * FROM t1, t2 LEFT JOIN t3 ON (c3 = c2) 
WHERE ( i2, pk1 )  IN ( SELECT i3, 1 FROM t3, t4 ) ORDER BY i1, c2;
pk1     i1      i2      c2      pk3     i3      c3
1       4       7       s       NULL    NULL    NULL
1       4       1       v       NULL    NULL    NULL



 Comments   
Comment by Sergei Petrunia [ 2019-05-28 ]

If run the query with semijoin=off, I get two rows. I assume two rows is the correct output.

The EXPLAIN is the same with both join_buffer_size settings:

+------+-------------+-------+------------+---------------+-------------+---------+----------+------+----------------------------------------------------------------------------------+
| id   | select_type | table | type       | possible_keys | key         | key_len | ref      | rows | Extra                                                                            |
+------+-------------+-------+------------+---------------+-------------+---------+----------+------+----------------------------------------------------------------------------------+
|    1 | PRIMARY     | t4    | system     | NULL          | NULL        | NULL    | NULL     |    1 | Using temporary; Using filesort                                                  |
|    1 | PRIMARY     | t1    | const      | PRIMARY       | PRIMARY     | 4       | const    |    1 |                                                                                  |
|    1 | PRIMARY     | t2    | ALL        | NULL          | NULL        | NULL    | NULL     |    2 | Using where                                                                      |
|    1 | PRIMARY     | t3    | hash_index | i3            | #hash#i3:i3 | 5:5     | j2.t2.i2 |    3 | Using index; Start temporary; End temporary; Using join buffer (flat, BNLH join) |
|    1 | PRIMARY     | t3    | hash_ALL   | c3            | #hash#c3    | 2       | j2.t2.c2 |    3 | Using where; Using join buffer (flat, BNLH join)                                 |
+------+-------------+-------+------------+---------------+-------------+---------+----------+------+----------------------------------------------------------------------------------+

Comment by Sergei Petrunia [ 2019-05-28 ]

Reproducible without the ORDER BY clause.

Comment by Varun Gupta (Inactive) [ 2020-10-21 ]

OK the plans look similar but they are not exactly the same, lets analyze the output of ANALYZE FORMAT=JSON for the queries with different join_buffer_size.
When we have a smaller join_buffer_size in that case we disallow using join buffering for a table (t3 here)

Output of ANALYZE with join_buffer_size= 2097152

ANALYZE FORMAT=JSON
SELECT * FROM t1, t2 LEFT JOIN t3 ON (c3 = c2)
WHERE ( i2, pk1 )  IN ( SELECT i3, i4 FROM t3 A, t4 B);
ANALYZE
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.4434,
    "table": {
      "table_name": "B",
      "access_type": "system",
      "r_loops": 0,
      "rows": 1,
      "r_rows": null,
      "filtered": 100,
      "r_filtered": null
    },
    "table": {
      "table_name": "t1",
      "access_type": "const",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["pk1"],
      "ref": ["const"],
      "r_loops": 0,
      "rows": 1,
      "r_rows": null,
      "filtered": 100,
      "r_filtered": null
    },
    "table": {
      "table_name": "t2",
      "access_type": "ALL",
      "r_loops": 1,
      "rows": 2,
      "r_rows": 2,
      "r_total_time_ms": 0.0201,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "t2.i2 is not null"
    },
    "duplicates_removal": {
      "block-nl-join": {
        "table": {
          "table_name": "A",
          "access_type": "hash_index",
          "possible_keys": ["i3"],
          "key": "#hash#i3:i3",
          "key_length": "5:5",
          "used_key_parts": ["i3"],
          "ref": ["test.t2.i2"],
          "r_loops": 1,
          "rows": 3,
          "r_rows": 3,
          "r_total_time_ms": 0.055,
          "filtered": 66.667,
          "r_filtered": 100,
          "using_index": true
        },
        "buffer_type": "flat",
        "buffer_size": "2048Kb",
        "join_type": "BNLH",
        "r_filtered": 100
      }
    },
    "table": {
      "table_name": "t3",
      "access_type": "hash_ALL",
      "possible_keys": ["c3"],
      "key": "#hash#c3",
      "key_length": "2",
      "used_key_parts": ["c3"],
      "ref": ["test.t2.c2"],
      "r_loops": 2,
      "rows": 3,
      "r_rows": 3,
      "r_total_time_ms": 0.0289,
      "filtered": 33.333,
      "r_filtered": 100,
      "attached_condition": "trigcond(trigcond(t2.c2 is not null))"
    }
  }
}

Output for ANALYZE FORMAT=JSON with join_buffer_size= 262144 (gives correct result)

ANALYZE FORMAT=JSON
SELECT * FROM t1, t2 LEFT JOIN t3 ON (c3 = c2)
WHERE ( i2, pk1 )  IN ( SELECT i3, i4 FROM t3 A, t4 B);
ANALYZE
{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.3191,
    "table": {
      "table_name": "B",
      "access_type": "system",
      "r_loops": 0,
      "rows": 1,
      "r_rows": null,
      "filtered": 100,
      "r_filtered": null
    },
    "table": {
      "table_name": "t1",
      "access_type": "const",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["pk1"],
      "ref": ["const"],
      "r_loops": 0,
      "rows": 1,
      "r_rows": null,
      "filtered": 100,
      "r_filtered": null
    },
    "table": {
      "table_name": "t2",
      "access_type": "ALL",
      "r_loops": 1,
      "rows": 2,
      "r_rows": 2,
      "r_total_time_ms": 0.0172,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "t2.i2 is not null"
    },
    "duplicates_removal": {
      "block-nl-join": {
        "table": {
          "table_name": "A",
          "access_type": "hash_index",
          "possible_keys": ["i3"],
          "key": "#hash#i3:i3",
          "key_length": "5:5",
          "used_key_parts": ["i3"],
          "ref": ["test.t2.i2"],
          "r_loops": 1,
          "rows": 3,
          "r_rows": 3,
          "r_total_time_ms": 0.0422,
          "filtered": 66.667,
          "r_filtered": 100,
          "using_index": true
        },
        "buffer_type": "flat",
        "buffer_size": "256Kb",
        "join_type": "BNLH",
        "r_filtered": 100
      }
    },
    "block-nl-join": {
      "table": {
        "table_name": "t3",
        "access_type": "hash_ALL",
        "possible_keys": ["c3"],
        "key": "#hash#c3",
        "key_length": "2",
        "used_key_parts": ["c3"],
        "ref": ["test.t2.c2"],
        "r_loops": 1,
        "rows": 3,
        "r_rows": 3,
        "r_total_time_ms": 0.0132,
        "filtered": 33.333,
        "r_filtered": 100
      },
      "buffer_type": "flat",
      "buffer_size": "256Kb",
      "join_type": "BNLH",
      "attached_condition": "trigcond(trigcond(t2.c2 is not null))",
      "r_filtered": null
    }
  }
}

The problem with the first output (with smaller join buffer), it is still using hash key for table t3 which is incorrect. I think instead of using hash-key we could use ref access on key (c3) here instead. Also by not being able to do the lookup the ON condition is not evaluated (c3= c2) and hence all rows pass the where clause and we return more results.

Comment by Sergei Petrunia [ 2020-12-29 ]

Review: https://lists.launchpad.net/maria-developers/msg12491.html

(also, do not forget to replace an assert with assignment, as discussed on Slack)

Comment by Varun Gupta (Inactive) [ 2020-12-30 ]

Made the changes
http://lists.askmonty.org/pipermail/commits/2020-December/014429.html

Comment by Sergei Petrunia [ 2021-01-12 ]

If Igor says it's an area he should be reviewing, perhaps this is also an area he should be making the fix for?

Generated at Thu Feb 08 08:53:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.