Details
- 
    
Task
 - 
    Status: Closed (View Workflow)
 - 
    
Critical
 - 
    Resolution: Fixed
 
- 
        Q4/2025 Server Maintenance
 
Description
Problem description
Consider a query using Split Materialized (EDIT: or just derived-with-keys): (database fill steps provided below)
					explain select *  | 
		
					from  | 
		
					t2, (select max(value), grp_id from t1 group by grp_id) DT  | 
		
					where
			 | 
		
					  t2.a= DT.grp_id;
			 | 
		
					+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
			 | 
		
					| id   | select_type     | table      | type | possible_keys | key    | key_len | ref        | rows | Extra       |
			 | 
		
					+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
			 | 
		
					|    1 | PRIMARY         | t2         | ALL  | NULL          | NULL   | NULL    | NULL       | 5    | Using where |
			 | 
		
					|    1 | PRIMARY         | <derived2> | ref  | key0          | key0   | 5       | test2.t2.a | 10   |             |
			 | 
		
					|    2 | LATERAL DERIVED | t1         | ref  | grp_id        | grp_id | 5       | test2.t2.a | 100  |             |
			 | 
		
					+------+-----------------+------------+------+---------------+--------+---------+------------+------+-------------+
			 | 
		
Note that table <derived2> has rows=10, even if we know that it will have rows=1 due to GROUP BY and splitting.
This isn't just the wrong estimate for number of rows to scan in <derived2>. The incorrect fanout affects the partial join cardinality.
Relevant sections of the optimizer trace:
					"plan_prefix": "t2",  | 
		
					"get_costs_for_tables": [  | 
		
					                      {
			 | 
		
					"best_access_path": {  | 
		
					"table": "<derived2>",  | 
		
					                          ...
			 | 
		
					"choose_best_splitting": {  | 
		
					"split_materialized": {  | 
		
					"chosen": true  | 
		
					                            }
			 | 
		
					                            ...
			 | 
		
					                          },
			 | 
		
					"chosen_access_method": {  | 
		
					"type": "ref",  | 
		
					"rows_read": 10,  | 
		
					"rows_out": 10,  | 
		
					"cost": 0.00639105,  | 
		
					"uses_join_buffering": false  | 
		
					                          }
			 | 
		
and then: (t2 has rows=5, so rows_for_plan=50 is a 10x over-estimate:
					                  {
			 | 
		
					"plan_prefix": "t2",  | 
		
					"table": "<derived2>",  | 
		
					"rows_for_plan": 50,  | 
		
					"cost_for_plan": 0.01806085  | 
		
					                  }
			 | 
		
Can this be fixed by just setting rows=1?
Database fill steps:
					Â  | 
		
					create table t1 (  | 
		
					grp_id int,  | 
		
					value int,  | 
		
					index (grp_id)  | 
		
					);
			 | 
		
					Â  | 
		
					insert into t1 select  | 
		
					  A.seq, B.seq
			 | 
		
					from  | 
		
					  seq_1_to_10000 A, 
			 | 
		
					  seq_1_to_100 B
			 | 
		
					;
			 | 
		
					Â  | 
		
					create table t2 (a int);  | 
		
					insert into t2 select seq from seq_1_to_5;  | 
		
					Â  | 
		
					analyze table t1,t2;  | 
		
A related bugfix done recently: MDEV-30877. 
Solution for derived_with_keys
Note that SELECT DISTINCT is already covered
Note that the code in TABLE::add_tmp_key already has similar logic for DISTINCT:
					if (derived)  | 
		
					    {
			 | 
		
					      st_select_lex* first= derived->first_select();
			 | 
		
					      uint select_list_items= first->get_item_list()->elements;
			 | 
		
					if (key_parts == select_list_items)  | 
		
					      {
			 | 
		
					if ((!first->is_part_of_union() && (first->options & SELECT_DISTINCT)) ||  | 
		
					            derived->check_distinct_in_union())
			 | 
		
					          keyinfo->rec_per_key[key_parts - 1]= 1;
			 | 
		
					      }
			 | 
		
					    }
			 | 
		
"if the index covers all elements of the select list, then specifying all keyparts gives rec_per_key=1".
Handling GROUP BY
An idea of what could be done here: add here a function that handles cases when
- The derived table has one SELECT (!first->is_part_of_union())
 - The derived table has a GROUP BY clause (first->group)
 
The logic to produce estimates is as follows.
We have a SELECT with GROUP BY that is materialized into a temporary table:
					create temporary table (tmp_col1, ... tmp_colK) as  | 
		
					select  | 
		
					   col1, ..., colN
			 | 
		
					sum(...) as group_col1, ...  | 
		
					group by  | 
		
					   col1, ... colN;
			 | 
		
and we're considering an index on some of the temp.table columns:
					INDEX( tmp_table_col1, ..., tmp_table_colN ...)  | 
		
Any index prefix that includes all GROUP BY columns will have rec_per_key=1.
Pseudo code to implement this:
					  my_group_list= {collect a list of items in the subquery's GROUP BY clause};
			 | 
		
					  for each key part  KP { 
			 | 
		
					     F= table->field(KP->field_number); 
			 | 
		
					     if (find F in my_group_list == NOT_FOUND)
			 | 
		
					        continue; // Can do nothing for this key part. But might do for further key parts
			 | 
		
					     remove F from my_group_list;
			 | 
		
					     if (my_group_list is empty)
			 | 
		
					     {
			 | 
		
					        for each keypart KP2 starting from KP 
			 | 
		
					           set index->rec_per_key[KP2] = 1;
			 | 
		
					        break;
			 | 
		
					     }
			 | 
		
					   }
			 | 
		
Handling UNION queries
TODO. Johnston's patch tries to do some handling but currently has bugs.
UNION [DISTINCT] removes the duplicates. If the index covers all of select list, we can set
					index->rec_per_key[n_key_parts - 1]=1;
			 | 
		
Both UNION DISTINCT and UNION ALL let one rec_per_key from each of the select. 
The initial state is that index statistics is unknown, that is, rec_per_key[i]=0 for all key parts:
| key_part | kp1 | kp2 | kp3 | kp 4 | 
| rec_per_key | 0 | 0 | 0 | 0 | 
Suppose, the first SELECT in the UNION has GROUP BY kp1,kp2,kp3. We will get this statistics:
| key_part | kp1 | kp2 | kp3 | kp 4 | 
| rec_per_key | 0 | 0 | 1 | 1 | 
Suppose, the second SELECT in the UNION has GROUP BY kp1,kp2. We will get:
| key_part | kp1 | kp2 | kp3 | kp 4 | 
| rec_per_key | 0 | 1 | 1 | 1 | 
How to add the statistics? 
If rec_per_key[i]=0 for any of the SELECTs, that means UNKNOWN and the result should also be 9, UNKNOWN.
If rec_per_key[i]=1 for each of the SELECTs, then we can add  1+1=2  and have that as an estimate for the resulting table.
For or example, this gives:
| key_part | kp1 | kp2 | kp3 | kp 4 | 
| rec_per_key | 0 | 0 | 2 | 2 | 
Touching the "guesstimate" of 10 rows
In current code, best_access_path() will use a guesstimate of 10 rows when index statistics is not available. 
Touching this goes outside of the scope of this MDEV entry.
Solution for Split-Materialized
Split-Materialized seems to reuse the code from derived_with_keys?
I've used the testcase in this MDEV, set rec_per_key[key_parts-1]=1  in the debugger, and the EXPLAIN that uses Split-Materialized has shown correct rows_out. 
One probably needs to also go through the Optimizer Trace and see if all cardinalities are correct.
Attachments
Issue Links
- is part of
 - 
                    
MDEV-26300 LATERAL DERIVED: Optimizer trace insufficient, odd computations
-         
 - Confirmed
 
 -         
 
- relates to
 - 
                    
MDEV-30877 Wrong cardinality estimation for the derived table leads to slow plan with LATERAL DERIVED
-         
 - Closed
 
 -         
 - 
                    
MDEV-37249 Assertion fail in apply_selectivity_for_table
-         
 - Confirmed
 
 -         
 - 
                    
MDEV-37936 Followup to MDEV-36321: out_rows for GROUP BY: use of item names?
-         
 - In Progress
 
 -         
 - 
                    
MDEV-26300 LATERAL DERIVED: Optimizer trace insufficient, odd computations
-         
 - Confirmed
 
 -         
 - 
                    
MDEV-37201 Pick the most efficient subset of keys there are too many key parts
-         
 - Open
 
 -