[MDEV-4419] SQL PARSER - Enum Optimization - some queries could use index and could return 'Impossible Where' when using ENUM fields and constants outside ENUM declaration Created: 2013-04-23  Updated: 2015-11-17

Status: Open
Project: MariaDB Server
Component/s: None
Fix Version/s: None

Type: Task Priority: Minor
Reporter: roberto spadim Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: optimizer

Attachments: File mdev4419.tar.gz    

 Description   

Hi guys, i found a optimization (maybe a bug?) that could be 'easly' optimized, check this query...

EXPLAIN
SELECT COUNT(*)FROM dig_lotes WHERE lote_tipo='r' AND sit_spa!='mov'  

this result in 446196 rows (read all table), and changing it to:

EXPLAIN
SELECT COUNT(*)FROM dig_lotes WHERE lote_tipo='r' AND sit_spa IN ('','d','s','p','a')

result in 16 rows and 'using where' (an index)
the interesting part is....

column 'sit_spa' is ENUM('d','s','p','a','mov') NOT NULL DEFAULT 'd'


sit_spa!='mov' could be optimized via ENUM declaration to: 'd','s','p','a' plus '' if database accepted a value outside enum declaration, ex: a insert of 'xyz' result in '' enum value + warning in this field

the idea is...
optimize "enum_field!=enum_value" removing the enum_value from enum declaration list.
the result is an index being used instead of a full table scan

A second optimization is:
WHERE sit_spa='z'
since 'z' isn't in 'd','s','p','a','mov', we could return 'impossible where'

check that enum is a diferent field type (for where part, select part and order by/group by parts)...

for sit_spa enum ('d','s') not null
enum_index 0 = ""
enum_index 1 = 'd'
enum_index 2 = 's'

for sit_spa enum ('d','s') (without not null), enum_index NULL = NULL

when using enum_value ='3' and enum_index=2, example: enum ('1','3')
if we execute, WHERE enum='2'
the enum_value='2' don't exists, the parser check for enum_index, and found '3' as result (since enum_index='2', is enum_value='3')


implementation ?
from sergei emails:

We have a range optimizer (in opt_range.cc), it can already do pretty, much what you say, it can change, for example

WHERE a != 10

to

WHERE a < 10 OR a > 10

So, perhaps, all you need to do is to make sure this code is used for your ENUM queries.

but reading sql_select.cc we have remove_eq_conds that check conditions one by one, and can rewrite the condition or return a impossible where, or a always true condition
i think that get_mm_leaf is something that only works with index (i'm wrong?)


maybe it should be implemented in opt_range.cc at get_mm_leaf() function? or in sql_select.cc remove_eq_conds function?

something like:

  /* 
	MDEV-4419
	IMPOSSIBLE WHERE FOR ENUM FIELDS / REWRITE OF ENUM FIELDS OPERATOR
	
	any enum_field OP const_value should be executed in all enum values,
	to optimize the query, since ENUM is a 'bitmap' of all possible values
	we can rewrited it to IN/NOT IN operator, this allow index usage
	
	examples:
	
	for NULL possible fields:
		a ENUM ('a','b','c')
		internally we have: NULL,0=>'',1=>'a',2=>'b',3=>'c'
	for NOT NULL fields:
		a ENUM ('a','b','c') NOT NULL
		internally we have: 0=>'',1=>'a',2=>'b',3=>'c'
	
	example 1) (impossible where)
	WHERE a = 'd'
	'' = 'd'  ? false
	'a' = 'd' ? false
	'b' = 'd' ? false
	'c' = 'd' ? false
	
	rewrite to  
	(1=0) - always false, impossble where
	
	example 2) (allways true)
	a < 'd' will result:
	'' < 'd'  ? true
	'a' < 'd' ? true
	'b' < 'd' ? true
	'c' < 'd' ? true
	
	rewriting to :
	a IN ('','a','b','c') - always true
	
	example 3) (some values of enum, using IN)
	a LIKE '%a' will result:
	'' LIKE '%a'  ? false
	'a' LIKE '%a' ? true
	'b' LIKE '%a' ? false
	'c' LIKE '%a' ? false
 
	we have 1 true, and 3 falses, better use IN ()
	rewriting to :
	
	using IN:     a IN ('a')
	using NOT IN: a NOT IN ('','b','c')
	or using ENUM INDEX:
	using IN:     a IN (1)
	using NOT IN: a NOT IN (0,2,3)
	
	
	example 4) (some values of enum, using NOT IN)
	a != 'b' will result:
	'' != 'b'  ? true
	'a' != 'b' ? true
	'b' != 'b' ? false
	'c' != 'b' ? true
	
	we have 3 true, and 1 false, better use NOT IN ()
	rewriting to:
	
	using IN:     a IN ('','a','c')
	using NOT IN: a NOT IN ('b')
	or using ENUM INDEX:
	using IN:     a IN (0,1,3)
	using NOT IN: a NOT IN (2)
  */
  if (field->real_type() == MYSQL_TYPE_ENUM)
  {
    uint count_true_list =0;
    uint count_false_list=0;
    bool true_false_list[];
    uint enum_field_counter=0;
    char buffer[255];
    String enum_item(buffer, sizeof(buffer), res.charset());
    
    /* seek all values and check how many true/false we got */
    resize_true_false_list[]
    set true_false_list to false;
    /* true_false_list start from 0 to enum values +1, NULL = index 0, */
    uint *len= field->typelib->type_lengths;
    for (enum_field_counter=0,const char **pos= field->typelib->type_names; *pos; pos++, len++, enum_field_counter++)
    {
      enum_string_value= enum_item.copy(*pos, *len, charset(), res.charset(), &dummy_errors);
      if (type == Item_func::IN_FUNC)
      {
	/* search in each IN value */
	while( .. in_values .. )
	{
 
  	  if ((enum_str_value == in_str_value && in_value_is_string && compare_function(enum_str,in_str)) || 
              (enum_int_value == in_int_value && !in_value_is_string && compare_function(enum_int,in_int))) {
	    count_true_list ++;
	    true_false_list[enum_field_counter]=true;
 	  } else {
	    count_false_list ++;
	  }
	}
      } else if (type == Item_func::BETWEEN) {
        /* two args */
        // don't know what to do...
        goto end;
      } else {
        /* only one value */
  	if ((enum_str_value == str_value && value_is_string && compare_function(enum_str,value_str)) || 
              (enum_int_value == int_value && !value_is_string && compare_function(enum_int,value_int))) {
	  count_true_list ++;
	  true_false_list[enum_field_counter]=true;
	} else {
	  count_false_list ++;
	}
      }
    }
    /* check about always true (discart operation) / false (impossible where) */
    if (count_false_list == 0)
    {
      tree= &null_element;	/* always true */
      goto end;
    }
    if (count_true_list == 0)
    {
      tree->type= SEL_ARG::IMPOSSIBLE;	/* impossible where */
      goto end;
    }
    /* rewrite the operation using IN / NOT IN () */
    if (count_false_list > count_true_list)
	inv=true; /* add NOT before IN () */
    type=Item_func::IN_FUNC;
    for each true_false_list
    {
      if(inv)  
        add respective enum_value of true_false_list when it's false
      else
        add respective enum_value of true_false_list when it's true
    }
    goto end;
  }



 Comments   
Comment by Elena Stepanova [ 2013-04-23 ]

Hi Roberto,

Please provide the full data dump for dig_lotes.

Another question,

>> optimize "enum_field!=enum_value" removing the enum_value from enum declaration list.

Would you be willing to implement the optiization and provide a patch?

Comment by roberto spadim [ 2013-04-23 ]

hi elena, i don't have many contact with source code of mariadb, but i didn't understand if you got the main idea, something like...

enum_field can have enum(x,y,z,....) values + wrong value of enum( out of fields )
wrong values are normally ''

before execute field!=value check if the field is a enum or not, if yes it could be optimized to field IN (possible values) (i didn't checked if NOT IN () is optimized but i'm sure that IN is optimized, that's the work around that i'm using to execute SQL via index instead full table scan)

i don't know where could i help in the code :/
maybe i could help if someone could show me some guide lines and explain what files should i read, or where i could study more about the code to change it.

Comment by roberto spadim [ 2013-04-23 ]

using the table of MDEV-4416
you can check this:

explain select  * from dig_lotes where lote_tipo='r' and sit_spa !='mov'

result = Using where, 446577 rows, type= ALL

and this (using 'enum field to optimize'):

explain select  * from dig_lotes where lote_tipo='r' and sit_spa IN ('d','s','p','a','')

result = Using where, 28 rows, key = consulta_lote, type = range

CREATE TABLE `dig_lotes` (
`lote_unidade` int(11) NOT NULL DEFAULT '0',
`lote_data` date NOT NULL DEFAULT '0000-00-00',
`lote_numero` int(11) NOT NULL DEFAULT '0',
`lote_tipo` char(1) NOT NULL DEFAULT 'p',
`total_ocorrencias` int(10) unsigned NOT NULL DEFAULT '0',
`total_documentos` int(10) unsigned NOT NULL DEFAULT '0',
`total_baixas` int(10) unsigned NOT NULL DEFAULT '0',
`total_bancos` int(10) unsigned NOT NULL DEFAULT '0',
`moeda_total` char(3) NOT NULL DEFAULT 'R$',
`valor_total` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000',
`valor_baixas` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000',
`valor_bancos` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000',
`digitador_tipo` enum('f','j') NOT NULL DEFAULT 'f',
`digitador_id` int(11) NOT NULL DEFAULT '0',

  1. sit_spa was changed in MDEV 4416, to char(3) after i checked the slow query, i was thinking that CHAR(3) could make optimizer happy, but no results
    `sit_spa` ENUM('d','s','p','a','mov') NOT NULL DEFAULT 'd',

`data_spa_d` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`data_spa_s` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`data_spa_p` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`data_spa_a` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`data_spa_mov` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`bordero` bigint(20) unsigned NOT NULL DEFAULT '0',
`cessionario_tipo` enum('f','j') NOT NULL DEFAULT 'f',
`cessionario_id` bigint(20) NOT NULL DEFAULT '0',
`importado` varchar(25) NOT NULL,
`numero_remessa_boleto` bigint(20) unsigned NOT NULL DEFAULT '0',
`arquivo_importacao_id` int(10) unsigned NOT NULL DEFAULT '0',
`observacao` text NOT NULL,
`lote_data_origem` date NOT NULL DEFAULT '0000-00-00',
`lote_numero_origem` int(11) NOT NULL DEFAULT '0',
`cobrador_padrao_tipo` enum('f','j') NOT NULL DEFAULT 'f',
`cobrador_padrao_id` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`lote_unidade`,`lote_data`,`lote_numero`),
KEY `lote_origem` (`lote_unidade`,`lote_data_origem`,`lote_numero_origem`),
KEY `consulta_lote` (`lote_tipo`,`sit_spa`(1),`lote_data`,`lote_numero`)
)

Comment by roberto spadim [ 2013-04-23 ]

here a smaller test table with the same 'problem':

http://www.spadim.com.br/mdev%204419.sql

Comment by roberto spadim [ 2013-04-23 ]

SEE THE DIFERENCE WITH TABLE OF LAST POST (5000 ROWS)

1)
flush status; flush query cache;
select * from tmp_test where lote_tipo='r' and sit_spa !='mov';
show status like 'handler_read%';
RESULT => Handler_read_rnd_next=5001

2)
flush status; flush query cache;
select * from tmp_test where lote_tipo='r' and sit_spa IN ('d','s','p','a','');
show status like 'handler_read%';
RESULT => Handler_read_key=5

Comment by roberto spadim [ 2013-04-23 ]

an example in php...

function optimize_diference_enum($enum_values,$enum_diff_value){
$enum_possibilities=$enum_values;unset($enum_possibilities['mov']);
$return=" IN (\"".implode(array_keys($enum_possibilities))."\")";
}
$enum_field_values=array(''=>-1,'s'=>0,'p'=>1,'a'=>2,'mov'=>3);
$SQL="SELECT * FROM tmp_test WHERE sit_spa ".optimize_diference_enum($enum_field_values,'mov');


i don't know where in mariadb code $enum_field_values could be checked when optimizing the query
but with this 'array' of enum values we can remove the != value part and rewrite the query with a IN ()


enum have a very nice point for optimizer, since we know what values can be found without reading the table! for example:
a table with a field sit_spa => ENUM('s','p','a')

1)select * from tmp_test where sit_spa='asdfasdf'
this query will result 0 rows, since 'asdfasdf' is not in ENUM() declaration, and can return near to 0 seconds

2)select * from tmp_test where sit_spa!='a'
this query can be optimized to sit_spa IN ('','s','p') using ENUM() declaration, and if table have an index we can use it! (the best part of optimization)


i will stop posting examples in this MDEV, it's too long

Comment by Elena Stepanova [ 2013-04-24 ]

Hi Roberto,

>> i don't have many contact with source code of mariadb, but i didn't understand if you got the main idea...

I got your idea, it's just when somebody says that something is easy to implement, I hope they already know how exactly to implement it, hence the simplest way to get it done is to provide it in the form of a patch.

Anyway, I will re-attach your test case to the report if you don't mind, so it's not lost, and will convert it into a task (a.k.a feature request).

Comment by Elena Stepanova [ 2013-04-24 ]

SQL from http://www.spadim.com.br/mdev%204419.sql is attached as mdev4419.tar.gz

Comment by roberto spadim [ 2013-04-24 ]

yes i agree with you, it's not 'easy' since i don't know how to do it
there's some where that i can read the flow of:
receive query -> parse -> optimize -> execute?

Comment by roberto spadim [ 2013-06-14 ]

Two points of optimization:
first one is the rewrite of operators different from = , be optimized via 'rewrite' to IN operator (including the IN operator)
the second is the 'impossible where' optimizations, executing the operator in each enum value, considering that index 0 of enum is "" and NULL is/isn't possible
both optimizations should consider string representation of enum, and not the index value

Comment by roberto spadim [ 2013-06-14 ]

to anyone who want help, maybe get_mm_leaf in opt_range.cc is the point to optimize, i don't know the source code, but i will register here, to don't forget or be a start point to anyone

Generated at Thu Feb 08 06:56:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.