Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4419

SQL PARSER - Enum Optimization - some queries could use index and could return 'Impossible Where' when using ENUM fields and constants outside ENUM declaration

Details

    • Task
    • Status: Open (View Workflow)
    • Minor
    • Resolution: Unresolved
    • None
    • None

    Description

      Hi guys, i found a optimization (maybe a bug?) that could be 'easly' optimized, check this query...

      EXPLAIN
      SELECT COUNT(*)FROM dig_lotes WHERE lote_tipo='r' AND sit_spa!='mov'  

      this result in 446196 rows (read all table), and changing it to:

      EXPLAIN
      SELECT COUNT(*)FROM dig_lotes WHERE lote_tipo='r' AND sit_spa IN ('','d','s','p','a')

      result in 16 rows and 'using where' (an index)
      the interesting part is....

      column 'sit_spa' is ENUM('d','s','p','a','mov') NOT NULL DEFAULT 'd'


      sit_spa!='mov' could be optimized via ENUM declaration to: 'd','s','p','a' plus '' if database accepted a value outside enum declaration, ex: a insert of 'xyz' result in '' enum value + warning in this field

      the idea is...
      optimize "enum_field!=enum_value" removing the enum_value from enum declaration list.
      the result is an index being used instead of a full table scan

      A second optimization is:
      WHERE sit_spa='z'
      since 'z' isn't in 'd','s','p','a','mov', we could return 'impossible where'

      check that enum is a diferent field type (for where part, select part and order by/group by parts)...

      for sit_spa enum ('d','s') not null
      enum_index 0 = ""
      enum_index 1 = 'd'
      enum_index 2 = 's'

      for sit_spa enum ('d','s') (without not null), enum_index NULL = NULL

      when using enum_value ='3' and enum_index=2, example: enum ('1','3')
      if we execute, WHERE enum='2'
      the enum_value='2' don't exists, the parser check for enum_index, and found '3' as result (since enum_index='2', is enum_value='3')


      implementation ?
      from sergei emails:

      We have a range optimizer (in opt_range.cc), it can already do pretty, much what you say, it can change, for example

      WHERE a != 10

      to

      WHERE a < 10 OR a > 10

      So, perhaps, all you need to do is to make sure this code is used for your ENUM queries.

      but reading sql_select.cc we have remove_eq_conds that check conditions one by one, and can rewrite the condition or return a impossible where, or a always true condition
      i think that get_mm_leaf is something that only works with index (i'm wrong?)


      maybe it should be implemented in opt_range.cc at get_mm_leaf() function? or in sql_select.cc remove_eq_conds function?

      something like:

        /* 
      	MDEV-4419
      	IMPOSSIBLE WHERE FOR ENUM FIELDS / REWRITE OF ENUM FIELDS OPERATOR
      	
      	any enum_field OP const_value should be executed in all enum values,
      	to optimize the query, since ENUM is a 'bitmap' of all possible values
      	we can rewrited it to IN/NOT IN operator, this allow index usage
      	
      	examples:
      	
      	for NULL possible fields:
      		a ENUM ('a','b','c')
      		internally we have: NULL,0=>'',1=>'a',2=>'b',3=>'c'
      	for NOT NULL fields:
      		a ENUM ('a','b','c') NOT NULL
      		internally we have: 0=>'',1=>'a',2=>'b',3=>'c'
      	
      	example 1) (impossible where)
      	WHERE a = 'd'
      	'' = 'd'  ? false
      	'a' = 'd' ? false
      	'b' = 'd' ? false
      	'c' = 'd' ? false
      	
      	rewrite to  
      	(1=0) - always false, impossble where
      	
      	example 2) (allways true)
      	a < 'd' will result:
      	'' < 'd'  ? true
      	'a' < 'd' ? true
      	'b' < 'd' ? true
      	'c' < 'd' ? true
      	
      	rewriting to :
      	a IN ('','a','b','c') - always true
      	
      	example 3) (some values of enum, using IN)
      	a LIKE '%a' will result:
      	'' LIKE '%a'  ? false
      	'a' LIKE '%a' ? true
      	'b' LIKE '%a' ? false
      	'c' LIKE '%a' ? false
       
      	we have 1 true, and 3 falses, better use IN ()
      	rewriting to :
      	
      	using IN:     a IN ('a')
      	using NOT IN: a NOT IN ('','b','c')
      	or using ENUM INDEX:
      	using IN:     a IN (1)
      	using NOT IN: a NOT IN (0,2,3)
      	
      	
      	example 4) (some values of enum, using NOT IN)
      	a != 'b' will result:
      	'' != 'b'  ? true
      	'a' != 'b' ? true
      	'b' != 'b' ? false
      	'c' != 'b' ? true
      	
      	we have 3 true, and 1 false, better use NOT IN ()
      	rewriting to:
      	
      	using IN:     a IN ('','a','c')
      	using NOT IN: a NOT IN ('b')
      	or using ENUM INDEX:
      	using IN:     a IN (0,1,3)
      	using NOT IN: a NOT IN (2)
        */
        if (field->real_type() == MYSQL_TYPE_ENUM)
        {
          uint count_true_list =0;
          uint count_false_list=0;
          bool true_false_list[];
          uint enum_field_counter=0;
          char buffer[255];
          String enum_item(buffer, sizeof(buffer), res.charset());
          
          /* seek all values and check how many true/false we got */
          resize_true_false_list[]
          set true_false_list to false;
          /* true_false_list start from 0 to enum values +1, NULL = index 0, */
          uint *len= field->typelib->type_lengths;
          for (enum_field_counter=0,const char **pos= field->typelib->type_names; *pos; pos++, len++, enum_field_counter++)
          {
            enum_string_value= enum_item.copy(*pos, *len, charset(), res.charset(), &dummy_errors);
            if (type == Item_func::IN_FUNC)
            {
      	/* search in each IN value */
      	while( .. in_values .. )
      	{
       
        	  if ((enum_str_value == in_str_value && in_value_is_string && compare_function(enum_str,in_str)) || 
                    (enum_int_value == in_int_value && !in_value_is_string && compare_function(enum_int,in_int))) {
      	    count_true_list ++;
      	    true_false_list[enum_field_counter]=true;
       	  } else {
      	    count_false_list ++;
      	  }
      	}
            } else if (type == Item_func::BETWEEN) {
              /* two args */
              // don't know what to do...
              goto end;
            } else {
              /* only one value */
        	if ((enum_str_value == str_value && value_is_string && compare_function(enum_str,value_str)) || 
                    (enum_int_value == int_value && !value_is_string && compare_function(enum_int,value_int))) {
      	  count_true_list ++;
      	  true_false_list[enum_field_counter]=true;
      	} else {
      	  count_false_list ++;
      	}
            }
          }
          /* check about always true (discart operation) / false (impossible where) */
          if (count_false_list == 0)
          {
            tree= &null_element;	/* always true */
            goto end;
          }
          if (count_true_list == 0)
          {
            tree->type= SEL_ARG::IMPOSSIBLE;	/* impossible where */
            goto end;
          }
          /* rewrite the operation using IN / NOT IN () */
          if (count_false_list > count_true_list)
      	inv=true; /* add NOT before IN () */
          type=Item_func::IN_FUNC;
          for each true_false_list
          {
            if(inv)  
              add respective enum_value of true_false_list when it's false
            else
              add respective enum_value of true_false_list when it's true
          }
          goto end;
        }

      Attachments

        Activity

          Hi Roberto,

          Please provide the full data dump for dig_lotes.

          Another question,

          >> optimize "enum_field!=enum_value" removing the enum_value from enum declaration list.

          Would you be willing to implement the optiization and provide a patch?

          elenst Elena Stepanova added a comment - Hi Roberto, Please provide the full data dump for dig_lotes. Another question, >> optimize "enum_field!=enum_value" removing the enum_value from enum declaration list. Would you be willing to implement the optiization and provide a patch?

          hi elena, i don't have many contact with source code of mariadb, but i didn't understand if you got the main idea, something like...

          enum_field can have enum(x,y,z,....) values + wrong value of enum( out of fields )
          wrong values are normally ''

          before execute field!=value check if the field is a enum or not, if yes it could be optimized to field IN (possible values) (i didn't checked if NOT IN () is optimized but i'm sure that IN is optimized, that's the work around that i'm using to execute SQL via index instead full table scan)

          i don't know where could i help in the code :/
          maybe i could help if someone could show me some guide lines and explain what files should i read, or where i could study more about the code to change it.

          rspadim roberto spadim added a comment - hi elena, i don't have many contact with source code of mariadb, but i didn't understand if you got the main idea, something like... enum_field can have enum(x,y,z,....) values + wrong value of enum( out of fields ) wrong values are normally '' before execute field!=value check if the field is a enum or not, if yes it could be optimized to field IN (possible values) (i didn't checked if NOT IN () is optimized but i'm sure that IN is optimized, that's the work around that i'm using to execute SQL via index instead full table scan) i don't know where could i help in the code :/ maybe i could help if someone could show me some guide lines and explain what files should i read, or where i could study more about the code to change it.

          using the table of MDEV-4416
          you can check this:

          explain select  * from dig_lotes where lote_tipo='r' and sit_spa !='mov'

          result = Using where, 446577 rows, type= ALL

          and this (using 'enum field to optimize'):

          explain select  * from dig_lotes where lote_tipo='r' and sit_spa IN ('d','s','p','a','')

          result = Using where, 28 rows, key = consulta_lote, type = range

          CREATE TABLE `dig_lotes` (
          `lote_unidade` int(11) NOT NULL DEFAULT '0',
          `lote_data` date NOT NULL DEFAULT '0000-00-00',
          `lote_numero` int(11) NOT NULL DEFAULT '0',
          `lote_tipo` char(1) NOT NULL DEFAULT 'p',
          `total_ocorrencias` int(10) unsigned NOT NULL DEFAULT '0',
          `total_documentos` int(10) unsigned NOT NULL DEFAULT '0',
          `total_baixas` int(10) unsigned NOT NULL DEFAULT '0',
          `total_bancos` int(10) unsigned NOT NULL DEFAULT '0',
          `moeda_total` char(3) NOT NULL DEFAULT 'R$',
          `valor_total` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000',
          `valor_baixas` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000',
          `valor_bancos` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000',
          `digitador_tipo` enum('f','j') NOT NULL DEFAULT 'f',
          `digitador_id` int(11) NOT NULL DEFAULT '0',

          1. sit_spa was changed in MDEV 4416, to char(3) after i checked the slow query, i was thinking that CHAR(3) could make optimizer happy, but no results
            `sit_spa` ENUM('d','s','p','a','mov') NOT NULL DEFAULT 'd',

          `data_spa_d` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
          `data_spa_s` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
          `data_spa_p` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
          `data_spa_a` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
          `data_spa_mov` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
          `bordero` bigint(20) unsigned NOT NULL DEFAULT '0',
          `cessionario_tipo` enum('f','j') NOT NULL DEFAULT 'f',
          `cessionario_id` bigint(20) NOT NULL DEFAULT '0',
          `importado` varchar(25) NOT NULL,
          `numero_remessa_boleto` bigint(20) unsigned NOT NULL DEFAULT '0',
          `arquivo_importacao_id` int(10) unsigned NOT NULL DEFAULT '0',
          `observacao` text NOT NULL,
          `lote_data_origem` date NOT NULL DEFAULT '0000-00-00',
          `lote_numero_origem` int(11) NOT NULL DEFAULT '0',
          `cobrador_padrao_tipo` enum('f','j') NOT NULL DEFAULT 'f',
          `cobrador_padrao_id` int(11) NOT NULL DEFAULT '0',
          PRIMARY KEY (`lote_unidade`,`lote_data`,`lote_numero`),
          KEY `lote_origem` (`lote_unidade`,`lote_data_origem`,`lote_numero_origem`),
          KEY `consulta_lote` (`lote_tipo`,`sit_spa`(1),`lote_data`,`lote_numero`)
          )

          rspadim roberto spadim added a comment - using the table of MDEV-4416 you can check this: explain select * from dig_lotes where lote_tipo= 'r' and sit_spa != 'mov' result = Using where, 446577 rows, type= ALL and this (using 'enum field to optimize'): explain select * from dig_lotes where lote_tipo= 'r' and sit_spa IN ( 'd' , 's' , 'p' , 'a' , '' ) result = Using where, 28 rows, key = consulta_lote, type = range CREATE TABLE `dig_lotes` ( `lote_unidade` int(11) NOT NULL DEFAULT '0', `lote_data` date NOT NULL DEFAULT '0000-00-00', `lote_numero` int(11) NOT NULL DEFAULT '0', `lote_tipo` char(1) NOT NULL DEFAULT 'p', `total_ocorrencias` int(10) unsigned NOT NULL DEFAULT '0', `total_documentos` int(10) unsigned NOT NULL DEFAULT '0', `total_baixas` int(10) unsigned NOT NULL DEFAULT '0', `total_bancos` int(10) unsigned NOT NULL DEFAULT '0', `moeda_total` char(3) NOT NULL DEFAULT 'R$', `valor_total` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000', `valor_baixas` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000', `valor_bancos` decimal(17,5) unsigned NOT NULL DEFAULT '0.00000', `digitador_tipo` enum('f','j') NOT NULL DEFAULT 'f', `digitador_id` int(11) NOT NULL DEFAULT '0', sit_spa was changed in MDEV 4416, to char(3) after i checked the slow query, i was thinking that CHAR(3) could make optimizer happy, but no results `sit_spa` ENUM('d','s','p','a','mov') NOT NULL DEFAULT 'd', `data_spa_d` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `data_spa_s` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `data_spa_p` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `data_spa_a` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `data_spa_mov` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `bordero` bigint(20) unsigned NOT NULL DEFAULT '0', `cessionario_tipo` enum('f','j') NOT NULL DEFAULT 'f', `cessionario_id` bigint(20) NOT NULL DEFAULT '0', `importado` varchar(25) NOT NULL, `numero_remessa_boleto` bigint(20) unsigned NOT NULL DEFAULT '0', `arquivo_importacao_id` int(10) unsigned NOT NULL DEFAULT '0', `observacao` text NOT NULL, `lote_data_origem` date NOT NULL DEFAULT '0000-00-00', `lote_numero_origem` int(11) NOT NULL DEFAULT '0', `cobrador_padrao_tipo` enum('f','j') NOT NULL DEFAULT 'f', `cobrador_padrao_id` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`lote_unidade`,`lote_data`,`lote_numero`), KEY `lote_origem` (`lote_unidade`,`lote_data_origem`,`lote_numero_origem`), KEY `consulta_lote` (`lote_tipo`,`sit_spa`(1),`lote_data`,`lote_numero`) )

          here a smaller test table with the same 'problem':

          http://www.spadim.com.br/mdev%204419.sql

          rspadim roberto spadim added a comment - here a smaller test table with the same 'problem': http://www.spadim.com.br/mdev%204419.sql

          SEE THE DIFERENCE WITH TABLE OF LAST POST (5000 ROWS)

          1)
          flush status; flush query cache;
          select * from tmp_test where lote_tipo='r' and sit_spa !='mov';
          show status like 'handler_read%';
          RESULT => Handler_read_rnd_next=5001

          2)
          flush status; flush query cache;
          select * from tmp_test where lote_tipo='r' and sit_spa IN ('d','s','p','a','');
          show status like 'handler_read%';
          RESULT => Handler_read_key=5

          rspadim roberto spadim added a comment - SEE THE DIFERENCE WITH TABLE OF LAST POST (5000 ROWS) 1) flush status; flush query cache; select * from tmp_test where lote_tipo='r' and sit_spa !='mov'; show status like 'handler_read%'; RESULT => Handler_read_rnd_next=5001 2) flush status; flush query cache; select * from tmp_test where lote_tipo='r' and sit_spa IN ('d','s','p','a',''); show status like 'handler_read%'; RESULT => Handler_read_key=5
          rspadim roberto spadim added a comment - - edited

          an example in php...

          function optimize_diference_enum($enum_values,$enum_diff_value){
          $enum_possibilities=$enum_values;unset($enum_possibilities['mov']);
          $return=" IN (\"".implode(array_keys($enum_possibilities))."\")";
          }
          $enum_field_values=array(''=>-1,'s'=>0,'p'=>1,'a'=>2,'mov'=>3);
          $SQL="SELECT * FROM tmp_test WHERE sit_spa ".optimize_diference_enum($enum_field_values,'mov');


          i don't know where in mariadb code $enum_field_values could be checked when optimizing the query
          but with this 'array' of enum values we can remove the != value part and rewrite the query with a IN ()


          enum have a very nice point for optimizer, since we know what values can be found without reading the table! for example:
          a table with a field sit_spa => ENUM('s','p','a')

          1)select * from tmp_test where sit_spa='asdfasdf'
          this query will result 0 rows, since 'asdfasdf' is not in ENUM() declaration, and can return near to 0 seconds

          2)select * from tmp_test where sit_spa!='a'
          this query can be optimized to sit_spa IN ('','s','p') using ENUM() declaration, and if table have an index we can use it! (the best part of optimization)


          i will stop posting examples in this MDEV, it's too long

          rspadim roberto spadim added a comment - - edited an example in php... function optimize_diference_enum($enum_values,$enum_diff_value){ $enum_possibilities=$enum_values;unset($enum_possibilities ['mov'] ); $return=" IN (\"".implode(array_keys($enum_possibilities))."\")"; } $enum_field_values=array(''=>-1,'s'=>0,'p'=>1,'a'=>2,'mov'=>3); $SQL="SELECT * FROM tmp_test WHERE sit_spa ".optimize_diference_enum($enum_field_values,'mov'); i don't know where in mariadb code $enum_field_values could be checked when optimizing the query but with this 'array' of enum values we can remove the != value part and rewrite the query with a IN () enum have a very nice point for optimizer, since we know what values can be found without reading the table! for example: a table with a field sit_spa => ENUM('s','p','a') 1)select * from tmp_test where sit_spa='asdfasdf' this query will result 0 rows, since 'asdfasdf' is not in ENUM() declaration, and can return near to 0 seconds 2)select * from tmp_test where sit_spa!='a' this query can be optimized to sit_spa IN ('','s','p') using ENUM() declaration, and if table have an index we can use it! (the best part of optimization) i will stop posting examples in this MDEV, it's too long

          Hi Roberto,

          >> i don't have many contact with source code of mariadb, but i didn't understand if you got the main idea...

          I got your idea, it's just when somebody says that something is easy to implement, I hope they already know how exactly to implement it, hence the simplest way to get it done is to provide it in the form of a patch.

          Anyway, I will re-attach your test case to the report if you don't mind, so it's not lost, and will convert it into a task (a.k.a feature request).

          elenst Elena Stepanova added a comment - Hi Roberto, >> i don't have many contact with source code of mariadb, but i didn't understand if you got the main idea... I got your idea, it's just when somebody says that something is easy to implement, I hope they already know how exactly to implement it, hence the simplest way to get it done is to provide it in the form of a patch. Anyway, I will re-attach your test case to the report if you don't mind, so it's not lost, and will convert it into a task (a.k.a feature request).

          SQL from http://www.spadim.com.br/mdev%204419.sql is attached as mdev4419.tar.gz

          elenst Elena Stepanova added a comment - SQL from http://www.spadim.com.br/mdev%204419.sql is attached as mdev4419.tar.gz

          yes i agree with you, it's not 'easy' since i don't know how to do it
          there's some where that i can read the flow of:
          receive query -> parse -> optimize -> execute?

          rspadim roberto spadim added a comment - yes i agree with you, it's not 'easy' since i don't know how to do it there's some where that i can read the flow of: receive query -> parse -> optimize -> execute?
          rspadim roberto spadim added a comment - - edited

          Two points of optimization:
          first one is the rewrite of operators different from = , be optimized via 'rewrite' to IN operator (including the IN operator)
          the second is the 'impossible where' optimizations, executing the operator in each enum value, considering that index 0 of enum is "" and NULL is/isn't possible
          both optimizations should consider string representation of enum, and not the index value

          rspadim roberto spadim added a comment - - edited Two points of optimization: first one is the rewrite of operators different from = , be optimized via 'rewrite' to IN operator (including the IN operator) the second is the 'impossible where' optimizations, executing the operator in each enum value, considering that index 0 of enum is "" and NULL is/isn't possible both optimizations should consider string representation of enum, and not the index value

          to anyone who want help, maybe get_mm_leaf in opt_range.cc is the point to optimize, i don't know the source code, but i will register here, to don't forget or be a start point to anyone

          rspadim roberto spadim added a comment - to anyone who want help, maybe get_mm_leaf in opt_range.cc is the point to optimize, i don't know the source code, but i will register here, to don't forget or be a start point to anyone

          People

            Unassigned Unassigned
            rspadim roberto spadim
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.