Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6024

CONNECT PIVOT should use SELECT DISTINCT instead of SELECT *

Details

    Description

      When creating a CONNECT PIVOT table, the engine executes SELECT * on the target table. This seems really inefficient to me.

      In an ideal situation, there will be some usable index on the pivot column, so SELECT DISTINCT pivotCol will be much faster.

      To find the names of columns, a separate query to find table structure could be executed. I can't think of why the engine would need to do SELECT * when creating the table.

      Is there some explanation I'm missing?

      Attachments

        Activity

          Well, the reason is that this table reading may have two purposes:

          1 - Find the default pivot and/or fact columns when they have not been specified in the create table statement.
          2 - Take advantage of the table result to get the distinct pivot column values.

          It is true that this is not optimized when pivoting very big tables. As a matter of facts, all tests made so far where made on relatively small tables to have results of reasonable size.

          However, you're right, this should be optimized in case of big source tables. I shall study what can be done, for instance:

          a) Not doing the 1 above if the pivot and fact columns are specified.
          b) If not, just get the source table columns list, for instance by reading the table with LIMIT 1 (perhaps 0 also works)
          c) when the pivot column is known, do the select distinct on it as you said.

          Thanks for your suggestion.

          bertrandop Olivier Bertrand added a comment - Well, the reason is that this table reading may have two purposes: 1 - Find the default pivot and/or fact columns when they have not been specified in the create table statement. 2 - Take advantage of the table result to get the distinct pivot column values. It is true that this is not optimized when pivoting very big tables. As a matter of facts, all tests made so far where made on relatively small tables to have results of reasonable size. However, you're right, this should be optimized in case of big source tables. I shall study what can be done, for instance: a) Not doing the 1 above if the pivot and fact columns are specified. b) If not, just get the source table columns list, for instance by reading the table with LIMIT 1 (perhaps 0 also works) c) when the pivot column is known, do the select distinct on it as you said. Thanks for your suggestion.

          Olivier, it would also be possible to use DESC or a query against information_schema.tables to get the list of columns and their qualities (data types, nullability, et cetera). That could help you reject a column from automatic pivot selection if it is a BLOB or if it is nullable or something, if you ever in the future want to use some logic like that.

          kolbe Kolbe Kegel (Inactive) added a comment - Olivier, it would also be possible to use DESC or a query against information_schema.tables to get the list of columns and their qualities (data types, nullability, et cetera). That could help you reject a column from automatic pivot selection if it is a BLOB or if it is nullable or something, if you ever in the future want to use some logic like that.

          Kolbe, many things can be done. However, don't forget that a Pivot table can be made on a Srcdef query that is not in any information schema table. Meanwhile, I have implemented and tested what I described in my last comment. Here is what can be told about it:

          a) Nothing can be done about pivot tables based on a Srcdef query. The current way to process them must be kept. The documentation should warn users that this is to avoid when referring to big tables.
          b) On pivot tables based on tables or views, the CREATE TABLE statement is done firstly by getting the columns by a select on it limited to 1 row, which is very fast. However, the SELECT DISTINCT to get the pivot column values is fast only if an index is made on that column. Otherwise it is as long as the old way to do it. Note that currently the Pivot and Facts columns must be explicitly specified when using a view as source table.
          c) Even if the above can make the creation of a pivot table very fast. Using the table still may prove to be very long. This is because a Pivot result is based on a group by query on the source table that can take a very long time to execute.

          For instance I made tests on a 10 million rows table, diag described as:

          Field Type Null Key Default Extra
          Recordnb int(8) NO MUL <null> <null>
          Diagnb smallint(2) NO MUL <null> <null>
          RLN varchar(9) NO MUL <null> <null>
          Diag int(5) NO MUL <null> <null>
          DiagPresent char(1) NO   <null> <null>

          Because the Recordnb and DiagPresent columns should not be in the pivot table I created a view:

          create view vdiag as select RLN, diagnb, diag from diag;

          And created the pivot table by:

          create table diag_pivot engine=connect table_type=pivot tabname=vdiag option_list='PivotCol=diagnb,FncCol=diag';

          The CREATE TABLE was very fast (less than one second) but using the table by:

          select * from diag_pivot limit 50;

          took almost two hours, even I made on diag an index on (RLN,diagnb). This is because MySQL does nor use indexes when working on views (at least not in this case)

          Next step:

          create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag';
          describe diag_pivot2;
          alter table diag_pivot2 drop recordnb, drop diagpresent;
          select * from diag_pivot2 limit 500;

          Here I used the table than dropped the unwanted columns. The creation was as fast and the use faster (20 minutes instead of almost 2 hours)

          bertrandop Olivier Bertrand added a comment - Kolbe, many things can be done. However, don't forget that a Pivot table can be made on a Srcdef query that is not in any information schema table. Meanwhile, I have implemented and tested what I described in my last comment. Here is what can be told about it: a) Nothing can be done about pivot tables based on a Srcdef query. The current way to process them must be kept. The documentation should warn users that this is to avoid when referring to big tables. b) On pivot tables based on tables or views, the CREATE TABLE statement is done firstly by getting the columns by a select on it limited to 1 row, which is very fast. However, the SELECT DISTINCT to get the pivot column values is fast only if an index is made on that column. Otherwise it is as long as the old way to do it. Note that currently the Pivot and Facts columns must be explicitly specified when using a view as source table. c) Even if the above can make the creation of a pivot table very fast. Using the table still may prove to be very long. This is because a Pivot result is based on a group by query on the source table that can take a very long time to execute. For instance I made tests on a 10 million rows table, diag described as: Field Type Null Key Default Extra Recordnb int(8) NO MUL <null> <null> Diagnb smallint(2) NO MUL <null> <null> RLN varchar(9) NO MUL <null> <null> Diag int(5) NO MUL <null> <null> DiagPresent char(1) NO   <null> <null> Because the Recordnb and DiagPresent columns should not be in the pivot table I created a view: create view vdiag as select RLN, diagnb, diag from diag; And created the pivot table by: create table diag_pivot engine=connect table_type=pivot tabname=vdiag option_list='PivotCol=diagnb,FncCol=diag'; The CREATE TABLE was very fast (less than one second) but using the table by: select * from diag_pivot limit 50; took almost two hours, even I made on diag an index on (RLN,diagnb). This is because MySQL does nor use indexes when working on views (at least not in this case) Next step: create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag'; describe diag_pivot2; alter table diag_pivot2 drop recordnb, drop diagpresent; select * from diag_pivot2 limit 500; Here I used the table than dropped the unwanted columns. The creation was as fast and the use faster (20 minutes instead of almost 2 hours)

          In addition to what has been said in my last comments, I added a new option to the Pivot table allowing to make discovery skip unwanted columns from a source table. Doing so the last create example is:

          create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag,SkipCol=recordnb;diagpresent';

          Doing so there is no more need to create a view or to have to drop columns with an alter table.

          Note that because of the syntax of OPTION_LIST, the columns to skip must be separated by semi-colons.

          bertrandop Olivier Bertrand added a comment - In addition to what has been said in my last comments, I added a new option to the Pivot table allowing to make discovery skip unwanted columns from a source table. Doing so the last create example is: create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag,SkipCol=recordnb;diagpresent'; Doing so there is no more need to create a view or to have to drop columns with an alter table. Note that because of the syntax of OPTION_LIST, the columns to skip must be separated by semi-colons.

          People

            bertrandop Olivier Bertrand
            kolbe Kolbe Kegel (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.