[MDEV-6024] CONNECT PIVOT should use SELECT DISTINCT instead of SELECT * - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Minor
Resolution: Fixed
Fix Version/s: 10.0.11
Component/s: None
Labels:
- connect-engine

Description

When creating a CONNECT PIVOT table, the engine executes SELECT * on the target table. This seems really inefficient to me.

In an ideal situation, there will be some usable index on the pivot column, so SELECT DISTINCT pivotCol will be much faster.

To find the names of columns, a separate query to find table structure could be executed. I can't think of why the engine would need to do SELECT * when creating the table.

Is there some explanation I'm missing?

Attachments

Activity

Ascending order - Click to sort in descending order

Olivier Bertrand added a comment - 2014-04-07 16:43

Well, the reason is that this table reading may have two purposes:

1 - Find the default pivot and/or fact columns when they have not been specified in the create table statement.
2 - Take advantage of the table result to get the distinct pivot column values.

It is true that this is not optimized when pivoting very big tables. As a matter of facts, all tests made so far where made on relatively small tables to have results of reasonable size.

However, you're right, this should be optimized in case of big source tables. I shall study what can be done, for instance:

a) Not doing the 1 above if the pivot and fact columns are specified.
b) If not, just get the source table columns list, for instance by reading the table with LIMIT 1 (perhaps 0 also works)
c) when the pivot column is known, do the select distinct on it as you said.

Thanks for your suggestion.

Olivier Bertrand added a comment - 2014-04-07 16:43 Well, the reason is that this table reading may have two purposes: 1 - Find the default pivot and/or fact columns when they have not been specified in the create table statement. 2 - Take advantage of the table result to get the distinct pivot column values. It is true that this is not optimized when pivoting very big tables. As a matter of facts, all tests made so far where made on relatively small tables to have results of reasonable size. However, you're right, this should be optimized in case of big source tables. I shall study what can be done, for instance: a) Not doing the 1 above if the pivot and fact columns are specified. b) If not, just get the source table columns list, for instance by reading the table with LIMIT 1 (perhaps 0 also works) c) when the pivot column is known, do the select distinct on it as you said. Thanks for your suggestion.

Kolbe Kegel (Inactive) added a comment - 2014-04-07 18:21

Olivier, it would also be possible to use DESC or a query against information_schema.tables to get the list of columns and their qualities (data types, nullability, et cetera). That could help you reject a column from automatic pivot selection if it is a BLOB or if it is nullable or something, if you ever in the future want to use some logic like that.

Kolbe Kegel (Inactive) added a comment - 2014-04-07 18:21 Olivier, it would also be possible to use DESC or a query against information_schema.tables to get the list of columns and their qualities (data types, nullability, et cetera). That could help you reject a column from automatic pivot selection if it is a BLOB or if it is nullable or something, if you ever in the future want to use some logic like that.

Olivier Bertrand added a comment - 2014-04-08 01:45

Kolbe, many things can be done. However, don't forget that a Pivot table can be made on a Srcdef query that is not in any information schema table. Meanwhile, I have implemented and tested what I described in my last comment. Here is what can be told about it:

a) Nothing can be done about pivot tables based on a Srcdef query. The current way to process them must be kept. The documentation should warn users that this is to avoid when referring to big tables.
b) On pivot tables based on tables or views, the CREATE TABLE statement is done firstly by getting the columns by a select on it limited to 1 row, which is very fast. However, the SELECT DISTINCT to get the pivot column values is fast only if an index is made on that column. Otherwise it is as long as the old way to do it. Note that currently the Pivot and Facts columns must be explicitly specified when using a view as source table.
c) Even if the above can make the creation of a pivot table very fast. Using the table still may prove to be very long. This is because a Pivot result is based on a group by query on the source table that can take a very long time to execute.

For instance I made tests on a 10 million rows table, diag described as:

Field	Type	Null	Key	Default	Extra
Recordnb	int(8)	NO	MUL	<null>	<null>
Diagnb	smallint(2)	NO	MUL	<null>	<null>
RLN	varchar(9)	NO	MUL	<null>	<null>
Diag	int(5)	NO	MUL	<null>	<null>
DiagPresent	char(1)	NO		<null>	<null>

Because the Recordnb and DiagPresent columns should not be in the pivot table I created a view:

create view vdiag as select RLN, diagnb, diag from diag;

And created the pivot table by:

create table diag_pivot engine=connect table_type=pivot tabname=vdiag option_list='PivotCol=diagnb,FncCol=diag';

The CREATE TABLE was very fast (less than one second) but using the table by:

select * from diag_pivot limit 50;

took almost two hours, even I made on diag an index on (RLN,diagnb). This is because MySQL does nor use indexes when working on views (at least not in this case)

Next step:

create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag';
describe diag_pivot2;
alter table diag_pivot2 drop recordnb, drop diagpresent;
select * from diag_pivot2 limit 500;

Here I used the table than dropped the unwanted columns. The creation was as fast and the use faster (20 minutes instead of almost 2 hours)

Olivier Bertrand added a comment - 2014-04-08 01:45 Kolbe, many things can be done. However, don't forget that a Pivot table can be made on a Srcdef query that is not in any information schema table. Meanwhile, I have implemented and tested what I described in my last comment. Here is what can be told about it: a) Nothing can be done about pivot tables based on a Srcdef query. The current way to process them must be kept. The documentation should warn users that this is to avoid when referring to big tables. b) On pivot tables based on tables or views, the CREATE TABLE statement is done firstly by getting the columns by a select on it limited to 1 row, which is very fast. However, the SELECT DISTINCT to get the pivot column values is fast only if an index is made on that column. Otherwise it is as long as the old way to do it. Note that currently the Pivot and Facts columns must be explicitly specified when using a view as source table. c) Even if the above can make the creation of a pivot table very fast. Using the table still may prove to be very long. This is because a Pivot result is based on a group by query on the source table that can take a very long time to execute. For instance I made tests on a 10 million rows table, diag described as: Field Type Null Key Default Extra Recordnb int(8) NO MUL <null> <null> Diagnb smallint(2) NO MUL <null> <null> RLN varchar(9) NO MUL <null> <null> Diag int(5) NO MUL <null> <null> DiagPresent char(1) NO <null> <null> Because the Recordnb and DiagPresent columns should not be in the pivot table I created a view: create view vdiag as select RLN, diagnb, diag from diag; And created the pivot table by: create table diag_pivot engine=connect table_type=pivot tabname=vdiag option_list='PivotCol=diagnb,FncCol=diag'; The CREATE TABLE was very fast (less than one second) but using the table by: select * from diag_pivot limit 50; took almost two hours, even I made on diag an index on (RLN,diagnb). This is because MySQL does nor use indexes when working on views (at least not in this case) Next step: create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag'; describe diag_pivot2; alter table diag_pivot2 drop recordnb, drop diagpresent; select * from diag_pivot2 limit 500; Here I used the table than dropped the unwanted columns. The creation was as fast and the use faster (20 minutes instead of almost 2 hours)

Olivier Bertrand added a comment - 2014-04-08 19:32

In addition to what has been said in my last comments, I added a new option to the Pivot table allowing to make discovery skip unwanted columns from a source table. Doing so the last create example is:

create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag,SkipCol=recordnb;diagpresent';

Doing so there is no more need to create a view or to have to drop columns with an alter table.

Note that because of the syntax of OPTION_LIST, the columns to skip must be separated by semi-colons.

Olivier Bertrand added a comment - 2014-04-08 19:32 In addition to what has been said in my last comments, I added a new option to the Pivot table allowing to make discovery skip unwanted columns from a source table. Doing so the last create example is: create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag,SkipCol=recordnb;diagpresent'; Doing so there is no more need to create a view or to have to drop columns with an alter table. Note that because of the syntax of OPTION_LIST, the columns to skip must be separated by semi-colons.

MariaDB Server

CONNECT PIVOT should use SELECT DISTINCT instead of SELECT *

Details

Description

Attachments

Activity

People

Dates

Git Integration