Kolbe, many things can be done. However, don't forget that a Pivot table can be based on a Srcdef query, whose result is not described in any information schema table. Meanwhile, I have implemented and tested what I described in my last comment. Here is what can be said about it:
a) Nothing can be done about pivot tables based on a Srcdef query. The current way of processing them must be kept. The documentation should warn users that this is best avoided when the query refers to big tables.
b) For pivot tables based on tables or views, the CREATE TABLE statement first gets the columns via a select on the source limited to one row, which is very fast. However, the SELECT DISTINCT used to get the pivot column values is fast only if there is an index on that column; otherwise it takes as long as the old method. Note that currently the Pivot and Facts columns must be explicitly specified when using a view as the source table.
c) Even if the above can make the creation of a pivot table very fast, using the table may still take very long. This is because a Pivot result is based on a GROUP BY query on the source table, which can take a very long time to execute.
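For illustration, reading a pivot table is roughly equivalent to a GROUP BY query of the following shape (this is only a sketch: srctab, keycol, pcol, and fcol are placeholder names, 'v1' and 'v2' stand for the distinct pivot column values, and SUM stands for whatever aggregate function is used):

```sql
-- Hypothetical equivalent of reading a pivot table: one output column
-- per distinct value ('v1', 'v2', ...) of the pivot column pcol.
SELECT keycol,
       SUM(CASE WHEN pcol = 'v1' THEN fcol END) AS "v1",
       SUM(CASE WHEN pcol = 'v2' THEN fcol END) AS "v2"
FROM srctab
GROUP BY keycol;
```

On a big source table this GROUP BY must scan every row, which is why the pivot table itself remains slow to read even when its creation is fast.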
For instance, I made tests on a 10 million row table, diag, described as:

+-------------+-------------+------+-----+---------+--------+
| Field       | Type        | Null | Key | Default | Extra  |
+-------------+-------------+------+-----+---------+--------+
| Recordnb    | int(8)      | NO   | MUL | <null>  | <null> |
| Diagnb      | smallint(2) | NO   | MUL | <null>  | <null> |
| RLN         | varchar(9)  | NO   | MUL | <null>  | <null> |
| Diag        | int(5)      | NO   | MUL | <null>  | <null> |
| DiagPresent | char(1)     | NO   |     | <null>  | <null> |
+-------------+-------------+------+-----+---------+--------+
Because the Recordnb and DiagPresent columns should not be in the pivot table, I created a view:
create view vdiag as select RLN, diagnb, diag from diag;
And created the pivot table by:
create table diag_pivot engine=connect table_type=pivot tabname=vdiag option_list='PivotCol=diagnb,FncCol=diag';
The CREATE TABLE was very fast (less than one second) but using the table with:
select * from diag_pivot limit 50;
took almost two hours, even though I had made an index on diag on (RLN, diagnb). This is because MySQL does not use indexes when working on views (at least not in this case).
Next step:
create table diag_pivot2 engine=connect table_type=pivot tabname=diag option_list='PivotCol=diagnb,FncCol=diag';
describe diag_pivot2;
alter table diag_pivot2 drop recordnb, drop diagpresent;
select * from diag_pivot2 limit 500;
Here I used the base table, then dropped the unwanted columns. The creation was as fast, and the use was faster (20 minutes instead of almost 2 hours).
Well, the reason is that this reading of the source table may have two purposes:
1 - Find the default pivot and/or fact columns when they have not been specified in the create table statement.
2 - Take advantage of the table result to get the distinct pivot column values.
It is true that this is not optimized when pivoting very big tables. As a matter of fact, all tests made so far were made on relatively small tables, so as to have results of reasonable size.
However, you're right, this should be optimized in the case of big source tables. I shall study what can be done, for instance:
a) Skip step 1 above when the pivot and fact columns are specified.
b) If they are not, just get the source table column list, for instance by reading the table with LIMIT 1 (perhaps LIMIT 0 also works).
c) Once the pivot column is known, do the SELECT DISTINCT on it, as you said.
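A rough sketch of these optimized metadata queries, reusing the diag example above (the exact statements CONNECT would issue are an assumption on my part):

```sql
-- Get only the column names and types; the LIMIT 1 (or perhaps
-- LIMIT 0) avoids scanning the 10 million rows:
SELECT * FROM diag LIMIT 1;

-- Get the distinct pivot column values; fast only if diagnb
-- is indexed:
SELECT DISTINCT diagnb FROM diag;
```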
Thanks for your suggestion.