[MCOL-1368] join on table with different charset performance Created: 2018-04-25  Updated: 2021-01-16  Resolved: 2021-01-16

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: 1.1.4
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: patrice Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None


 Description   

hi,

I discovered that joining table with different charset (latin1 and utf8) on (varchar to be verified that it is not also on numeric ) result in a very poor performance and erratic behavior.

If using with statement and stays on the UM, the UM process is allocating memory until no more (128G) ,if pushed down to the pm , memory stay stable but the query is very slow, insert select into is not finishing after more than 1hour, whereas when the table are on the same charset, this same operation take 5s.

I verified both way all table in utf8 and all table in latin1, it works even though latin1 seems faster.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-04-25 ]

Can you please give an example of what you are doing in both the fast and slow cases so that we can reproduce it here?

Comment by patrice [ 2018-04-25 ]

The whole thing is bot suited for here, I can do a videoconf to demonstrate it. Not a sure a minimal example will show the behavior.

Comment by patrice [ 2018-04-25 ]

Here is the code, even with such a low number of row, the trace show the difference, Not that without the join on the integer has better performance, the degradation is when both condition are on the join :

drop table toto_u;
drop table toto_l;
drop table feed;
create table toto_l (a int,b varchar(48)) engine=columnstore ;
create table feed (a int,b varchar(48)) engine=columnstore default charset=utf8 ;
create table toto_u (a int,b varchar(48)) engine=columnstore default charset=utf8;

CREATE PROCEDURE `insert_it`(p1 int)
begin
declare x integer;
set x=0;
label1: LOOP
SET x = x + 1;
insert into feed values (10,md5(rand())),(10,md5(rand())),(10,md5(rand())),(10,md5(rand())),(10,md5(rand())),(10,md5(rand()));
IF x < p1 THEN
ITERATE label1;
END if;
LEAVE label1;
END LOOP label1;
END

call insert_it(200);

#5 times
insert into toto_u
select * from sc.feed order by b limit 100;

insert into toto_l
select * from toto_u;

select calSetTrace(1);
select * from toto_l
join feed on (feed.b=toto_l.b and feed.a=toto_l.a);
select calgettrace();
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

calgettrace()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows
BPS PM feed 11762 (a,b) 0 8 0 0.004 1200
BPS PM toto_l 11758 (a,b) 0 4 0 0.338 500
HJS PM toto_l-feed 11758 - - - - ----- -

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

select * from toto_u
join feed on (feed.b=toto_u.b and feed.a=toto_u.a);
select calgettrace();
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

calgettrace()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows
BPS PM feed 11762 (a,b) 0 8 0 0.004 1200
BPS PM toto_u 11766 (a,b) 0 14 0 0.004 500
HJS PM toto_u-feed 11766 - - - - ----- -

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

select * from toto_l join feed on (feed.b=toto_l.b);

MariaDB [sc]> select calgettrace();
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

calgettrace()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows
BPS PM feed 11762 (a,b) 0 8 0 0.002 1200
BPS PM toto_l 11758 (a,b) 0 6 0 0.006 500
HJS PM toto_l-feed 11758 - - - - ----- -

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Comment by Andrew Hutchings (Inactive) [ 2018-04-26 ]

It should probably go without saying, but joins on dictionary columns in ColumnStore is generally a bad idea, also mixing character sets in an equality match is a bad idea in ColumnStore.

That said I don't have an instant explanation as to why that particular query would be slower than the last query.

The thing that may give us more information before we have to dive into profiling tools is doing calgettrace(1) instead which will give a little more breakdown of what was executed.

Comment by patrice [ 2018-04-26 ]

1) Sometimes this can't be avoided, varchar are important in our case. The saying cannot be "do not use long varchar, it is compatible but not really optimized...", should it not be "can we have a better performance/support for those". Are those joins creating a joint-dictionnary or of a double call comparison for all rows ?

2) the charset mix was more by inadvertence than a real design decision, it should work even if less than optimal, maybe a warning should be given ?

3) It is not only about slowness, it is also about memory "leak" or consumption out of control. there should be no reason that different charset result in an very large memory allocation, and that it seems to only happen with both condition on int and varchar.

It should be reproducible from the example I gave.

Generated at Thu Feb 08 02:28:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.