[MDEV-6529] optimize VARCHAR temp storage during EITS ANALYZE Created: 2014-08-05  Updated: 2024-02-06

Status: Confirmed
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 11.5

Type: New Feature Priority: Critical
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 3
Labels: eits

Issue Links:
Blocks
is blocked by MDEV-6915 Allow packed keys and packed values o... Closed
Duplicate
is duplicated by MDEV-10283 SEGV on ANALYZE LOCAL TABLE __ PERSIS... Closed
Relates
relates to MDEV-21829 Use packed sort keys in Unique objects In Review
relates to MDEV-32472 Test histogram precision for ANALYZE,... In Progress
relates to MDEV-6181 EITS could eat all tmpdir space and hang Closed

 Description   

Filing this https://jira.mariadb.org/browse/MDEV-6181?focusedCommentId=57209&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-57209

as a separate issue:

Motivated by CSC#7897, I've tried to see how much space can ANALYZE use. (So far my understanding was that max. temp dir usage would be rougly equal to size of the table w/o indexes).

Let's try this:

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
 
create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
 
create table t1 (pk int primary key, col1 varchar(100)) charset=utf8;
insert into t1 select A.a+B.a*1000, concat('val-', A.a+B.a*1000) from one_k A, one_k B;

The table has 1M rows. Rows in the VARCHAR(100) column occupy less than
300 bytes (which is typical).

The table on disk:

 -rw-rw---- 1 psergey psergey  44M Aug  1 23:09 t1.ibd

Now,

analyze table t1 persistent for all;

  Breakpoint 4, unique_write_to_file_with_count (key=0x7ffebf2ce0a0 "\005", count=1, unique=0x7ffebf047ef0) at /home/psergey/dev2/10.0/sql/uniques.cc:53
(gdb) print key
  $7 = (uchar *) 0x7ffebf2ce0a0 "\005"
(gdb)  
(gdb) p unique->size
  $8 = 302

It's writing 300 bytes, the unpacked length.

In total, we get:

  Breakpoint 3, Unique::flush (this=0x7ffebf047ef0) at /home/psergey/dev2/10.0/sql/uniques.cc:379
  $46 = 281,734,200
  Breakpoint 3, Unique::flush (this=0x7ffebf047ef0) at /home/psergey/dev2/10.0/sql/uniques.cc:379
  $48 = 297,386,100
...
  Breakpoint 3, Unique::flush (this=0x7ffebf047a30) at /home/psergey/dev2/10.0/sql/uniques.cc:379
  $47 = 4,194,304

300MBit = 286 M, i.e. ANALYZE may require much more space than is occupied by the table (44M)



 Comments   
Comment by Elena Stepanova [ 2015-11-05 ]

Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdc       165139820 131460860  25290352  84% /data/repo
 
analyze table t1 persistent for all;
...
 
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdc       165139820 131759988  24991224  85% /data/repo
 
# analyze finished
 
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdc       165139820 131460860  25290352  84% /data/repo

Comment by Sergei Petrunia [ 2015-11-11 ]

Need to check with igor about the status of this.

Comment by Sergei Petrunia [ 2015-11-11 ]

There is a change in MySQL 5.7 that could be used for this. Igor has started merging it, but didn't finish.

Comment by Sergei Petrunia [ 2015-11-27 ]

The change I was talking about in the previous comment is MDEV-6915.

MDEV-6915 is about making "sorted buffer merge" code (from filesort.cc) to use efficient ways of storing varchar columns.
After that, we will be able to make Unique class to use this efficient storage format (will also need to use it for in-memory representation).

Comment by Abdul Manaf [ 2019-04-22 ]

Team - is there any plan to fix this in coming versions ?

Comment by Sergei Petrunia [ 2019-04-22 ]

navodaya05 this specific deficiency is not fixed. But there is a feature in MariaDB 10.4 which allows to have a workaround. It's "Bayesian sampling for histograms":

https://mariadb.com/kb/en/library/server-system-variables/#analyze_sample_percentage

The default is 100, which means "use the whole dataset", like pre-10.4 versions did. If you set it to a smaller value, then the histogram will be built from a sample, and thus require less space.
We recommend trying analyze_sample_percentage=0, which means the sample size will be determined automatically.

Comment by Michael Widenius [ 2022-03-14 ]

Shouldn't MDEV-6915 Allow packed keys and packed values of non-sorted fields in the sort buffer solve this or if not, can't we fix Unique to use the packed file format that filesort is using. As we are using the filesort functions to read the data already, it looks like a relative easy task.

Comment by Mason Sharp [ 2023-10-13 ]

Is hyperloglog a potential solution here for VARCHAR columns to determine selectivity and not use much space?

Comment by Michael Widenius [ 2023-11-20 ]

We have an old patch, MDEV-21829, for reducing the size of VARCHAR as part of UNIQUE originally created by Varun that I am involved in fixing.
I should have a solution for this withing the next 14 days.

Comment by Mason Sharp [ 2024-01-22 ]

Linked to MDEV-32472

Generated at Thu Feb 08 07:12:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.