[MDEV-17886] Benchmark speed of EITS ANALYZE TABLE, histogram collection Created: 2018-12-02  Updated: 2019-02-27  Resolved: 2018-12-02

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Done Votes: 0
Labels: benchmarking, eits

Issue Links:
Relates
relates to MDEV-18758 Test histograms precision Open

 Description   

This task is for writing down the results of EITS' variant of ANALYZE TABLE . That is, we want to know whether it is expensive to collect EITS histograms. (We know it is expensive as it involves a full table scan and a Unique object but we want to know the size of the overhead)



 Comments   
Comment by Sergei Petrunia [ 2018-12-02 ]

An example of EITS data load speed

Hi, so I was messing with a big table with real customer data for an optimizer issue.
Thought I would use that opportunity to check EITS speed.
 
 
Table name: changed to t1 
Table rows:  740M
Table size on disk: 89G
Loading data from mysqldump: 151 minutes (this is with flush_log_at_trx_commit=0, etc)
Machine: AWS with SSD, 16G RAM, 4G buffer pool.
 
CHECKSUM TABLE t1: 11 minutes
ANALYZE TABLE t1 PERSISTENT FOR ALL: 1 hour 5 min
 
I got the same time for two runs regardless of whether histogram_size=0 (so
it doesn't collect a histogram) or histogram_size=128, DOUBLE_PREC_HB.
 
In both cases, the server used 3.8G extra disk space at the end of ANALYZE.
 
Take-ways:
- Collecting a histogram is heavier than just reading the table. in our case, 6x times
- for non-utf8 columns, the time is tolerable.

Comment by Sergei Petrunia [ 2018-12-02 ]

Comparison with MySQL 8

TL;DR:

  • MySQL works at full table scan speed. Histogram collection is minor when compared to scanning through the entire table.
  • MariaDB is about the same when collecting histogram for one column. When collecting histograms for many columns, it's much more expensive than a full table scan.

Machine: psergey's t480, Samsung PCI-E SSD, 16G RAM

=== MySQL 8.0 ==
Re-run on MySQL-8.0 current, release build.
 
Time to load: 4.43 hrs.
Dataset size: 89G t1.ibd
Binlog Size: ~34G.  (might have some extra data)
 
CHECKSUM TABLE t1; -- 5 min 58 sec.
ANALYZE TABLE t1 UPDATE HISTOGRAM ON <col1, col2, ...>; -- 6 Min
 
ANALYZE TABLE t1 UPDATE HISTOGRAM on col1; -- 5 min 45 sec
ANALYZE TABLE t1 UPDATE HISTOGRAM on col2; -- 6 min

=== MariaDB 10.3 ===
 
Time to load: 4.08 hrs.
Dataset size: 89G t1.ibd
 
checksum table t1; -- 5 min 40 sec.
 
analyze table t1 persistent for all;  -- 32 min
 
analyze table t1 persistent for columns(col1) indexes () -- 5 min
analyze table t1 persistent for columns(col2) indexes () -- 4 min

Comment by Varun Gupta (Inactive) [ 2019-02-19 ]

MariaDB 10.4

MariaDB [test]> checksum table t1;
   +----------------------+------------+
| Table                | Checksum   |
+----------------------+------------+
| test.t1              | 2149892167 |
+----------------------+------------+
1 row in set (8 min 59.812 sec)

Without sampling

MariaDB [test]> analyze table t1  persistent for columns(taskid) indexes ();
+----------------------+---------+----------+-----------------------------------------+
| Table                | Op      | Msg_type | Msg_text                                |
+----------------------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+----------------------+---------+----------+-----------------------------------------+
2 rows in set (8 min 17.281 sec)
 
MariaDB [test]> analyze table t1  persistent for columns(subid) indexes ();
+----------------------+---------+----------+-----------------------------------------+
| Table                | Op      | Msg_type | Msg_text                                |
+----------------------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+----------------------+---------+----------+-----------------------------------------+
2 rows in set (6 min 4.782 sec)
 
MariaDB [test]> analyze table t1  persistent for all;
+----------------------+---------+----------+-----------------------------------------+
| Table                | Op      | Msg_type | Msg_text                                |
+----------------------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+----------------------+---------+----------+-----------------------------------------+
2 rows in set (45 min 2.379 sec)

With sampling

MariaDB [test]> analyze table t1  persistent for columns(taskid) indexes ();
+----------------------+---------+----------+-----------------------------------------+
| Table                | Op      | Msg_type | Msg_text                                |
+----------------------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+----------------------+---------+----------+-----------------------------------------+
2 rows in set (4 min 48.113 sec)
 
MariaDB [test]> analyze table t1  persistent for all;
+----------------------+---------+----------+-----------------------------------------+
| Table                | Op      | Msg_type | Msg_text                                |
+----------------------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+----------------------+---------+----------+-----------------------------------------+
2 rows in set (24 min 27.986 sec)
 
MariaDB [test]> analyze table t1  persistent for columns(subid) indexes ();
+----------------------+---------+----------+-----------------------------------------+
| Table                | Op      | Msg_type | Msg_text                                |
+----------------------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+----------------------+---------+----------+-----------------------------------------+
2 rows in set (5 min 54.834 sec)
 
MariaDB [test]> ANALYZE TABLE t1 persistent for columns( col1, col2, col3, col4, col5, col6, col7, col8) indexes ();
+---------+---------+----------+-----------------------------------------+
| Table   | Op      | Msg_type | Msg_text                                |
+---------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected |
| test.t1 | analyze | status   | OK                                      |
+---------+---------+----------+-----------------------------------------+
2 rows in set (6 min 48.440 sec)

Comment by Varun Gupta (Inactive) [ 2019-02-20 ]

histogram benchmarks mysql

histogram benchmarks mysql
 
mysql> CHECKSUM TABLE t1;
+---------+------------+
| Table   | Checksum   |
+---------+------------+
| test.t1 | 1553974534 |
+---------+------------+
1 row in set (7 min 58.96 sec)
 
 
mysql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON col1, col2, col3, col4, col5, col6, col7, col8; +---------+-----------+----------+-------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                        |
+---------+-----------+----------+-------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics created for column 'col3'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col8'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col4'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col5'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col2'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col1'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col6'. |
| test.t1 | histogram | status   | Histogram statistics created for column 'col7'. |
+---------+-----------+----------+-------------------------------------------------+
8 rows in set (8 min 38.69 sec)
 
mysql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON col1;
+---------+-----------+----------+---------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                          |
+---------+-----------+----------+---------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics created for column 'col1'.   |
+---------+-----------+----------+---------------------------------------------------+
1 row in set (7 min 24.30 sec)
 
mysql> ANALYZE TABLE t1 UPDATE HISTOGRAM ON col2;
+---------+-----------+----------+--------------------------------------------------+
| Table   | Op        | Msg_type | Msg_text                                         |
+---------+-----------+----------+--------------------------------------------------+
| test.t1 | histogram | status   | Histogram statistics created for column 'col2'.  |
+---------+-----------+----------+--------------------------------------------------+
1 row in set (7 min 29.38 sec)

Comment by Sergei Petrunia [ 2019-02-20 ]

Summarizing Varun's results.

MariaDB 10.4

checksum table t1; -- 9 min
 
= Without sampling ==
analyze table t1 persistent for columns(col1) indexes (); - 8 min 17 sec
analyze table t1  persistent for columns(col2) indexes (); - 6 min 4 sec
analyze table t1  persistent for all; - 45 min
 
== With sampling, automatic sample size ==
analyze table t1  persistent for columns(col1) indexes (); - 4 min 48 sec
analyze table t1  persistent for columns(col2) indexes ();  - 5 min 54 sec
analyze table t1 persistent for columns(col1, col2, col3, col4, 
  col5, col6, col7, col8) indexes ();  - 6 min 48 sec
 
analyze table t1  persistent for all;                        - 24 min 30 sec

MySQL 8.0

CHECKSUM TABLE t1;                              --  7 min 58 sec
 
ANALYZE TABLE t1 UPDATE HISTOGRAM ON col1, col2, 
  col3, col4, col5, col6, col7, col8;            -- 8 min 38 sec
ANALYZE TABLE t1 UPDATE HISTOGRAM ON col1;       -- 7 min 24 sec
ANALYZE TABLE t1 UPDATE HISTOGRAM ON col2;       -- 7 min 29 sec

Comment by Sergei Petrunia [ 2019-02-20 ]

Takeaways:

  • With sampling, the speed of ANALYZE .. PERSISTENT FOR COLUMNS in MariaDB becomes the same as the speed of full table scan.
    ANALYZE ... PERSISTENT FOR INDEXES(...) can be more expensive due to full index scans that are performed to collect index statistics.
    That is, ANALYZE... PERSISTENT FOR ALL () can be noticeably slower than just doing ANALYZE for columns.
Generated at Thu Feb 08 08:39:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.