[MCOL-3270] Improve cpimport ingest speed into Dictionary columns Created: 2019-04-18  Updated: 2020-08-25  Resolved: 2019-04-19

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport
Affects Version/s: 1.2.3
Fix Version/s: 1.2.4

Type: New Feature Priority: Major
Reporter: Roman Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
causes MCOL-3395 regression: dictionary de-duplication... Closed
Sprint: 2019-04

 Description   

Given a data set of 800 000 000 records with a couple of Dictionary columns containing many equal-length strings, it took 4 167 seconds to ingest the data set into CS.
After the patch it takes only 467 seconds.

There were two main sources of latency:

  • Dctnry::getTokenFromArray represented the de-dup buffer as an array and called memcpy for every equal-sized string
  • COND_WAIT_SECONDS was 3 seconds by default
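
To illustrate the first point, here is a minimal sketch of why an array-backed de-dup buffer gets slow when many strings share a length. It is not the ColumnStore code: ArrayDedup, HashedDedup, tokenFor and buffersAgree are hypothetical names, the byte-wise comparison is an assumption about the per-string work, and the hashed container is just one possible replacement, not necessarily what the patch uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical array-backed de-dup buffer: every lookup scans all stored
// strings and touches the bytes of each one whose length matches.
class ArrayDedup {
public:
    // Returns the token (index) for s, inserting s if it is new.
    std::size_t tokenFor(const std::string& s) {
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            const std::string& e = entries_[i];
            // With many equal-length strings the length check rarely
            // filters anything, so almost every entry is byte-compared.
            if (e.size() == s.size() &&
                std::memcmp(e.data(), s.data(), s.size()) == 0) {
                return i;
            }
        }
        entries_.push_back(s);
        return entries_.size() - 1;
    }
    std::size_t size() const { return entries_.size(); }

private:
    std::vector<std::string> entries_;
};

// Hypothetical hashed alternative: one hash computation plus, on average,
// a constant number of comparisons per lookup.
class HashedDedup {
public:
    // Returns the token for s, assigning the next free token if s is new.
    std::size_t tokenFor(const std::string& s) {
        // emplace leaves the map unchanged when the key already exists.
        auto result = tokens_.emplace(s, tokens_.size());
        return result.first->second;
    }
    std::size_t size() const { return tokens_.size(); }

private:
    std::unordered_map<std::string, std::size_t> tokens_;
};

// Feeds the same input to both buffers and checks that they agree.
bool buffersAgree(const std::vector<std::string>& input) {
    ArrayDedup a;
    HashedDedup h;
    for (const auto& s : input) {
        if (a.tokenFor(s) != h.tokenFor(s)) return false;
    }
    return a.size() == h.size();
}
```

Both classes hand out the same tokens for the same input; the difference is that the array version does work proportional to the number of buffered strings on each lookup, while the hashed version does not.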


 Comments   
Comment by Roman [ 2019-04-18 ]

For QA: you could test the patch by comparing the ingestion speed of text or varchar columns with and without the patch, using at least 10 000 000 records. There should be a noticeable difference in timings.

Comment by Daniel Lee (Inactive) [ 2019-04-19 ]

Build verified: 1.2.4-1 nightly

[dlee@master centos7]$ cat gitversionInfo.txt
server commit:
137b9a8
engine commit:
b3a7559

Dataset tested: 10 GB dbt3

orders table has 15,000,000 rows
lineitem table has 59,986,052 rows
plus 6 smaller tables.

Performed cpimport timing test on both 1.2.2-1 and 1.2.4-1. 1.2.4-1 is about 2.5 times faster. Disk space utilization remained the same.

Also with 1.2.4-1, loaded two 1 GB dbt3 databases: the columnstore database was loaded using cpimport and the innodb database was loaded using LDI. Verified that all varchar columns in the orders table are identical between the two databases using a cross-engine join.

1.2.2-1

[root@localhost columnstore]# time /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch10 columnstore 10g

real 8m53.788s
user 10m56.330s
sys 0m38.518s

[root@localhost columnstore]# du -sh data1
9.2G data1

1.2.4-1

[root@localhost columnstore]# du -sh data1
739M data1
[root@localhost columnstore]# cd

[root@localhost ~]# time /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch10 columnstore 10g

real 3m27.598s
user 3m44.575s
sys 0m36.012s

[root@localhost columnstore]# du -sh data1
9.2G data1
