Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-4576

cpimport from S3 Slower When Using Flags/Parameters

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Minor
    • Resolution: Fixed
    • 5.5.2
    • 6.3.1
    • None
    • None
    • 2021-4, 2021-5, 2021-6, 2021-7, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12, 2021-13, 2021-14, 2021-15, 2021-16, 2021-17

    Description

      This first example uses the command line flags which can be found by running cpimport -h:

      Method 1

      [root@ip-172-31-43-62 cpimport]# cpimport -d3 -e 1 -H s3.us-west-2.amazonaws.com -y FAKEAUTHKEY -K FAKEAUTHSECRET -t sample-columnstore-data -g us-west-2 bts flights all.csv -s ',' -E '"'

      Debug level set to 3
      Column delimiter : ,
      ModuleType pm
      InstallType 2
      PM Count 1
      CPI CmdLineArgs : /usr/bin/cpimport.bin -L /var/log/mariadb/columnstore/cpimport/ -e 1 -d 3 -s , -E " -P pm1-17231 -T SYSTEM -y AKIA4TJNISLRYEQX6IXL -K QJHhQ9/dHAav6UvFBEQ5QhyGgSqtD4fbyC5kL5YG -t sample-columnstore-data -g us-west-2 -H s3.us-west-2.amazonaws.com -u38ea2fef-adc0-4d82-b567-88dfadb788e4 bts flights all.csv
      Locale = en_US.UTF-8
      Debug level is set to 3
      Column delimiter : ,
      Enclosed by Character : "
      Using table OID 3016 as the default JOB ID
      Input file will be read from S3 Bucket : sample-columnstore-data, file/object : /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T175823_S757723_Job_3016.xml
      Job description file : /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T175823_S757723_Job_3016.xml
      Log file for this job: /var/log/mariadb/columnstore/cpimport/Job_3016.log
      2021-03-03 17:58:23 (17231:140074827606144) INFO : successfully loaded job file /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T175823_S757723_Job_3016.xml
      2021-03-03 17:58:23 (17231:140074827606144) INFO : Job file loaded, run time for this step : 0.0347769 seconds
      2021-03-03 17:58:23 (17231:140074827606144) INFO : PreProcessing check starts
      2021-03-03 17:58:23 (17231:140074827606144) INFO : PreProcessing check completed
      2021-03-03 17:58:23 (17231:140074827606144) INFO : preProcess completed, run time for this step : 0.180334 seconds
      2021-03-03 17:58:23 (17231:140074827606144) INFO : No of Read Threads Spawned = 1
      2021-03-03 17:58:23 (17231:140074827606144) INFO : No of Parse Threads Spawned = 3
      2021-03-03 18:00:56 (17231:140073043490560) INFO : Dictionary cache flush: 483328, 483328, 491520, 491520, 499712, 499894, 475136, 481816, 507904, 508716, 516096, 516939, 524288, 525293, 532480, 533172, 540672, 541684,
      2021-03-03 18:00:56 (17231:140073043490560) INFO : For table bts.flights: 38083735 rows processed and 38083735 rows inserted.
      2021-03-03 18:00:56 (17231:140074827606144) INFO : Bulk load completed, total run time : 153.233 seconds
      

      which takes 153.233 seconds.

      The next example uses the method described here: https://mariadb.com/kb/en/columnstore-bulk-data-loading/#bulk-loading-from-aws-s3

      Method 2

      [root@ip-172-31-43-62 cpimport]# aws s3 cp --quiet s3://sample-columnstore-data/all.csv - | cpimport -d 3 -e 1 bts flights -s ',' -E '"'

      Debug level set to 3
      Column delimiter : ,
      ModuleType pm
      InstallType 2
      PM Count 1
      CPI CmdLineArgs : /usr/bin/cpimport.bin -L /var/log/mariadb/columnstore/cpimport/ -e 1 -d 3 -s , -E " -P pm1-17281 -T SYSTEM -u9069b047-adf4-432f-87fe-62149c7302c7 bts flights
      Locale = en_US.UTF-8
      Debug level is set to 3
      Column delimiter : ,
      Enclosed by Character : "
      Using table OID 3016 as the default JOB ID
      Input file(s) will be read from : STDIN
      Job description file : /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T180157_S709597_Job_3016.xml
      Log file for this job: /var/log/mariadb/columnstore/cpimport/Job_3016.log
      2021-03-03 18:01:57 (17281:140276622887040) INFO : successfully loaded job file /var/log/mariadb/columnstore/data/bulk/tmpjob/3016_D20210303_T180157_S709597_Job_3016.xml
      2021-03-03 18:01:57 (17281:140276622887040) INFO : Job file loaded, run time for this step : 0.0354359 seconds
      2021-03-03 18:01:57 (17281:140276622887040) INFO : PreProcessing check starts
      2021-03-03 18:01:58 (17281:140276622887040) INFO : PreProcessing check completed
      2021-03-03 18:01:58 (17281:140276622887040) INFO : preProcess completed, run time for this step : 0.325933 seconds
      2021-03-03 18:01:58 (17281:140276622887040) INFO : No of Read Threads Spawned = 1
      2021-03-03 18:01:58 (17281:140275337512704) INFO : Reading input from STDIN to import into table bts.flights...
      2021-03-03 18:01:58 (17281:140276622887040) INFO : No of Parse Threads Spawned = 3
      2021-03-03 18:02:38 (17281:140275329120000) INFO : Dictionary cache flush: 483328, 699392, 874496, 1098752, 491520, 674816, 882688, 1123328, 500020, 683177, 899282, 1107259, 1275591, 714151, 897700, 1121671, 508919, 692684, 908058, 1091086, 517140, 717157, 916277, 1132055, 525585, 725868, 932899, 1140478, 533465, 732979, 924348, 1075005, 541975, 742210, 941112, 1083145,
      2021-03-03 18:02:38 (17281:140275329120000) INFO : For table bts.flights: 38083735 rows processed and 38083735 rows inserted.
      2021-03-03 18:02:39 (17281:140276622887040) INFO : Bulk load completed, total run time : 41.367 seconds
      

      which takes 41.367 seconds.

      We can see that Method 1 is almost 4x slower.

      Attachments

        Activity

          People

            dleeyh Daniel Lee (Inactive)
            toddstoffel Todd Stoffel (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.