Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-5523

SELECT INTO OUTFILE | Performance

    XMLWordPrintable

Details

    • 2024-2

    Description

      Customers need an easy way to export data to disk or local to the client in a performant way without accessing the command line. See simplified reproduction at the bottom

      Customer Summary
      Customer case runs on 40 billion records table
      3 seconds just the select,
      19 minutes with OUTFILE
      3 seconds with mariadb -q > file.tbl
      all yielding 1128 results

      Query runs fast alone

      multi-unmanaged-db-p09 [(none)]> select year_id, location_id, count(*)  from gbd.output_injury_single_year_v14778_incidence_dd_4401 where measure_id=6 and year_id in (2018) group by year_id, location_id order by year_id, location_id;
      1128 rows in set (3.001 sec)
      

      Runs unacceptably slow with OUTFILE

      MariaDB [(none)]> select year_id, location_id, count(*) into outfile '/tmp/test1.csv' from gbd.output_injury_single_year_v14778_incidence_dd_4401 where measure_id=6 and year_id in (2018) group by year_id, location_id order by year_id, location_id;
      Query OK, 1128 rows affected (19 min 22.171 sec)
      

      Work around

      mariadb-dba@multi-unmanaged-db-p09:/var/lib/mysql$ time mysql -q -e "select year_id, location_id, count(*) from gbd.output_injury_single_year_v14778_incidence_dd_4401 where measure_id=6 and year_id in (2018) group by year_id, location_id order by year_id, location_id" > /tmp/test6.csv
       
      real	0m3.030s
      

      Reproduction

      DB="test"
      TABLE="delete_test"
      RECORDS=12500000
      MAX_VALUE=100
      mariadb -e "create database if not exists $DB;"
      mariadb $DB -e "DROP TABLE IF EXISTS $TABLE";
      mariadb $DB -e "CREATE TABLE $TABLE( number1 int, group_name CHAR(2), number2 DECIMAL(12, 2), var varchar(1), column3 int, column4 int, column5 int, column6 int, column7 int, column8 int, column9 int, column10 int, column11 int, column12 int, column13 int, column14 int, column15 int, column16 int, column17 int, column18 int, column19 int, column20 int, column21 int, column22 int, column23 int, column24 int, column25 int, column26 int, column27 int, column28 int, column29 int, column30 int, column31 int, column32 int, column33 int, column34 int, column35 int, column36 int, column37 int, column38 int, column39 int, column40 int, column41 int, column42 int, column43 int, column44 int, column45 int, column46 int, column47 int, column48 int, column49 int, column50 int) ENGINE=COLUMNSTORE;"
      time mariadb $DB -e "INSERT INTO $TABLE SELECT ROUND(RAND() * $MAX_VALUE, 0), CAST(ROUND(RAND() * 10, 0) AS CHAR), ROUND(RAND() * 1000000, 2), substring(MD5(RAND()),1,1), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0) FROM seq_1_to_$RECORDS;"
       
       
      time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" | wc -l
      rm -rf /tmp/test1.csv;
      time mariadb $DB -e "select number1, count(*) into outfile '/tmp/test1.csv' from $TABLE where number2>=50 group by number1;";
      wc -l /tmp/test1.csv; rm -rf /tmp/test1.csv;
      time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" > /tmp/test1.csv;
      wc -l /tmp/test1.csv; 
      

      Reproduction results

      • note 102 vs 101 because of headers included

        [root@ip-172-31-21-130 centos]# time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" | wc -l
        102
         
        real	0m0.494s               * expected
        user	0m0.001s
        sys	0m0.009s
        [root@ip-172-31-21-130 centos]# rm -rf /tmp/test1.csv;
        [root@ip-172-31-21-130 centos]# time mariadb $DB -e "select number1, count(*) into outfile '/tmp/test1.csv' from $TABLE where number2>=50 group by number1;";
         
        real	0m22.484s              * slow ( in this repo its ~40x slower, customer case is ~385x slower )
        user	0m0.004s
        sys	0m0.004s
        [root@ip-172-31-21-130 centos]# wc -l /tmp/test1.csv; rm -rf /tmp/test1.csv;
        101 /tmp/test1.csv
        [root@ip-172-31-21-130 centos]# time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" > /tmp/test1.csv;
         
        real	0m0.296s                * expected
        user	0m0.010s
        sys	0m0.000s
        [root@ip-172-31-21-130 centos]# wc -l /tmp/test1.csv;
        102 /tmp/test1.csv
        

      Attachments

        Activity

          People

            leonid.fedorov Leonid Fedorov
            allen.herrera Allen Herrera
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.