[MCOL-5523] SELECT INTO OUTFILE | Performance Created: 2023-06-29  Updated: 2024-01-18

Status: Open
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: None
Fix Version/s: Icebox

Type: New Feature Priority: Critical
Reporter: Allen Herrera Assignee: Max Mether
Resolution: Unresolved Votes: 0
Labels: triage


 Description   

Customers need an easy, performant way to export data to disk on the server, or locally on the client, without resorting to the command line. See the simplified reproduction at the bottom.

Customer Summary
The customer case runs against a table with 40 billion records:

  • 3 seconds for the SELECT alone
  • 19 minutes with INTO OUTFILE
  • 3 seconds with mariadb -q > file.tbl

All three yield 1128 rows.

The query runs fast on its own:

multi-unmanaged-db-p09 [(none)]> select year_id, location_id, count(*)  from gbd.output_injury_single_year_v14778_incidence_dd_4401 where measure_id=6 and year_id in (2018) group by year_id, location_id order by year_id, location_id;
1128 rows in set (3.001 sec)

The same query runs unacceptably slowly with INTO OUTFILE:

MariaDB [(none)]> select year_id, location_id, count(*) into outfile '/tmp/test1.csv' from gbd.output_injury_single_year_v14778_incidence_dd_4401 where measure_id=6 and year_id in (2018) group by year_id, location_id order by year_id, location_id;
Query OK, 1128 rows affected (19 min 22.171 sec)

Workaround

mariadb-dba@multi-unmanaged-db-p09:/var/lib/mysql$ time mysql -q -e "select year_id, location_id, count(*) from gbd.output_injury_single_year_v14778_incidence_dd_4401 where measure_id=6 and year_id in (2018) group by year_id, location_id order by year_id, location_id" > /tmp/test6.csv
 
real	0m3.030s

Reproduction

DB="test"
TABLE="delete_test"
RECORDS=12500000
MAX_VALUE=100
mariadb -e "create database if not exists $DB;"
mariadb $DB -e "DROP TABLE IF EXISTS $TABLE";
mariadb $DB -e "CREATE TABLE $TABLE( number1 int, group_name CHAR(2), number2 DECIMAL(12, 2), var varchar(1), column3 int, column4 int, column5 int, column6 int, column7 int, column8 int, column9 int, column10 int, column11 int, column12 int, column13 int, column14 int, column15 int, column16 int, column17 int, column18 int, column19 int, column20 int, column21 int, column22 int, column23 int, column24 int, column25 int, column26 int, column27 int, column28 int, column29 int, column30 int, column31 int, column32 int, column33 int, column34 int, column35 int, column36 int, column37 int, column38 int, column39 int, column40 int, column41 int, column42 int, column43 int, column44 int, column45 int, column46 int, column47 int, column48 int, column49 int, column50 int) ENGINE=COLUMNSTORE;"
time mariadb $DB -e "INSERT INTO $TABLE SELECT ROUND(RAND() * $MAX_VALUE, 0), CAST(ROUND(RAND() * 10, 0) AS CHAR), ROUND(RAND() * 1000000, 2), substring(MD5(RAND()),1,1), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0), ROUND(RAND() * $MAX_VALUE, 0) FROM seq_1_to_$RECORDS;"
 
 
time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" | wc -l
rm -rf /tmp/test1.csv;
time mariadb $DB -e "select number1, count(*) into outfile '/tmp/test1.csv' from $TABLE where number2>=50 group by number1;";
wc -l /tmp/test1.csv; rm -rf /tmp/test1.csv;
time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" > /tmp/test1.csv;
wc -l /tmp/test1.csv; 

Reproduction results

  • Note: 102 vs. 101 lines, because the client output includes a header row.

    [root@ip-172-31-21-130 centos]# time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" | wc -l
    102
     
    real	0m0.494s               * expected
    user	0m0.001s
    sys	0m0.009s
    [root@ip-172-31-21-130 centos]# rm -rf /tmp/test1.csv;
    [root@ip-172-31-21-130 centos]# time mariadb $DB -e "select number1, count(*) into outfile '/tmp/test1.csv' from $TABLE where number2>=50 group by number1;";
     
    real	0m22.484s              * slow (in this repro it is ~40x slower; the customer case is ~385x slower)
    user	0m0.004s
    sys	0m0.004s
    [root@ip-172-31-21-130 centos]# wc -l /tmp/test1.csv; rm -rf /tmp/test1.csv;
    101 /tmp/test1.csv
    [root@ip-172-31-21-130 centos]# time mariadb $DB -e "select number1, count(*) from $TABLE where number2>=50 group by number1;" > /tmp/test1.csv;
     
    real	0m0.296s                * expected
    user	0m0.010s
    sys	0m0.000s
    [root@ip-172-31-21-130 centos]# wc -l /tmp/test1.csv;
    102 /tmp/test1.csv
    



 Comments   
Comment by Roman [ 2023-11-16 ]

For this particular case (low output cardinality over a huge dataset) it should be about a week of work. Given that the `subselect RAP` works, this must be a Select Handler creation failure. In that case MDB fetches the data from MCS and runs the aggregation on its own, which is slow.

Generated at Thu Feb 08 02:58:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.