Build verified: 1.4.4-1 (Jenkins 20200522, RC #3)
Test:
Created a 200-column table and a dataset of 100 rows plus about 15 empty lines, which cause errors in cpimport. Ran cpimport in a loop 100 times.
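A minimal sketch of how such a bad input file could be generated. The dimensions (200 columns, 100 rows, ~15 empty lines) follow the test description; the pipe delimiter is cpimport's default field separator, and the function name is mine, not part of the original test scripts.

```python
import random

NUM_COLS = 200   # matches the 200-column test table
NUM_ROWS = 100
NUM_EMPTY = 15   # empty lines that trip cpimport's error handling


def make_bad_input():
    """Build cpimport input: 100 data rows with ~15 empty lines mixed in."""
    lines = []
    for r in range(NUM_ROWS):
        # pipe-delimited rows, one value per column
        lines.append("|".join(str(r * NUM_COLS + c) for c in range(NUM_COLS)))
    for _ in range(NUM_EMPTY):
        # scatter empty lines at random positions
        lines.insert(random.randrange(len(lines) + 1), "")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    data = make_bad_input()
    print(f"{len(data.splitlines())} lines generated")
```

Writing the result to the load file (e.g. the /root/t.txt seen in the log) and pointing cpimport at it reproduces the mixed good/bad input.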
Unable to reproduce the issue in 1.4.3-4 when running in my local VM against AWS S3 storage, probably because of the network speed. I was able to reproduce it using local S3 storage, at iteration 6:
2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 1) Stopped parsing Tables. BulkLoad::parse() responding to job termination
2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 2) Stopped parsing Tables. BulkLoad::parse() responding to job termination
2020-05-26 22:25:00 (13938) INFO : Table mytest.widetable (OID-3050) was not successfully loaded. Rolling back.
2020-05-26 22:25:00 (13938) ERR : Error rolling back table mytest.widetable; Error writing compressed column headers to DB for: OID-3062; DbRoot-1; partition-0; segment-0; Error writing to a database file. [1057]
2020-05-26 22:25:00 (13938) INFO : Bulk load completed, total run time : 4.37072 seconds
Error in loading job data
iteration 7
Locale is : C
Using table OID 3050 as the default JOB ID
Input file(s) will be read from : /root
Job description file : /var/lib/columnstore/data/bulk/tmpjob/3050_D20200526_T222500_S949540_Job_3050.xml
Log file for this job: /var/lib/columnstore/data/bulk/log/Job_3050.log
2020-05-26 22:25:01 (13997) INFO : successfully loaded job file /var/lib/columnstore/data/bulk/tmpjob/3050_D20200526_T222500_S949540_Job_3050.xml
2020-05-26 22:25:01 (13997) INFO : Job file loaded, run time for this step : 0.143359 seconds
2020-05-26 22:25:01 (13997) INFO : PreProcessing check starts
2020-05-26 22:25:01 (13997) INFO : input data file /root/t.txt
2020-05-26 22:25:11 (13997) ERR : Unable to acquire lock for table mytest.widetable; OID-3050; table currently locked by process-cpimport.bin (pm1); pid-13938; session-1; txn-1 [1203]
Error in loading job data
Repeated the same test on 1.4.4-1: all 100 iterations finished, with each failed cpimport job rolled back as expected.
I have some scripts to reproduce it in the image sky is using. They should work equally well for all versions up to the fix, because the relevant code hasn't changed.
I'll attach them later if needed. Basically: create a wide table, give cpimport the -e0 parameter so it rolls back on any error it finds, feed it some bad data to import, and check whether SM is still running. Attempt imports up to 100 times until SM crashes.
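The loop described above could be sketched roughly as follows. This is my own sketch, not the attached scripts: the schema, table, and data-file names are the placeholders from the log, and the StorageManager process name checked by pgrep is an assumption.

```shell
#!/bin/sh
# Hypothetical repro loop: import bad data with -e0 until SM crashes.
SCHEMA=mytest
TABLE=widetable
DATA=/root/t.txt
MAX_ITER=100

if command -v cpimport >/dev/null 2>&1; then
    i=1
    while [ "$i" -le "$MAX_ITER" ]; do
        echo "iteration $i"
        # -e0: abort and roll back on the first bad row
        cpimport -e0 "$SCHEMA" "$TABLE" "$DATA"
        # if StorageManager died, the crash reproduced
        if ! pgrep -x StorageManager >/dev/null 2>&1; then
            echo "StorageManager crashed at iteration $i"
            break
        fi
        i=$((i + 1))
    done
else
    echo "cpimport not found; run this on a ColumnStore PM node"
fi
```

On a fixed build the loop should run all 100 iterations with SM still alive at the end, matching the 1.4.4-1 result above.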