StorageManager has a bug that allows it to create files with the wrong length in the filename. An assertion in the Sync class checks that length.
cpimport causes the initial problem when it creates the rollback files. The assertion fails if the Sync class starts synchronizing those files before cpimport can finish and do the rollback.
This is related to MCOL-3459, but it is the minimal fix needed for this specific problem in 1.4.4.
Patrick LeBlanc (Inactive)
added a comment - I have some scripts to reproduce it in the image sky is using. They should work equally well for all versions up to the fix, though, because the relevant code hasn't changed.
I'll attach them later if needed. Basically: create a wide table, give cpimport the -e0 parameter to make it roll back on any errors it finds, give it some bad data to import, and check whether SM is still running. Attempt imports up to 100x until SM crashes.
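The loop described above can be sketched roughly as follows. This is not the attached script, only a minimal illustration. The database, table, and file names are taken from the log output later in this issue. The `StorageManager` process name and the use of `pgrep` to check for it are assumptions, and the function names are hypothetical.

```python
import subprocess
from typing import Callable, Optional


def run_import(db: str = "mytest", table: str = "widetable",
               data_file: str = "/root/t.txt") -> int:
    """Run one cpimport with -e0 so that any bad row forces a rollback.

    The db/table/file defaults mirror the log output in this issue.
    """
    result = subprocess.run(["cpimport", db, table, data_file, "-e0"],
                            capture_output=True, text=True)
    return result.returncode


def storagemanager_alive() -> bool:
    """Return True if a StorageManager process is found.

    Assumption: the process is named 'StorageManager' and pgrep is available.
    """
    result = subprocess.run(["pgrep", "-x", "StorageManager"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def repro_loop(import_once: Callable[[], object],
               sm_alive: Callable[[], bool],
               max_iterations: int = 100) -> Optional[int]:
    """Import repeatedly; return the iteration at which SM died, else None."""
    for iteration in range(1, max_iterations + 1):
        import_once()  # expected to fail and roll back on the bad data
        if not sm_alive():
            return iteration
    return None
```

On a test system this would be called as `repro_loop(run_import, storagemanager_alive)`. Passing the two callables in makes it possible to dry-run the loop logic without cpimport installed.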
Patrick LeBlanc (Inactive)
added a comment - added the scripts, data, etc. The script is a WIP, may still need tweaking to be perfect. I was watching the output & using judgement when I first reproduced the problem with this.
Daniel Lee (Inactive)
added a comment - Build verified: 1.4.4-1 (Jenkins 20200522, RC #3)
Test:
With a table of 200 columns and a dataset of 100 rows plus about 15 empty lines (which cause errors in cpimport), run cpimport in a loop 100 times.
Unable to reproduce the issue in 1.4.3-4 when running in my local VM and using AWS S3 storage, probably because of the network speed. I was able to reproduce the issue using local S3 storage at iteration 6.
2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 1) Stopped parsing Tables. BulkLoad::parse() responding to job termination
2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 2) Stopped parsing Tables. BulkLoad::parse() responding to job termination
2020-05-26 22:25:00 (13938) INFO : Table mytest.widetable (OID-3050) was not successfully loaded. Rolling back.
2020-05-26 22:25:00 (13938) ERR : Error rolling back table mytest.widetable; Error writing compressed column headers to DB for: OID-3062; DbRoot-1; partition-0; segment-0; Error writing to a database file. [1057]
2020-05-26 22:25:00 (13938) INFO : Bulk load completed, total run time : 4.37072 seconds
Error in loading job data
iteration 7
Locale is : C
Using table OID 3050 as the default JOB ID
Input file(s) will be read from : /root
Job description file : /var/lib/columnstore/data/bulk/tmpjob/3050_D20200526_T222500_S949540_Job_3050.xml
Log file for this job: /var/lib/columnstore/data/bulk/log/Job_3050.log
2020-05-26 22:25:01 (13997) INFO : successfully loaded job file /var/lib/columnstore/data/bulk/tmpjob/3050_D20200526_T222500_S949540_Job_3050.xml
2020-05-26 22:25:01 (13997) INFO : Job file loaded, run time for this step : 0.143359 seconds
2020-05-26 22:25:01 (13997) INFO : PreProcessing check starts
2020-05-26 22:25:01 (13997) INFO : input data file /root/t.txt
2020-05-26 22:25:11 (13997) ERR : Unable to acquire lock for table mytest.widetable; OID-3050; table currently locked by process-cpimport.bin (pm1); pid-13938; session-1; txn-1 [1203]
Error in loading job data
Repeated the same test in 1.4.4-1 and all 100 iterations finished, with each failed cpimport job rolled back, as expected.
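For anyone rebuilding the test data, the dataset described in the test above (100 rows plus about 15 empty lines, for a 200-column table) could be generated along these lines. This is a sketch, not the attached data. It assumes integer columns, the `|` field delimiter, and that the empty lines are appended at the end of the file. None of these details are stated in this issue, and the function names are hypothetical.

```python
from typing import List


def make_dataset(n_cols: int = 200, n_rows: int = 100,
                 n_empty: int = 15, delimiter: str = "|") -> List[str]:
    """Build n_rows valid rows followed by n_empty empty lines.

    Assumption: every column accepts an integer value.
    """
    lines = [
        delimiter.join(str(row * n_cols + col) for col in range(n_cols))
        for row in range(n_rows)
    ]
    # The empty lines are the "bad data" that make cpimport report errors.
    lines.extend([""] * n_empty)
    return lines


def write_dataset(path: str, lines: List[str]) -> None:
    """Write the dataset with one record per line."""
    with open(path, "w", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")
```

`write_dataset("/root/t.txt", make_dataset())` would produce a file at the path shown in the log output above.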