[MCOL-4021] rollback causing storagemanager to crash Created: 2020-05-26 Updated: 2021-04-19 Resolved: 2020-05-26 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | Storage Manager |
| Affects Version/s: | 1.4.3 |
| Fix Version/s: | 1.4.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick LeBlanc (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
| Issue Links: |
| Description |
|
A bug in storagemanager allows it to create files with the wrong length in the filename. The Sync class has an assertion that checks this length. cpimport triggers the initial problem when it creates the rollback files. The assertion fails if the Sync class starts synchronizing those files before cpimport can finish and perform the rollback. This is related to |
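To make the failing invariant concrete, here is a minimal sketch of a check that compares a length embedded in a file name against the file's real size. The naming scheme `<id>_<length>_<source>` and both function names are hypothetical, chosen only for illustration; this is not the Sync class's actual code or storagemanager's actual naming format.

```python
import os
import tempfile


def length_from_name(object_name: str) -> int:
    """Extract the length field from a name shaped like '<id>_<length>_<source>'.

    The naming scheme is hypothetical and used only for illustration.
    """
    # Split at most twice so underscores in the source part are preserved.
    return int(object_name.split("_", 2)[1])


def name_matches_size(path: str) -> bool:
    """Return True if the length recorded in the file name equals the real size."""
    return length_from_name(os.path.basename(path)) == os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A file holding 10 bytes but named as if it held 4: this mismatch is
        # the kind of condition a length assertion would trip on.
        bad = os.path.join(tmp, "abc_4_rollback.dat")
        with open(bad, "wb") as f:
            f.write(b"x" * 10)
        print(name_matches_size(bad))
```

If a background synchronizer runs a check like this while another process is still rewriting or rolling back the file, the name and the size can disagree, which matches the timing dependence described above.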
| Comments |
| Comment by Patrick LeBlanc (Inactive) [ 2020-05-26 ] |
|
I have some scripts to reproduce it in the image sky is using. They should work equally well on all versions up to the fix, because the relevant code hasn't changed. I'll attach them later if needed. Basically: create a wide table, give cpimport the -e0 parameter to make it roll back on any error it finds, give it some bad data to import, then check whether SM is still running. Repeat the import up to 100 times, until SM crashes. |
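The steps above can be sketched as a small loop. This is not the attached script, only a rough outline under assumptions: the positional argument order passed to cpimport, the process name `StorageManager`, the use of `pgrep` for the liveness check, and all function names here are illustrative and may need adjusting for a real install.

```python
import subprocess
from typing import Callable


def run_cpimport(db: str, table: str, data_file: str) -> int:
    """Run one import with -e0 so that any bad row aborts the job and forces a rollback."""
    # Assumption: database, table and load file are given as positional arguments.
    return subprocess.run(["cpimport", db, table, data_file, "-e0"]).returncode


def storagemanager_alive(process_name: str = "StorageManager") -> bool:
    """Check for a running process by exact name via pgrep (the name is an assumption)."""
    result = subprocess.run(["pgrep", "-x", process_name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def reproduce(import_once: Callable[[], int],
              is_alive: Callable[[], bool],
              attempts: int = 100) -> int:
    """Return the 1-based iteration at which the service was found dead, or 0 if it survived."""
    for i in range(1, attempts + 1):
        import_once()  # expected to fail and roll back because of the bad data
        if not is_alive():
            return i
    return 0
```

A run might look like `reproduce(lambda: run_cpimport("test", "wide", "bad.tbl"), storagemanager_alive)`, where the database, table, and file names are placeholders. Passing the import and liveness check as callables keeps the loop itself testable without a ColumnStore install.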
| Comment by Patrick LeBlanc (Inactive) [ 2020-05-26 ] |
|
Added the scripts, data, etc. The script is a WIP and may still need tweaking to be perfect. I was watching the output and using judgement when I first reproduced the problem with it. |
| Comment by Daniel Lee (Inactive) [ 2020-05-26 ] |
|
Build verified: 1.4.4-1 (Jenkins 20200522, RC #3)
Test: A table of 200 columns and a dataset of 100 rows plus about 15 empty lines, which cause errors in cpimport. Run cpimport in a loop 100 times.
Unable to reproduce the issue in 1.4.3-4 when running in my local VM with AWS S3 storage, probably because of the network speed. I was able to reproduce the issue using local S3 storage, at iteration 6:
2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 1) Stopped parsing Tables. BulkLoad::parse() responding to job termination
Error in loading job data
Using table OID 3050 as the default JOB ID
Error in loading job data
Repeated the same test in 1.4.4-1 and all 100 iterations finished, with each failed cpimport job rolled back, as expected. |
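For anyone rebuilding a similar dataset, the sketch below generates a load file with the shape described in the test: 200 fields per row, 100 rows, and 15 blank lines mixed in. It is not the data that was attached. The integer field values, the `|` field delimiter, the random placement of blank lines, and the function name are all assumptions made for illustration.

```python
import random


def make_dataset(columns: int = 200, rows: int = 100, empty_lines: int = 15,
                 delimiter: str = "|", seed: int = 0) -> str:
    """Build import text with `rows` valid lines and `empty_lines` blank lines mixed in."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    lines = [
        delimiter.join(str(r * columns + c) for c in range(columns))
        for r in range(rows)
    ]
    for _ in range(empty_lines):
        # Insert a blank line at a random position; each one is meant to be a bad row.
        lines.insert(rng.randrange(len(lines) + 1), "")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("bad_data.tbl", "w") as f:  # hypothetical output file name
        f.write(make_dataset())
```

With cpimport run under -e0, a single blank line should already be enough to fail the job, so the exact number and placement of empty lines is unlikely to matter much.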