MariaDB ColumnStore

MCOL-4021: rollback causing storagemanager to crash

Details

    Description

      There is a bug in storagemanager that can cause it to create files with the wrong length in the filename. There is an assertion in the Sync class that checks the length.

      cpimport triggers the initial problem when it creates the rollback files. The assertion fires if the Sync class starts synchronizing those files before cpimport can finish and perform the rollback.

      This is related to MCOL-3459, but this is the minimal fix needed to address this specific problem for 1.4.4.
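
      As an illustration, the crash window can be observed by polling the storagemanager process while an import runs. This is only a sketch; the process name "StorageManager" is an assumption and is not taken from this ticket.

        # Hypothetical watcher: poll the StorageManager process once a second
        # and report when it disappears (e.g. after the Sync assertion fires).
        # The process name "StorageManager" is an assumption.
        while pgrep -x StorageManager > /dev/null; do
            sleep 1
        done
        echo "StorageManager exited at $(date)"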

      Attachments

        1. bad-data.txt
          0.0 kB
        2. create.sql
          1.0 kB
        3. exploit.sh
          0.4 kB


          Activity

            pleblanc Patrick LeBlanc (Inactive) added a comment:

            I have some scripts to reproduce it in the image sky is using. They should work equally well for all versions up to the fix, because the relevant code hasn't changed.

            I'll attach them later if needed. Basically: create a wide table, give cpimport the -e0 parameter so it rolls back on any error it finds, feed it some bad data to import, and check whether SM (storagemanager) is still running. Attempt imports up to 100 times until SM crashes. A sketch of that loop follows below.
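
            For reference, a minimal sketch of such a reproduction loop. The database/table names are taken from the test log further down, and the wide table is assumed to come from the attached create.sql and the bad input from the attached bad-data.txt; the client invocation and the StorageManager process name are assumptions. This is not the exact content of exploit.sh.

              #!/bin/bash
              # Hypothetical reproduction loop (not the attached exploit.sh).
              # Database/table names, file paths, client invocation, and the
              # StorageManager process name are assumptions.
              DB=mytest
              TABLE=widetable
              DATA=bad-data.txt

              mariadb "$DB" < create.sql        # create the wide table

              for i in $(seq 1 100); do
                  echo "iteration $i"
                  # -e 0: abort and roll back on the first rejected row
                  cpimport -e 0 "$DB" "$TABLE" "$DATA"
                  if ! pgrep -x StorageManager > /dev/null; then
                      echo "StorageManager is no longer running after iteration $i"
                      break
                  fi
              done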

            pleblanc Patrick LeBlanc (Inactive) added a comment:

            Added the scripts, data, etc. The script is a work in progress and may still need tweaking to be perfect; I was watching the output and using judgement when I first reproduced the problem with this.

            dleeyh Daniel Lee (Inactive) added a comment:

            Build verified: 1.4.4-1 (Jenkins 20200522, RC #3)

            Test:

            With a ColumnStore table of 200 columns and a dataset of 100 rows, plus about 15 empty lines that cause errors in cpimport, run cpimport in a loop 100 times.
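
            A sketch of how such a data file could be generated, assuming cpimport's default '|' delimiter and 200 numeric fields per row; the field values are made up, and /root/t.txt is the path shown in the log below.

              # Hypothetical data generator: 100 valid pipe-delimited rows for a
              # 200-column table, followed by about 15 empty lines that cpimport
              # rejects. Delimiter and field values are assumptions.
              {
                  for i in $(seq 1 100); do
                      seq "$i" "$((i + 199))" | paste -sd '|' -   # 200 fields per row
                  done
                  for i in $(seq 1 15); do
                      echo ""                                     # empty line -> rejected row
                  done
              } > /root/t.txt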

            Unable to reproduce the issue in 1.4.3-4 when running in my local VM with AWS S3 storage, probably because of the network speed. I was able to reproduce it using local S3 storage at iteration 6.

            2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 1) Stopped parsing Tables. BulkLoad::parse() responding to job termination
            2020-05-26 22:24:59 (13938) INFO : Bulkload Parse (thread 2) Stopped parsing Tables. BulkLoad::parse() responding to job termination
            2020-05-26 22:25:00 (13938) INFO : Table mytest.widetable (OID-3050) was not successfully loaded. Rolling back.
            2020-05-26 22:25:00 (13938) ERR : Error rolling back table mytest.widetable; Error writing compressed column headers to DB for: OID-3062; DbRoot-1; partition-0; segment-0; Error writing to a database file. [1057]
            2020-05-26 22:25:00 (13938) INFO : Bulk load completed, total run time : 4.37072 seconds

            Error in loading job data
            iteration 7
            Locale is : C

            Using table OID 3050 as the default JOB ID
            Input file(s) will be read from : /root
            Job description file : /var/lib/columnstore/data/bulk/tmpjob/3050_D20200526_T222500_S949540_Job_3050.xml
            Log file for this job: /var/lib/columnstore/data/bulk/log/Job_3050.log
            2020-05-26 22:25:01 (13997) INFO : successfully loaded job file /var/lib/columnstore/data/bulk/tmpjob/3050_D20200526_T222500_S949540_Job_3050.xml
            2020-05-26 22:25:01 (13997) INFO : Job file loaded, run time for this step : 0.143359 seconds
            2020-05-26 22:25:01 (13997) INFO : PreProcessing check starts
            2020-05-26 22:25:01 (13997) INFO : input data file /root/t.txt
            2020-05-26 22:25:11 (13997) ERR : Unable to acquire lock for table mytest.widetable; OID-3050; table currently locked by process-cpimport.bin (pm1); pid-13938; session-1; txn-1 [1203]

            Error in loading job data

            Repeated the same test on 1.4.4-1 and all 100 iterations finished, with each failed cpimport job rolled back as expected.


            People

              dleeyh Daniel Lee (Inactive)
              pleblanc Patrick LeBlanc (Inactive)