Details

    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.1.9, 10.1.10
    • Fix Version/s: None
    • Labels:
    • Environment:
      Ubuntu 14.04.3 LTS

      Description

      We have started to see almost daily corruption of one of our core tables using TokuDB with FAST compression enabled. These tables are manually partitioned by day, and have 180-200 million rows added each day. A small percentage of these rows are deleted (< 5%), and another portion (< 10%) are updated one or more times through the day.
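      A hypothetical sketch of how one of these daily tables might be defined; the real column list is not given in the report, so everything here except the engine, charset, and compression option (which match the SHOW TABLE STATUS output in the report) is an assumption:

```sql
-- Sketch only: table/column names other than the XXXX_YYYYMMDD pattern
-- are assumptions, not taken from the report.
CREATE TABLE XXXX_20160107 (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- assumed surrogate key
    event_time DATETIME NOT NULL,                        -- assumed
    payload    VARCHAR(255),                             -- assumed
    PRIMARY KEY (id),
    KEY (event_time)
) ENGINE=TokuDB
  DEFAULT CHARSET=utf8
  `COMPRESSION`=TOKUDB_FAST;
```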

      Here is an example table:

      *************************** 8. row ***************************
                 Name: XXXX_20160107
               Engine: TokuDB
              Version: 10
           Row_format: Dynamic
                 Rows: 188682009
       Avg_row_length: 224
          Data_length: 42396852041
      Max_data_length: 9223372036854775807
         Index_length: 13764087186
            Data_free: 19841155072
       Auto_increment: 228709329
          Create_time: 2016-01-06 00:00:01
          Update_time: 2016-01-08 23:46:54
           Check_time: NULL
            Collation: utf8_general_ci
             Checksum: NULL
       Create_options: `COMPRESSION`=TOKUDB_FAST
              Comment:

      We find that at some point the table becomes corrupt, and that trying to select from a certain range of rows fails with the following error:

      ERROR 1034 (HY000): Incorrect key file for table 'XXXXX_20160108'; try to repair it

      As I understand it, it isn't possible to repair a TokuDB table; the only option is to restore from backups. This server isn't currently running with a slave, so I can't verify whether a slave would have been corrupted in the same way.

      The database is running on a pair of mirrored SSDs, and only holds up to 8 days' worth of these tables, plus some views pointing to them. Also of note, we have had this problem on both 10.1.9 and 10.1.10, and on two different servers running different hardware, with different models of CPU and SSD, so it doesn't appear to be a hardware issue. One of the servers had ECC memory, and one didn't.

      I am going to turn off deletions on this table and instead just flag rows as deleted, to see whether that changes the rate of occurrence, or fixes it completely and points to deletion of rows as the problem. My only other anecdotal evidence is that this seems to have started only after we began doing relatively heavy updates of the data; I don't think it ever occurred when we were only doing inserts, selects and deletes.
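      The "flag instead of delete" change described above could look something like the following; this is a sketch, and the `is_deleted` column name and the example `id` value are assumptions:

```sql
-- Add a soft-delete flag (column name is an assumption).
ALTER TABLE XXXX_20160107
    ADD COLUMN is_deleted TINYINT(1) NOT NULL DEFAULT 0;

-- Instead of:  DELETE FROM XXXX_20160107 WHERE id = 12345;
UPDATE XXXX_20160107 SET is_deleted = 1 WHERE id = 12345;

-- Readers then filter flagged rows out:
SELECT * FROM XXXX_20160107 WHERE is_deleted = 0;
```

      If flagging stops the corruption while the insert/update workload stays the same, that would narrow the trigger down to row deletion.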

      Given that it is happening only relatively infrequently (we have had 5 tables corrupted this way over the last week or so), at random, and on such a large dataset, I am not sure how else to debug this further. Please let me know what other details would be of assistance in resolving this issue.

            People

            • Assignee:
              elenst Elena Stepanova
            • Reporter:
              johnbarratt John Barratt
            • Votes:
              1
            • Watchers:
              6
