  1. MariaDB Server
  2. MDEV-9384

Key File Corruption for TokuDB


Details

    Description

      We have started to see almost daily corruption of one of our core tables, which use TokuDB with FAST compression enabled. These tables are manually partitioned by day, and 180-200 million rows are added each day. A small percentage of these rows (< 5%) are deleted, and another portion (< 10%) are updated one or more times throughout the day.

      Here is an example table:

      *************************** 8. row ***************************
                 Name: XXXX_20160107
               Engine: TokuDB
              Version: 10
           Row_format: Dynamic
                 Rows: 188682009
       Avg_row_length: 224
          Data_length: 42396852041
      Max_data_length: 9223372036854775807
         Index_length: 13764087186
            Data_free: 19841155072
       Auto_increment: 228709329
          Create_time: 2016-01-06 00:00:01
          Update_time: 2016-01-08 23:46:54
           Check_time: NULL
            Collation: utf8_general_ci
             Checksum: NULL
       Create_options: `COMPRESSION`=TOKUDB_FAST
              Comment:
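
      To put the figures above in proportion, here is a small sketch of the arithmetic. The numbers are copied from the status output; how TokuDB accounts for Data_length and Data_free (for example, whether they reflect compressed or uncompressed sizes) is not stated in this report, so the ratios are descriptive only.

```python
# Figures copied from the table status output above.
rows = 188_682_009
data_length = 42_396_852_041
index_length = 13_764_087_186
data_free = 19_841_155_072

GIB = 1024 ** 3


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


# Derived quantities; no interpretation of TokuDB internals is implied.
bytes_per_row = data_length / rows        # compare with Avg_row_length
index_ratio = index_length / data_length  # index size relative to data
free_ratio = data_free / data_length      # Data_free relative to data

print(f"data:  {gib(data_length):.1f} GiB ({bytes_per_row:.1f} bytes/row)")
print(f"index: {gib(index_length):.1f} GiB ({index_ratio:.1%} of data)")
print(f"free:  {gib(data_free):.1f} GiB ({free_ratio:.1%} of data)")
```

      The bytes-per-row figure, truncated to an integer, matches the reported Avg_row_length of 224.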

      At some point the table becomes corrupt: selecting from a certain sequence of rows fails with the following error:

      ERROR 1034 (HY000): Incorrect key file for table 'XXXXX_20160108'; try to repair it
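
      One way to narrow down which rows are affected would be to bisect the key range. The sketch below is only an illustration, not something that was run for this report. It assumes the table has an integer auto-increment key, and it takes a hypothetical `probe(lo, hi)` callable that returns True when a range reads cleanly (for instance by running a SELECT over that key range and catching error 1034).

```python
from typing import Callable, List, Tuple

# probe(lo, hi) is a hypothetical callable supplied by the caller.
# It should return True if every row with lo <= key <= hi can be read,
# and False if the read fails (e.g. with ERROR 1034).
Probe = Callable[[int, int], bool]


def find_bad_ranges(lo: int, hi: int, probe: Probe,
                    min_span: int = 1000) -> List[Tuple[int, int]]:
    """Return inclusive key ranges inside [lo, hi] that fail to read.

    Failing ranges are split in half until they cover at most
    min_span keys, so the result pinpoints where the errors occur.
    """
    if lo > hi or probe(lo, hi):
        return []                      # empty range or reads cleanly
    if hi - lo + 1 <= min_span:
        return [(lo, hi)]              # small enough to report as-is
    mid = (lo + hi) // 2
    return (find_bad_ranges(lo, mid, probe, min_span)
            + find_bad_ranges(mid + 1, hi, probe, min_span))


if __name__ == "__main__":
    # Stand-in probe for demonstration: pretend these keys are unreadable.
    bad_keys = {5000, 5001, 123456}

    def fake_probe(lo: int, hi: int) -> bool:
        return not any(lo <= k <= hi for k in bad_keys)

    for start, end in find_bad_ranges(1, 1_000_000, fake_probe):
        print(f"unreadable somewhere in keys {start}..{end}")
```

      On a table of this size each probe is itself a large scan, so a coarse `min_span` is probably the practical choice.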

      As I understand it, a TokuDB table cannot be repaired; the only option is to restore from backups. This server isn't currently running with a slave, so I can't verify whether a slave would have been corrupted in the same way.

      The database runs on a pair of mirrored SSDs and holds only up to 8 days' worth of these tables, plus some views pointing to them. Also of note: we have had this problem on both 10.1.9 and 10.1.10, and on two different servers with different hardware (different models of CPU and SSD), so it doesn't appear to be a hardware issue. One of the servers had ECC memory, and one didn't.

      I am going to turn off deletions on this table and just flag rows as deleted, to see whether that changes the rate of occurrence or fixes it completely, which would point to row deletion as the problem. My only other anecdotal evidence is that this seems to have started only after we began doing relatively heavy updates of the data. I don't think it ever occurred when we were only doing inserts, selects and deletes.
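
      For illustration, flagging rows instead of deleting them could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; it is not TokuDB, and the table name `events` and the columns `payload` and `is_deleted` are hypothetical, not taken from the real schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"
    " payload TEXT,"
    " is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])


def soft_delete(db: sqlite3.Connection, row_id: int) -> None:
    """Mark a row as deleted instead of issuing a DELETE."""
    db.execute("UPDATE events SET is_deleted = 1 WHERE id = ?", (row_id,))


def live_ids(db: sqlite3.Connection) -> list:
    """Return ids of rows that have not been flagged as deleted."""
    cur = db.execute("SELECT id FROM events WHERE is_deleted = 0 ORDER BY id")
    return [row[0] for row in cur]


soft_delete(conn, 2)
print(live_ids(conn))
```

      Note that this turns every delete into an update, so the test removes DELETE statements from the workload but adds to the update volume.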

      Because it happens relatively infrequently (we have had 5 tables corrupted this way over the last week or so), at random, and on such a large dataset, I am not sure how else to debug this. Please let me know what other details would help in resolving this issue.

          People

            Unassigned Unassigned
            johnbarratt John Barratt