Details

    Description

      We have started to see almost daily corruption of one of our core tables using TokuDB with FAST compression enabled. These tables are manually partitioned by day, and have 180-200 million rows added each day. A small percentage of these rows are deleted (< 5%), and another portion (< 10%) are updated one or more times through the day.

      Here is an example table:

      *************************** 8. row ***************************
                 Name: XXXX_20160107
               Engine: TokuDB
              Version: 10
           Row_format: Dynamic
                 Rows: 188682009
       Avg_row_length: 224
          Data_length: 42396852041
      Max_data_length: 9223372036854775807
         Index_length: 13764087186
            Data_free: 19841155072
       Auto_increment: 228709329
          Create_time: 2016-01-06 00:00:01
          Update_time: 2016-01-08 23:46:54
           Check_time: NULL
            Collation: utf8_general_ci
             Checksum: NULL
       Create_options: `COMPRESSION`=TOKUDB_FAST
              Comment:
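
      For reference, the daily tables are created along the following lines. The column definitions here are only illustrative placeholders (not our real schema); the engine, charset and per-table compression attribute are written as they are reported in Create_options above.

      CREATE TABLE XXXX_20160107 (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- placeholder columns only
        ts      DATETIME NOT NULL,
        payload VARCHAR(255),
        PRIMARY KEY (id),
        KEY idx_ts (ts)
      ) ENGINE=TokuDB DEFAULT CHARSET=utf8
        `COMPRESSION`=TOKUDB_FAST;  -- compression attribute as shown in Create_options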

      We find that at some point the table becomes corrupt, and that trying to select from a certain sequence of rows fails and reports the following error:

      ERROR 1034 (HY000): Incorrect key file for table 'XXXXX_20160108'; try to repair it
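
      For illustration, the failing reads are ranged selects of roughly this shape (the id column name and the key range are placeholders, not our real schema or values):

      -- placeholder query; the exact key range that fails varies from table to table
      SELECT * FROM XXXXX_20160108 WHERE id BETWEEN 150000000 AND 150100000;
      -- ERROR 1034 (HY000): Incorrect key file for table 'XXXXX_20160108'; try to repair it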

      As I understand it, it isn't possible to repair a TokuDB table; the only option is to restore from backups. This server isn't currently running with a slave, so I can't verify whether a slave would have been corrupted in the same way.

      The database is running on a pair of mirrored SSDs, and only holds up to 8 days' worth of these tables, plus some views pointing to those tables. Also of note, we have had this problem on 10.1.9 and 10.1.10, and on two different servers running different hardware, with different models of CPU and SSDs, so it doesn't appear to be a hardware issue. One of the servers had ECC memory and one didn't.

      I am going to turn off deletions on this table and just flag rows as deleted instead, to see if that changes the rate of occurrence, or fixes it completely and points to deletion of rows as the problem. My only other anecdotal evidence is that this seems to have started only after we began doing relatively heavy updates of the data; I don't think it ever occurred when we were only doing inserts, selects and deletes.
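
      The planned change amounts to replacing the hard deletes with a soft-delete flag, along these lines (the is_deleted column name is just a placeholder for whatever flag we add):

      -- one-off: add a soft-delete flag (placeholder column name)
      ALTER TABLE XXXX_20160107 ADD COLUMN is_deleted TINYINT(1) NOT NULL DEFAULT 0;

      -- instead of:  DELETE FROM XXXX_20160107 WHERE id = 12345;
      UPDATE XXXX_20160107 SET is_deleted = 1 WHERE id = 12345;  -- placeholder id value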

      Given that this is happening relatively infrequently (we have had 5 tables corrupted this way over the last week or so), seemingly at random, and on such a large dataset, I am not sure how else to try to debug this further or what other information may be of help, but please let me know what other details may be of assistance in resolving this issue.

      Attachments

        Activity

          elenst Elena Stepanova added a comment -

          hayden,

          Yes, unfortunately the bug has had no action for over half a year because we couldn't create a repeatable test case. Can you provide one?
          If we have it, we can check it against upstream TokuDB and if the problem is repeatable there, re-file the bug for them. If it turns out to be MariaDB specific, our development will take care of it.

          hayden Hayden Clark added a comment -

          It's a bit tricky. The problem happens occasionally, under heavy load, with very large datasets. Even if scripting a test case were possible, I have no idea how sensitive the bug is to the exact conditions.

          This is one of those really hard bugs to spot. The actual error may lie in the table or its indexes for days before a query hits the bad patch.

          How can we proceed with this? What logs can I turn on to diagnose this in the future?
          Without a fix, I'll have to downgrade to InnoDB, and that will reduce performance.

          johnbarratt John Barratt added a comment -

          Sorry to hear you've got the same problem Hayden, though I am selfishly pleased that there is someone else out there who may be able to help get to the bottom of it. Our situation seems absolutely identical to yours. It is still happening to us; we have just put in place some workarounds to minimise the impact, as we aren't in a position to downgrade to InnoDB due to the size of the data and the hardware available. Dumping and recreating the table 'fixes' it with no lost data, it seems (it is only the index that is breaking), but that isn't a practical solution.
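
          For what it's worth, the 'dump and recreate' workaround boils down to the steps below. We actually go via mysqldump and reload the file, but the copy-table form is the shortest way to show the idea; table names are placeholders, and it assumes the base data is still readable (in our case only the index appears to be broken).

          -- placeholder table names; the real process dumps to a file and reloads it
          CREATE TABLE XXXX_YYYYMMDD_new LIKE XXXX_YYYYMMDD;
          INSERT INTO XXXX_YYYYMMDD_new SELECT * FROM XXXX_YYYYMMDD;
          RENAME TABLE XXXX_YYYYMMDD TO XXXX_YYYYMMDD_broken,
                       XXXX_YYYYMMDD_new TO XXXX_YYYYMMDD;
          DROP TABLE XXXX_YYYYMMDD_broken;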

          We have tried to keep up with the latest releases of MariaDB, but it is still happening to us every few days, even on v10.1.20.

          One clarification on this: the table this corruption is occurring on is currently write-once-and-forget, i.e. it is effectively just logging data. The rows are never modified or deleted after insertion; a new table gets created each day and receives around 250 million new rows. Also, as before, an identical system with less data (about half the rows, same structure) has never seen this problem, though it also has less read load on the data.

          gao1738 dennis added a comment -

          I got the same bug in Percona Server 5.7 and reported it:
          https://jira.percona.com/browse/PS-3773

          But there is no solution yet.
          It seems that this bug only affects Percona Server 5.7 and the related MariaDB versions (10.1 or higher).


          elenst Elena Stepanova added a comment -

          The TokuDB engine is no longer maintained in MariaDB and, as of 10.5, is no longer released.


          People

            Assignee: Unassigned
            Reporter: John Barratt (johnbarratt)
            Votes: 1
            Watchers: 6

