Details

    Description

      We have started to see almost daily corruption of one of our core tables using TokuDB with FAST compression enabled. These tables are manually partitioned by day, and have 180-200 million rows added each day. A small percentage of these rows are deleted (< 5%), and another portion (< 10%) are updated one or more times through the day.

      Here is an example table:

      *************************** 8. row ***************************
                 Name: XXXX_20160107
               Engine: TokuDB
              Version: 10
           Row_format: Dynamic
                 Rows: 188682009
       Avg_row_length: 224
          Data_length: 42396852041
      Max_data_length: 9223372036854775807
         Index_length: 13764087186
            Data_free: 19841155072
       Auto_increment: 228709329
          Create_time: 2016-01-06 00:00:01
          Update_time: 2016-01-08 23:46:54
           Check_time: NULL
            Collation: utf8_general_ci
             Checksum: NULL
       Create_options: `COMPRESSION`=TOKUDB_FAST
              Comment:
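
      For reference, the daily tables are created along the following lines. The column definitions here are only illustrative placeholders (not our real schema); the engine, charset and per-table compression attribute are written as they are reported in Create_options above.

      CREATE TABLE XXXX_20160107 (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- placeholder columns only
        ts      DATETIME NOT NULL,
        payload VARCHAR(255),
        PRIMARY KEY (id),
        KEY idx_ts (ts)
      ) ENGINE=TokuDB DEFAULT CHARSET=utf8
        `COMPRESSION`=TOKUDB_FAST;  -- compression attribute as shown in Create_options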

      We find that at some point the table becomes corrupt, and that trying to select from a certain sequence of rows fails and reports the following error:

      ERROR 1034 (HY000): Incorrect key file for table 'XXXXX_20160108'; try to repair it
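
      For illustration, the failing reads are ranged selects of roughly this shape (the id column name and the key range are placeholders, not our real schema or values):

      -- placeholder query; the exact key range that fails varies from table to table
      SELECT * FROM XXXXX_20160108 WHERE id BETWEEN 150000000 AND 150100000;
      -- ERROR 1034 (HY000): Incorrect key file for table 'XXXXX_20160108'; try to repair it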

      As I understand it, it isn't possible to repair a TokuDB table; the only option is to restore from backups. This server isn't currently running with a slave, so I can't verify whether a slave would have been corrupted in the same way.

      The database is running on a pair of mirrored SSDs, and only holds up to 8 days' worth of these tables, plus some views pointing to those tables. Also of note, we have had this problem on 10.1.9 and 10.1.10, and on two different servers running different hardware, with different models of CPU and SSDs, so it doesn't appear to be a hardware issue. One of the servers had ECC memory and one didn't.

      I am going to turn off deletions on this table and just flag rows as deleted instead, to see if that changes the rate of occurrence, or fixes it completely and points to deletion of rows as the problem. My only other anecdotal evidence is that this seems to have started only after we began doing relatively heavy updates of the data; I don't think it ever occurred when we were only doing inserts, selects and deletes.
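
      The planned change amounts to replacing the hard deletes with a soft-delete flag, along these lines (the is_deleted column name is just a placeholder for whatever flag we add):

      -- one-off: add a soft-delete flag (placeholder column name)
      ALTER TABLE XXXX_20160107 ADD COLUMN is_deleted TINYINT(1) NOT NULL DEFAULT 0;

      -- instead of:  DELETE FROM XXXX_20160107 WHERE id = 12345;
      UPDATE XXXX_20160107 SET is_deleted = 1 WHERE id = 12345;  -- placeholder id value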

      Given that this is happening relatively infrequently (we have had 5 tables corrupted this way over the last week or so), seemingly at random, and on such a large dataset, I am not sure how else to try to debug this further or what other information may be of help, but please let me know what other details may be of assistance in resolving this issue.

      Attachments

        Activity

          elenst Elena Stepanova added a comment -

          hayden,

          Yes, unfortunately the bug has had no action for over half a year because we couldn't create a repeatable test case. Can you provide one?
          If we have it, we can check it against upstream TokuDB and if the problem is repeatable there, re-file the bug for them. If it turns out to be MariaDB specific, our development will take care of it.

          hayden Hayden Clark added a comment -

          It's a bit tricky. The problem happens occasionally, under heavy load, with very large datasets. Even if scripting a test case were possible, I have no idea how sensitive the bug is to the exact conditions.

          This is one of those really hard bugs to spot. The actual error may lie in the table or its indexes for days before a query hits the bad patch.

          How can we proceed with this? What logs can I turn on to diagnose this in the future?
          Without a fix, I'll have to downgrade to InnoDB, and that will reduce performance.

          johnbarratt John Barratt added a comment -

          Sorry to hear you've got the same problem Hayden, though I am selfishly pleased that there is someone else out there who may be able to help get to the bottom of it. Our situation seems absolutely identical to yours. It is still happening to us; we have just put in place some workarounds to minimise the impact, as we aren't in a position to downgrade to InnoDB due to the size of the data and the hardware available. Dumping and recreating the table 'fixes' it with no lost data, it seems (it is only the index that is breaking), but that isn't a practical solution.
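
          For what it's worth, the 'dump and recreate' workaround boils down to the steps below. We actually go via mysqldump and reload the file, but the copy-table form is the shortest way to show the idea; table names are placeholders, and it assumes the base data is still readable (in our case only the index appears to be broken).

          -- placeholder table names; the real process dumps to a file and reloads it
          CREATE TABLE XXXX_YYYYMMDD_new LIKE XXXX_YYYYMMDD;
          INSERT INTO XXXX_YYYYMMDD_new SELECT * FROM XXXX_YYYYMMDD;
          RENAME TABLE XXXX_YYYYMMDD TO XXXX_YYYYMMDD_broken,
                       XXXX_YYYYMMDD_new TO XXXX_YYYYMMDD;
          DROP TABLE XXXX_YYYYMMDD_broken;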

          We have tried to keep up with the latest releases of MariaDB, but it is still happening to us every few days, even on v10.1.20.

          One clarification on this: the table this corruption is occurring on is currently write-once-and-forget, i.e. it is effectively just logging data. The rows are never modified or deleted after insertion; a new table gets created each day and receives around 250 million new rows. Also, as before, an identical system with less data (about half the rows, same structure) has never seen this problem, though it also has less read load on the data.

          gao1738 dennis added a comment -

          I got the same bug in Percona Server 5.7 and reported it:
          https://jira.percona.com/browse/PS-3773

          But there is no solution yet.
          It seems that this bug only affects Percona Server 5.7 and the related MariaDB versions (10.1 or higher).


          elenst Elena Stepanova added a comment -

          The TokuDB engine is no longer maintained in MariaDB and, as of 10.5, is no longer released.


          People

            Assignee: Unassigned
            Reporter: John Barratt (johnbarratt)
            Votes: 1
            Watchers: 6

