  1. MariaDB Server
  2. MDEV-11192

Error "Checksum failure while reading node partition in file" makes server crash

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.0.27
    • Fix Version/s: N/A

    Description

      After replacing the SSD drives underneath the RAID+LVM setup (256 GB drives with larger 512 GB drives) and extending the LVM logical volume, the server started crashing with:

      "Checksum failure while reading node partition in file ..." errors on different TokuDB files (example crash.log attached). Most of the time it is the main files, sometimes an index.
      The table structure is attached. The crash can be reproduced easily by reading the whole table (or by identifying the failing part of the data range and reading that data).
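Narrowing down the failing part of the data range can be done by bisection. The following is only a minimal sketch under stated assumptions: `find_failing_key` and the `read_range` callback are hypothetical names (not part of MariaDB or TokuDB), keys are assumed to be integers, and a read over a given range is assumed to fail reproducibly.

```python
from typing import Callable, Optional


def find_failing_key(read_range: Callable[[int, int], None],
                     lo: int, hi: int) -> Optional[int]:
    """Return the smallest key in [lo, hi] whose read fails, or None.

    read_range(a, b) is a hypothetical callback that reads every row with
    a <= key <= b and raises an exception when the read fails (for example
    because the server crashed and the connection was lost).
    """
    def fails(a: int, b: int) -> bool:
        try:
            read_range(a, b)
            return False
        except Exception:
            return True

    # Nothing to narrow down if the whole range reads cleanly.
    if not fails(lo, hi):
        return None

    # Invariant: [lo, hi] contains a failing key and no key below lo fails.
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(lo, mid):
            hi = mid          # a failing key lies in the lower half
        else:
            lo = mid + 1      # lower half is clean, so look in the upper half
    return lo
```

In practice `read_range` would issue a range query against the affected table, and the server would have to be restarted after every crash, so the number of probes (about log2 of the key range) matters.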

      The only solution was to drop the affected index or table. Dropping a table hurt, because the tables contained 10M-100M rows of data.
      What helped was splitting large tables into partitions: the smaller ones no longer crash, but the ones containing 30M+ rows per partition still crash from time to time.

      My question is whether this is related to the storage layer or to TokuDB/MariaDB.

      Attachments

        1. 20160711-mysqld.err
          9 kB
        2. 201611071723-mysql.err
          12 kB
        3. 201611110203-innodb-mysqld.err
          10 kB
        4. 20161118-mysqld.err
          451 kB
        5. core-gdb-201611071338.txt
          33 kB
        6. core-gdb-201611071723.txt
          36 kB
        7. core-gdb-full-201611071338.txt
          46 kB
        8. core-gdb-full-201611071338.txt
          46 kB
        9. core-gdb-full-201611071723.txt
          49 kB
        10. crash.log
          1.0 kB
        11. emerge.info
          5 kB
        12. my.cnf
          4 kB
        13. scanner.sql
          1 kB

        Activity

          jhejl Jan Hejl added a comment -

          Nothing. MySQL recovered by itself without intervention.
          A few days later MySQL wasn't able to recover from an InnoDB error; the only way out was to set innodb_force_recovery = 1 and dump the DB contents.

          Because another MariaDB instance I use has its datadir on an XFS filesystem, I switched this instance to XFS too: dumped from EXT4 and restored to XFS. So far so good.
          e2fsck reports the EXT4 filesystem as clean; if something goes wrong again, that should confirm a hard drive issue.
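For reference, the recovery setting mentioned above is normally placed in the server section of my.cnf. This is only a sketch of that fragment; the `[mysqld]` section name is the conventional one and is an assumption about this particular my.cnf. The setting is meant to be temporary and removed again once the dump has been taken.

```ini
[mysqld]
# Let the server start despite detected page corruption so the data can be dumped.
innodb_force_recovery = 1
```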

          jhejl Jan Hejl added a comment -

          Unfortunately it happened even on the XFS filesystem + clean LVM volume. 20161118-mysqld.err attached. This time the options

          stack-trace
          disable-gdb

          were commented out.

          Is there something else I can do? Or can someone tell whether it's a problem with the hard drives?


          elenst Elena Stepanova added a comment -

          jplindst,

          Could you please take a look at the last comments (InnoDB-related)? Do you think it's likely to be a hardware problem?

          I find it additionally confusing that one time recovery failed and another time it succeeded without any intervention.

          jplindst Jan Lindström (Inactive) added a comment - - edited

          Hi, yes, this is most likely a hardware problem: both TokuDB and InnoDB complain about corrupted database pages. I do not know enough about TokuDB to comment on those logs, but the InnoDB error is clearly about the page we read being corrupted.

          R: Jan
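To illustrate why a checksum error points at corrupted data rather than at a query problem, here is a generic sketch of checksum-verified page reads. It is not the on-disk format of InnoDB or TokuDB; the 4-byte CRC32 prefix and the `write_page` / `read_page` names are assumptions made for this example only.

```python
import zlib


def write_page(payload: bytes) -> bytes:
    """Return the payload prefixed with a 4-byte big-endian CRC32 of it."""
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload


def read_page(page: bytes) -> bytes:
    """Verify the stored checksum and return the payload, or raise IOError."""
    stored = int.from_bytes(page[:4], "big")
    payload = page[4:]
    if zlib.crc32(payload) != stored:
        # The bytes that came back differ from the bytes that were written.
        raise IOError("checksum failure while reading page")
    return payload
```

A single flipped bit between write and read is enough to make `read_page` raise, which is why such errors usually implicate whatever sits between the engine and the platters or flash cells: drive, controller, cabling, or memory.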

          jhejl Jan Hejl added a comment -

          Thanks for the comment.

          I can confirm that for the past few days I have been struggling with a "FLUSH CACHE EXT" error on one of the RAID drives. Interestingly, the drive was replaced with a new one, even a newer model, and the error happened again.
          After updating the machine's BIOS, TokuDB stopped crashing, but the drive stopped working again. When I turned the machine off, took the drive out and put it back in, it operated normally for some time. Thus it's clearly some HW error.

          Thanks for all the help provided. I think this ticket can be closed.


          People

            Assignee: jplindst Jan Lindström (Inactive)
            Reporter: jhejl Jan Hejl
            Votes: 0
            Watchers: 3