[MDEV-11192] Error "Checksum failure while reading node partition in file" crashes the server

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Not a Bug
Affects Version/s: 10.0.27
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
- tokudb
Environment:

Gentoo linux - 4.4.8-hardened-r1 with PAX enabled
Supermicro machine with:
- two AMD Opteron(TM) Processor 6274, 64GB memory (approx. 40GB regularly in use)
- 4x 512GB SSD drives (different vendors) -> Linux MDRAID RAID5 -> LVM for /var/lib/mysql
emerge info and my.cnf attached


Description

After replacing the SSD drives underneath the RAID+LVM stack (256GB drives swapped for larger 512GB drives) and extending the LVM logical volume, the server started crashing with:

"Checksum failure while reading node partition in file ..." errors on different TokuDB files (example crash.log attached). Most of the time it's the main files, sometimes an index.
The table structure is attached. The crash can be reproduced easily by reading the whole table (or by identifying the failing part of the data range and reading its data).
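
Narrowing down the failing part of the data range can be done by bisection. The following is only a minimal sketch, assuming the table has an integer primary key. `find_failing_ranges` and the `reads_ok` probe are hypothetical names, not part of MariaDB or TokuDB; in practice `reads_ok(lo, hi)` would run a range `SELECT` over keys in `[lo, hi)` and report whether it completed, and since a failed read here crashes the server, it would also have to wait for the server to come back before returning.

```python
from typing import Callable, List, Tuple


def find_failing_ranges(
    lo: int,
    hi: int,
    reads_ok: Callable[[int, int], bool],
    min_width: int = 1000,
) -> List[Tuple[int, int]]:
    """Bisect the key range [lo, hi) and return the failing sub-ranges.

    reads_ok(lo, hi) is a caller-supplied (hypothetical) probe that returns
    True when every row with a key in [lo, hi) can be read.
    Returned ranges are at most min_width keys wide.
    """
    if lo >= hi or reads_ok(lo, hi):
        return []                      # empty or fully readable: nothing to report
    if hi - lo <= min_width:
        return [(lo, hi)]              # small enough: report this range as failing
    mid = lo + (hi - lo) // 2
    # Recurse into both halves; more than one region may be damaged.
    return (find_failing_ranges(lo, mid, reads_ok, min_width)
            + find_failing_ranges(mid, hi, reads_ok, min_width))
```

If a failure is intermittent, a range can fail as a whole while both halves read cleanly, in which case the function returns nothing for it. That outcome is itself a hint that the problem lies below the database.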

The only solution was to drop the index or the table. Dropping a table hurt, because it contained 10M-100M rows of data.
What helped was splitting large tables into partitions: smaller ones no longer crash, but those containing 30M+ rows per partition still crash from time to time.

My question is whether this is related to the storage layer or to TokuDB/MariaDB.
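
One way to probe the storage layer independently of the database is a write and read-back checksum test on the same volume. The sketch below is an illustration under stated assumptions, not a procedure taken from this report: the function names are made up for the example, and the file has to be large enough (or the page cache dropped between passes, e.g. via `/proc/sys/vm/drop_caches` as root on Linux) so that the re-reads actually reach the drives.

```python
import hashlib
import os


def write_test_file(path: str, blocks: int = 64, block_size: int = 1 << 20) -> str:
    """Write `blocks` random blocks to `path`, fsync, and return the SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(blocks):
            block = os.urandom(block_size)
            digest.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())           # push the data out of userspace and kernel buffers
    return digest.hexdigest()


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Re-read the file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def count_mismatches(path: str, expected: str, passes: int = 3) -> int:
    """Re-read the file `passes` times and count digests that differ from `expected`."""
    return sum(1 for _ in range(passes) if file_sha256(path) != expected)
```

A non-zero mismatch count points at the layers below the database (drives, controller, RAID, memory). A clean run does not prove the hardware is healthy, since intermittent faults may simply not show up during the test.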

Attachments

- 20160711-mysqld.err (9 kB, 2016-11-07 15:21)
- 201611071723-mysql.err (12 kB, 2016-11-07 16:50)
- 201611110203-innodb-mysqld.err (10 kB, 2016-11-11 15:49)
- 20161118-mysqld.err (451 kB, 2016-11-18 09:46)
- core-gdb-201611071338.txt (33 kB, 2016-11-07 15:21)
- core-gdb-201611071723.txt (36 kB, 2016-11-07 16:50)
- core-gdb-full-201611071338.txt (46 kB, 2016-11-07 16:53)
- core-gdb-full-201611071338.txt (46 kB, 2016-11-07 16:50)
- core-gdb-full-201611071723.txt (49 kB, 2016-11-07 16:50)
- crash.log (1.0 kB, 2016-10-31 14:27)
- emerge.info (5 kB, 2016-10-31 14:19)
- my.cnf (4 kB, 2016-10-31 14:19)
- scanner.sql (1 kB, 2016-10-31 14:25)

Activity

8 older comments are not shown.

Jan Hejl added a comment - 2016-11-15 15:58

Nothing. MySQL recovered by itself without intervention.
A few days later MySQL wasn't able to recover from an InnoDB error; the only way out was to set innodb_force_recovery = 1 and dump the DB contents.

Because another MariaDB instance I use has its datadir on an XFS filesystem, I switched this instance to XFS too: dumped from EXT4 and restored to XFS. So far so good.
e2fsck shows the EXT4 filesystem as clean; if something goes wrong again, that should confirm a hard drive issue.

Jan Hejl added a comment - 2016-11-18 09:48

Unfortunately it happened even on the XFS filesystem + clean LVM volume. 20161118-mysqld.err attached. This time the options:

stack-trace
disable-gdb

were commented out.

Is there anything else I can do? Or can someone tell whether it's a problem with the hard drives?

Elena Stepanova added a comment - 2016-11-18 15:24

jplindst,

Could you please take a look at the last comments (InnoDB-related)? Do you think it's likely to be a hardware problem?

I find it additionally confusing that one time recovery failed and another time it succeeded without any intervention.

Jan Lindström (Inactive) added a comment - 2016-11-30 06:12 - edited

Hi, yes, most likely a hardware problem: both TokuDB and InnoDB complain about corrupted database pages. I do not know enough about TokuDB to comment on those logs, but the InnoDB error is clearly about the fact that the page we read is corrupted.

R: Jan

Jan Hejl added a comment - 2016-11-30 11:49

Thanks for the comment.

I can confirm that for the past few days I have been struggling with a "FLUSH CACHE EXT" error on one of the RAID drives. Interestingly, the drive was replaced with a new one, even a newer model, and the error happened again.
After updating the machine's BIOS, TokuDB stopped crashing but the drive stopped working again. When I turned the machine off, took the drive out and put it back in, it operated normally for some time. Thus it's clearly some HW error.

Thanks for all the help provided. I think this ticket can be closed.

People

Assignee: Jan Lindström (Inactive)
Reporter: Jan Hejl
Votes: 0
Watchers: 3

Dates

Created: 2016-10-31 14:38
Updated: 2016-11-30 13:11
Resolved: 2016-11-30 13:11
