[MDEV-19743] Crash while reorganizing an index page - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Duplicate
Affects Version/s: 10.3.14
Fix Version/s: 10.3.17, 10.4.7
Component/s: Storage Engine - InnoDB
Labels:
None

Description

A user saw the following crash:

2019-06-11 18:00:09 0 [Note] WSREP: Created page /data/galera/gcache.page.000000 of size 299963248 bytes

2019-06-11 18:00:18 0 [Note] WSREP: Created page /data/galera/gcache.page.000001 of size 162454568 bytes

2019-06-11 18:00:35 0 [Warning] WSREP: Failed to report last committed 34212182, -110 (Connection timed out)

2019-06-11 18:00:45 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000000

2019-06-11 18:00:57 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000001

2019-06-11 18:01:35 5 [ERROR] InnoDB: Page old data size 890 new data size 2347, page old max ins size 15356 new max ins size 13899

2019-06-11 18:01:35 5 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

2019-06-11 18:01:35 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 6181, cur2 offset 16009

190611 18:01:35 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.3.14-MariaDB-log

key_buffer_size=134217728

read_buffer_size=131072

max_used_connections=14

max_threads=5002

thread_count=16

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 11126969 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f43c00009a8

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7f43d625ac30 thread_stack 0x49000

*** buffer overflow detected ***: /usr/sbin/mysqld terminated

======= Backtrace: =========

/lib64/libc.so.6(__fortify_fail+0x37)[0x7f5d8a7f39e7]

/lib64/libc.so.6(+0x115b62)[0x7f5d8a7f1b62]

/lib64/libc.so.6(+0x117947)[0x7f5d8a7f3947]

/usr/sbin/mysqld(my_addr_resolve+0xda)[0x560c0de6e8fa]

/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x560c0de57f92]

/usr/sbin/mysqld(handle_fatal_signal+0x357)[0x560c0d8f7cc7]

/lib64/libpthread.so.0(+0xf5d0)[0x7f5d8c43f5d0]

/lib64/libc.so.6(gsignal+0x37)[0x7f5d8a712207]

/lib64/libc.so.6(abort+0x148)[0x7f5d8a7138f8]

/usr/sbin/mysqld(+0xa6c663)[0x560c0dbe1663]

/usr/sbin/mysqld(+0x99b801)[0x560c0db10801]

/usr/sbin/mysqld(+0xa7656e)[0x560c0dbeb56e]

/usr/sbin/mysqld(+0xa79b1d)[0x560c0dbeeb1d]

/usr/sbin/mysqld(+0xa7b016)[0x560c0dbf0016]

/usr/sbin/mysqld(+0xa8b94e)[0x560c0dc0094e]

/usr/sbin/mysqld(+0xa8e0ee)[0x560c0dc030ee]

/usr/sbin/mysqld(+0x9f51be)[0x560c0db6a1be]

/usr/sbin/mysqld(+0x9f7939)[0x560c0db6c939]

/usr/sbin/mysqld(+0x9f859b)[0x560c0db6d59b]

/usr/sbin/mysqld(+0x9b3d7a)[0x560c0db28d7a]

/usr/sbin/mysqld(+0xa219cd)[0x560c0db969cd]

/lib64/libpthread.so.0(+0x7dd5)[0x7f5d8c437dd5]

/lib64/libc.so.6(clone+0x6d)[0x7f5d8a7d9ead]

There seems to be two different errors here.

The first error is:

2019-06-11 18:01:35 5 [ERROR] InnoDB: Page old data size 890 new data size 2347, page old max ins size 15356 new max ins size 13899

2019-06-11 18:01:35 5 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

This first error seems to occur here:

https://github.com/MariaDB/server/blob/mariadb-10.3.14/storage/innobase/btr/btr0btr.cc#L1706

This first error was also seen in ~~MDEV-18519~~.

The second error is:

2019-06-11 18:01:35 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 6181, cur2 offset 16009

This second error seems to occur here:

https://github.com/MariaDB/server/blob/mariadb-10.3.14/storage/innobase/page/page0page.cc#L612

Some of the messages immediately prior to the crash indicate that a large Galera Cluster transaction was executed:

2019-06-11 18:00:09 0 [Note] WSREP: Created page /data/galera/gcache.page.000000 of size 299963248 bytes

2019-06-11 18:00:18 0 [Note] WSREP: Created page /data/galera/gcache.page.000001 of size 162454568 bytes

2019-06-11 18:00:35 0 [Warning] WSREP: Failed to report last committed 34212182, -110 (Connection timed out)

2019-06-11 18:00:45 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000000

2019-06-11 18:00:57 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000001

Note that "Failed to report last committed" sounds scary, but it is pretty harmless. See ~~MDEV-17550~~.

Attachments

Issue Links

duplicates

MDEV-18519 0x7f0118195700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.3.7/storage/innobase/btr/btr0cur.cc line 4308

Closed

MDEV-19916 Corruption after instant ADD/DROP and shrinking the index tree

Closed

relates to

MDEV-17550 Improve WSREP's "Failed to report last committed" warning message

Closed

MDEV-19783 Random crashes and corrupt data in INSTANT-added columns

Closed

MDEV-20413 Mariadb Galera Cluster Crash and won't restart

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2019-06-13 05:26

The error messages immediately preceding the intentional crash are complaining that the payload size of the index page is changing during a reorganize operation. The reorganize should only do ‘garbage collection’, not affecting the total size of the contained records.

It is possible that the page was corrupted when the reorganize was initiated.

I remember seeing this a lot during the development of ~~MDEV-15562~~ (instant DROP COLUMN) in 10.4, but that was fixed before the initial push. Here, it is being reported for 10.3. Potentially, ~~MDEV-11369~~ (instant ADD COLUMN) in 10.3 could affect this, but I would rather tend to believe that the bug is possible also in earlier versions.

I am assigning this to Matthias, for the creation of a test case.

Marko Mäkelä added a comment - 2019-06-13 05:26 The error messages immediately preceding the intentional crash are complaining that the payload size of the index page is changing during a reorganize operation. The reorganize should only do ‘garbage collection’, not affecting the total size of the contained records. It is possible that the page was corrupted when the reorganize was initiated. I remember seeing this a lot during the development of MDEV-15562 (instant DROP COLUMN) in 10.4, but that was fixed before the initial push. Here, it is being reported for 10.3. Potentially, MDEV-11369 (instant ADD COLUMN) in 10.3 could affect this, but I would rather tend to believe that the bug is possible also in earlier versions. I am assigning this to Matthias, for the creation of a test case.

Bernard Grymonpon added a comment - 2019-06-17 19:28

I just found this case, and this seems similar to what we encountered and I tried to describe in https://jira.mariadb.org/browse/MDEV-19783 (i linked both cases).

Bernard Grymonpon added a comment - 2019-06-17 19:28 I just found this case, and this seems similar to what we encountered and I tried to describe in https://jira.mariadb.org/browse/MDEV-19783 (i linked both cases).

Marko Mäkelä added a comment - 2019-08-23 11:43

I have a plausible explanation of this bug in ~~MDEV-19783~~, which makes also this bug a very likely duplicate of ~~MDEV-18519~~.

Marko Mäkelä added a comment - 2019-08-23 11:43 I have a plausible explanation of this bug in MDEV-19783 , which makes also this bug a very likely duplicate of MDEV-18519 .

Marko Mäkelä added a comment - 2019-11-12 10:04

I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4.
This error could be fully explained by ~~MDEV-19916~~, which was fixed in MariaDB Server 10.3.17 and 10.4.7.

Note: upgrading to a fixed version will not fix existing corruption. I am afraid that the corruption can only be fixed by restoring the table from a logical or physical backup. You might try your luck with

ALTER TABLE tablename FORCE;

Even if it did not crash, it could mean that some data (in particular, the contents of any instantly added columns) could be corrupted.

Marko Mäkelä added a comment - 2019-11-12 10:04 I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4. This error could be fully explained by MDEV-19916 , which was fixed in MariaDB Server 10.3.17 and 10.4.7. Note: upgrading to a fixed version will not fix existing corruption. I am afraid that the corruption can only be fixed by restoring the table from a logical or physical backup. You might try your luck with ALTER TABLE tablename FORCE ; Even if it did not crash, it could mean that some data (in particular, the contents of any instantly added columns) could be corrupted.

Marko Mäkelä added a comment - 2019-11-15 11:56

I think that this report duplicates ~~MDEV-19916~~, which was fixed in MariaDB 10.3.17 and 10.4.7.

Marko Mäkelä added a comment - 2019-11-15 11:56 I think that this report duplicates MDEV-19916 , which was fixed in MariaDB 10.3.17 and 10.4.7.

MariaDB Server

Crash while reorganizing an index page

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration