[MDEV-19743] Crash while reorganizing an index page Created: 2019-06-12  Updated: 2020-08-25  Resolved: 2019-11-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.14
Fix Version/s: 10.3.17, 10.4.7

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates MDEV-18519 0x7f0118195700 InnoDB: Assertion fai... Closed
duplicates MDEV-19916 Corruption after instant ADD/DROP and... Closed
Relates
relates to MDEV-17550 Improve WSREP's "Failed to report las... Open
relates to MDEV-19783 Random crashes and corrupt data in IN... Closed
relates to MDEV-20413 Mariadb Galera Cluster Crash and won'... Closed

 Description   

A user saw the following crash:

2019-06-11 18:00:09 0 [Note] WSREP: Created page /data/galera/gcache.page.000000 of size 299963248 bytes
2019-06-11 18:00:18 0 [Note] WSREP: Created page /data/galera/gcache.page.000001 of size 162454568 bytes
2019-06-11 18:00:35 0 [Warning] WSREP: Failed to report last committed 34212182, -110 (Connection timed out)
2019-06-11 18:00:45 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000000
2019-06-11 18:00:57 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000001
2019-06-11 18:01:35 5 [ERROR] InnoDB: Page old data size 890 new data size 2347, page old max ins size 15356 new max ins size 13899
2019-06-11 18:01:35 5 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
2019-06-11 18:01:35 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 6181, cur2 offset 16009
190611 18:01:35 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.3.14-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=14
max_threads=5002
thread_count=16
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 11126969 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f43c00009a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f43d625ac30 thread_stack 0x49000
*** buffer overflow detected ***: /usr/sbin/mysqld terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f5d8a7f39e7]
/lib64/libc.so.6(+0x115b62)[0x7f5d8a7f1b62]
/lib64/libc.so.6(+0x117947)[0x7f5d8a7f3947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x560c0de6e8fa]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x560c0de57f92]
/usr/sbin/mysqld(handle_fatal_signal+0x357)[0x560c0d8f7cc7]
/lib64/libpthread.so.0(+0xf5d0)[0x7f5d8c43f5d0]
/lib64/libc.so.6(gsignal+0x37)[0x7f5d8a712207]
/lib64/libc.so.6(abort+0x148)[0x7f5d8a7138f8]
/usr/sbin/mysqld(+0xa6c663)[0x560c0dbe1663]
/usr/sbin/mysqld(+0x99b801)[0x560c0db10801]
/usr/sbin/mysqld(+0xa7656e)[0x560c0dbeb56e]
/usr/sbin/mysqld(+0xa79b1d)[0x560c0dbeeb1d]
/usr/sbin/mysqld(+0xa7b016)[0x560c0dbf0016]
/usr/sbin/mysqld(+0xa8b94e)[0x560c0dc0094e]
/usr/sbin/mysqld(+0xa8e0ee)[0x560c0dc030ee]
/usr/sbin/mysqld(+0x9f51be)[0x560c0db6a1be]
/usr/sbin/mysqld(+0x9f7939)[0x560c0db6c939]
/usr/sbin/mysqld(+0x9f859b)[0x560c0db6d59b]
/usr/sbin/mysqld(+0x9b3d7a)[0x560c0db28d7a]
/usr/sbin/mysqld(+0xa219cd)[0x560c0db969cd]
/lib64/libpthread.so.0(+0x7dd5)[0x7f5d8c437dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f5d8a7d9ead]

There seems to be two different errors here.

The first error is:

2019-06-11 18:01:35 5 [ERROR] InnoDB: Page old data size 890 new data size 2347, page old max ins size 15356 new max ins size 13899
2019-06-11 18:01:35 5 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

This first error seems to occur here:

https://github.com/MariaDB/server/blob/mariadb-10.3.14/storage/innobase/btr/btr0btr.cc#L1706

This first error was also seen in MDEV-18519.

The second error is:

2019-06-11 18:01:35 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 6181, cur2 offset 16009

This second error seems to occur here:

https://github.com/MariaDB/server/blob/mariadb-10.3.14/storage/innobase/page/page0page.cc#L612

Some of the messages immediately prior to the crash indicate that a large Galera Cluster transaction was executed:

2019-06-11 18:00:09 0 [Note] WSREP: Created page /data/galera/gcache.page.000000 of size 299963248 bytes
2019-06-11 18:00:18 0 [Note] WSREP: Created page /data/galera/gcache.page.000001 of size 162454568 bytes
2019-06-11 18:00:35 0 [Warning] WSREP: Failed to report last committed 34212182, -110 (Connection timed out)
2019-06-11 18:00:45 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000000
2019-06-11 18:00:57 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000001

Note that "Failed to report last committed" sounds scary, but it is pretty harmless. See MDEV-17550.



 Comments   
Comment by Marko Mäkelä [ 2019-06-13 ]

The error messages immediately preceding the intentional crash are complaining that the payload size of the index page is changing during a reorganize operation. The reorganize should only do ‘garbage collection’, not affecting the total size of the contained records.

It is possible that the page was corrupted when the reorganize was initiated.

I remember seeing this a lot during the development of MDEV-15562 (instant DROP COLUMN) in 10.4, but that was fixed before the initial push. Here, it is being reported for 10.3. Potentially, MDEV-11369 (instant ADD COLUMN) in 10.3 could affect this, but I would rather tend to believe that the bug is possible also in earlier versions.

I am assigning this to Matthias, for the creation of a test case.

Comment by Bernard Grymonpon [ 2019-06-17 ]

I just found this case, and this seems similar to what we encountered and I tried to describe in https://jira.mariadb.org/browse/MDEV-19783 (i linked both cases).

Comment by Marko Mäkelä [ 2019-08-23 ]

I have a plausible explanation of this bug in MDEV-19783, which makes also this bug a very likely duplicate of MDEV-18519.

Comment by Marko Mäkelä [ 2019-11-12 ]

I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4.
This error could be fully explained by MDEV-19916, which was fixed in MariaDB Server 10.3.17 and 10.4.7.

Note: upgrading to a fixed version will not fix existing corruption. I am afraid that the corruption can only be fixed by restoring the table from a logical or physical backup. You might try your luck with

ALTER TABLE tablename FORCE;

Even if it did not crash, it could mean that some data (in particular, the contents of any instantly added columns) could be corrupted.

Comment by Marko Mäkelä [ 2019-11-15 ]

I think that this report duplicates MDEV-19916, which was fixed in MariaDB 10.3.17 and 10.4.7.

Generated at Thu Feb 08 08:54:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.