[MDEV-32044] Mariadb crash after upgrading to 11.0.3: Failing assertion: local_len >= BTR_EXTERN_FIELD_REF_SIZE - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Blocker
Resolution: Fixed
Affects Version/s: 11.0.3
Fix Version/s: 11.0.4, 11.1.3
Component/s: Server, Storage Engine - InnoDB
Labels:
- crash
Environment:
OpenSUSE 15.4

Description

After hitting this issue https://jira.mariadb.org/browse/MDEV-31579 and waiting for a fix in 11.0.3, I thought we were in the clear.

However, a couple of minutes into running 11.0.3, the server crashed, and now keeps crashing every few minutes.

2023-08-30 14:09:49 0x7fadf061e700  InnoDB: Assertion failure in file /home/abuild/rpmbuild/BUILD/mariadb-11.0.3/storage/innobase/btr/btr0cur.cc line 6702

InnoDB: Failing assertion: local_len >= BTR_EXTERN_FIELD_REF_SIZE

InnoDB: We intentionally generate a memory trap.

InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

InnoDB: If you get repeated assertion failures or crashes, even

InnoDB: immediately after the mariadbd startup, there may be

InnoDB: corruption in the InnoDB tablespace. Please refer to

InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/

InnoDB: about forcing recovery.

230830 14:09:49 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 11.0.3-MariaDB-log source revision: 70905bcb9059dcc40db3b73bc46a36c7d40f1e10

key_buffer_size=536870912

read_buffer_size=262144

max_used_connections=24

max_threads=2050

thread_count=22

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5301930 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fad3c000c58

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7fadf061dd08 thread_stack 0x49000

/usr/sbin/mysqld(my_print_stacktrace+0x3d)[0x55ea4835712d]

/usr/sbin/mysqld(handle_fatal_signal+0x575)[0x55ea47e60215]

corrupted size vs. prev_size

Any idea what this crash is about?

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

ibuf_page.bin.xz
0.1 kB
2023-09-19 12:25
image-2023-09-07-12-33-35-832.png
2 kB
2023-09-07 19:33
page.bin
16 kB
2023-09-01 04:39
screenshot-1.png
86 kB
2023-09-01 06:36

Issue Links

blocks

MDEV-31443 assert [FATAL] InnoDB: Unable to find charset-collation for 254 after upgrade from 10.11.4 to 11.0

Closed

MDEV-31579 Mariadb fails to start following automatic upgrade to 11.0.2 - Unable to find charset-collation for 271

Closed

relates to

MDEV-13542 Crashing on a corrupted page is unhelpful

Closed

MDEV-32132 DROP INDEX followed by CREATE INDEX may corrupt data

Closed

MDEV-29694 Remove the InnoDB change buffer

Closed

Activity

Ascending order - Click to sort in descending order

View 19 older comments

Artem Russakovskii added a comment - 2023-09-08 20:47

Thank you, Marko, for all your help. As we reached a conclusion that it wasn't possible to repair (or at least easily repair) the table and prevent crashes, I blew up the slave data on the 11.0.3 machine, ran an extended check and then fixed whatever index issues that were found with an optimize command on another slave that was still on 10.11, and then copied that slave's fixed data onto the 11.0.3 machine. This time, the crashes did not occur anymore, and the server has been running smoothly.

Artem Russakovskii added a comment - 2023-09-08 20:47 Thank you, Marko, for all your help. As we reached a conclusion that it wasn't possible to repair (or at least easily repair) the table and prevent crashes, I blew up the slave data on the 11.0.3 machine, ran an extended check and then fixed whatever index issues that were found with an optimize command on another slave that was still on 10.11, and then copied that slave's fixed data onto the 11.0.3 machine. This time, the crashes did not occur anymore, and the server has been running smoothly.

Marko Mäkelä added a comment - 2023-09-13 14:57

I think that we can try to make the upgrade logic similar to how the slow shutdown fix in ~~MDEV-30009~~ works: disregard any buffered changes for pages for which there are no buffered changes according to the change buffer bitmap.

This should avoid corruption of any actual data (clustered index pages), unless the change buffer bitmaps are really corrupted.

Marko Mäkelä added a comment - 2023-09-13 14:57 I think that we can try to make the upgrade logic similar to how the slow shutdown fix in MDEV-30009 works: disregard any buffered changes for pages for which there are no buffered changes according to the change buffer bitmap. This should avoid corruption of any actual data (clustered index pages), unless the change buffer bitmaps are really corrupted.

Marko Mäkelä added a comment - 2023-09-19 10:58

To prove that the current upgrade in 11.0 does not care about the change buffer bitmap bits, I ran the following test modification on an older 10.6 server (right before ~~MDEV-30400~~) and then started 11.0 on the data directory that was left behind by the failing test:

diff --git a/mysql-test/suite/innodb/t/ibuf_not_empty.test b/mysql-test/suite/innodb/t/ibuf_not_empty.test

index 9362f8daffa..a1895f4c49c 100644

--- a/mysql-test/suite/innodb/t/ibuf_not_empty.test

+++ b/mysql-test/suite/innodb/t/ibuf_not_empty.test

@@ -33,6 +33,8 @@ SET GLOBAL innodb_change_buffering=all;

 # used for modifying the secondary index page. There must be multiple

 # index pages, because changes to the root page are never buffered.

 INSERT INTO t1 SELECT 0,'x',1 FROM seq_1_to_1024;

+UPDATE t1 SET b='y';

+UPDATE t1 SET b='x';

 let MYSQLD_DATADIR=`select @@datadir`;

 let PAGE_SIZE=`select @@innodb_page_size`;

@@ -69,6 +71,7 @@ syswrite(FILE, $page, $ps)==$ps || die "Unable to write $file\n";

 close(FILE) || die "Unable to close $file";

EOF

+stop_already_here;

 --let $restart_parameters= --innodb-force-recovery=6 --innodb-change-buffer-dump

 --source include/start_mysqld.inc

This is similar to the upgrade test ~~MDEV-29694~~, except that we stop after some Perl code had corrupted the change buffer bitmap in t1.ibd to claim that no buffered changes exist. It is noteworthy that this test fails to actually use the change buffer when the ~~MDEV-30400~~ fix is present, so I used an older build. Regrettably I did not record the exact revision in ~~MDEV-29694~~:

10.6 e0dbec1ce38688e25cab3720044d6e7e676381f6
innodb.ibuf_not_empty 'innodb,strict_crc32' [ fail ]
Test ended at 2023-09-19 13:52:09

CURRENT_TEST: innodb.ibuf_not_empty
mysqltest: At line 74: query 'stop_already_here' failed: <Unknown> (2006): Server has gone away

Unfortunately, no matter what I try (including stopping the test before the Perl code corrupts the .ibd file), I keep getting the following output on the upgrade:

2023-09-19 13:53:11 0 [Note] InnoDB: Upgrading the change buffer

2023-09-19 13:53:11 0 [Note] InnoDB: Upgraded the change buffer: 0 tablespaces, 0 pages

That is, the change buffer was actually empty to begin with.

I will next attempt creating a modified version of the data.tar.xz that was attached to ~~MDEV-31443~~, with a modified change buffer bitmap that claims that no changes need to be applied. By using such a modified data directory, I expect to get some corrupted secondary indexes reported when this bug has been fixed. The plain 11.0 without the fix should recover that kind of a corruption fine (apply the possibly stale changes to the index pages).

Marko Mäkelä added a comment - 2023-09-19 10:58 To prove that the current upgrade in 11.0 does not care about the change buffer bitmap bits, I ran the following test modification on an older 10.6 server (right before MDEV-30400 ) and then started 11.0 on the data directory that was left behind by the failing test: diff --git a/mysql-test/suite/innodb/t/ibuf_not_empty.test b/mysql-test/suite/innodb/t/ibuf_not_empty.test index 9362f8daffa..a1895f4c49c 100644 --- a/mysql-test/suite/innodb/t/ibuf_not_empty.test +++ b/mysql-test/suite/innodb/t/ibuf_not_empty.test @@ -33,6 +33,8 @@ SET GLOBAL innodb_change_buffering=all; # used for modifying the secondary index page. There must be multiple # index pages, because changes to the root page are never buffered. INSERT INTO t1 SELECT 0,'x',1 FROM seq_1_to_1024; +UPDATE t1 SET b='y'; +UPDATE t1 SET b='x'; let MYSQLD_DATADIR=`select @@datadir`; let PAGE_SIZE=`select @@innodb_page_size`; @@ -69,6 +71,7 @@ syswrite(FILE, $page, $ps)==$ps || die "Unable to write $file\n"; close(FILE) || die "Unable to close $file"; EOF +stop_already_here; --let $restart_parameters= --innodb-force-recovery=6 --innodb-change-buffer-dump --source include/start_mysqld.inc This is similar to the upgrade test MDEV-29694 , except that we stop after some Perl code had corrupted the change buffer bitmap in t1.ibd to claim that no buffered changes exist. It is noteworthy that this test fails to actually use the change buffer when the MDEV-30400 fix is present, so I used an older build. Regrettably I did not record the exact revision in MDEV-29694 : 10.6 e0dbec1ce38688e25cab3720044d6e7e676381f6 innodb.ibuf_not_empty 'innodb,strict_crc32' [ fail ] Test ended at 2023-09-19 13:52:09 CURRENT_TEST: innodb.ibuf_not_empty mysqltest: At line 74: query 'stop_already_here' failed: <Unknown> (2006): Server has gone away Unfortunately, no matter what I try (including stopping the test before the Perl code corrupts the .ibd file), I keep getting the following output on the upgrade: 2023-09-19 13:53:11 0 [Note] InnoDB: Upgrading the change buffer 2023-09-19 13:53:11 0 [Note] InnoDB: Upgraded the change buffer: 0 tablespaces, 0 pages That is, the change buffer was actually empty to begin with. I will next attempt creating a modified version of the data.tar.xz that was attached to MDEV-31443 , with a modified change buffer bitmap that claims that no changes need to be applied. By using such a modified data directory, I expect to get some corrupted secondary indexes reported when this bug has been fixed. The plain 11.0 without the fix should recover that kind of a corruption fine (apply the possibly stale changes to the index pages).

Marko Mäkelä added a comment - 2023-09-19 12:28

I created ibuf_page.bin.xz to corrupt the data.tar.xz of ~~MDEV-31443~~ as follows:

tar xJf data.tar.xz

xzcat ibuf_page.bin.xz|

dd bs=4096 count=1 seek=1 conv=notrunc of=data/test/table100_innodb_int_autoinc.ibd

If I start up the unmodified 11.0 server on this, it will apply changes to one page:

11.0 d515fa72a0294d97e3f8d6894714c10331bd771d
2023-09-19 15:23:49 0 [Note] InnoDB: Upgrading the change buffer
2023-09-19 15:23:49 0 [Note] InnoDB: Upgraded the change buffer: 1 tablespaces, 1 pages

After this, CHECK TABLE test.table100_innodb_int_autoinc EXTENDED; does not report any error. I hope that after my not-yet-written code fix, it will report some error. (Yes, this is corrupting the file the "wrong way around", opposite of ~~MDEV-32132~~.)

Marko Mäkelä added a comment - 2023-09-19 12:28 I created ibuf_page.bin.xz to corrupt the data.tar.xz of MDEV-31443 as follows: tar xJf data.tar.xz xzcat ibuf_page.bin.xz| dd bs=4096 count=1 seek=1 conv=notrunc of=data/test/table100_innodb_int_autoinc.ibd If I start up the unmodified 11.0 server on this, it will apply changes to one page: 11.0 d515fa72a0294d97e3f8d6894714c10331bd771d 2023-09-19 15:23:49 0 [Note] InnoDB: Upgrading the change buffer 2023-09-19 15:23:49 0 [Note] InnoDB: Upgraded the change buffer: 1 tablespaces, 1 pages After this, CHECK TABLE test.table100_innodb_int_autoinc EXTENDED; does not report any error. I hope that after my not-yet-written code fix, it will report some error. (Yes, this is corrupting the file the "wrong way around", opposite of MDEV-32132 .)

Marko Mäkelä added a comment - 2023-09-19 15:11

The above data set contains exactly 2 buffered changes in the change buffer root page (file offset 0x4000 in data/ibdata1), both for the same record in page 5:0x3a: a delete-marking and an insert of the record at file byte offset 0x3adbe (page 0x3a, byte 0xdbe) in file table100_innodb_int_autoinc.ibd. The funny thing is that these two operations will cancel out each other. So, for this particular dataset, there will be no corruption no matter whether the changes to the page are merged or discarded.

I debugged with rr record and rr replay that the revised upgrade logic works as expected: the bitmap bit in the original data.tar.xz is set, and clear when I artificially corrupt the change buffer bitmap page. The changes were applied or discarded accordingly. The message

2023-09-19 18:03:59 0 [Note] InnoDB: Upgraded the change buffer: 1 tablespaces, 1 pages

will count all pages for which change buffer records exist, including those for which the changes will be skipped due to the IBUF_BITMAP_BUFFERED bit being clean.

Marko Mäkelä added a comment - 2023-09-19 15:11 The above data set contains exactly 2 buffered changes in the change buffer root page (file offset 0x4000 in data/ibdata1 ), both for the same record in page 5:0x3a: a delete-marking and an insert of the record at file byte offset 0x3adbe (page 0x3a, byte 0xdbe) in file table100_innodb_int_autoinc.ibd . The funny thing is that these two operations will cancel out each other. So, for this particular dataset, there will be no corruption no matter whether the changes to the page are merged or discarded. I debugged with rr record and rr replay that the revised upgrade logic works as expected: the bitmap bit in the original data.tar.xz is set, and clear when I artificially corrupt the change buffer bitmap page. The changes were applied or discarded accordingly. The message 2023-09-19 18:03:59 0 [Note] InnoDB: Upgraded the change buffer: 1 tablespaces, 1 pages will count all pages for which change buffer records exist, including those for which the changes will be skipped due to the IBUF_BITMAP_BUFFERED bit being clean.

MariaDB Server

Mariadb crash after upgrading to 11.0.3: Failing assertion: local_len >= BTR_EXTERN_FIELD_REF_SIZE

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration