[MDEV-30009] InnoDB shutdown hangs when the change buffer is corrupted Created: 2022-11-14  Updated: 2023-09-08  Resolved: 2022-11-23

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.11.2, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: regression-10.5

Issue Links:
Blocks
blocks MDEV-26917 InnoDB: Clustered record for sec rec ... Closed
Duplicate
is duplicated by MDEV-31065 MariaDB crashes with [ERROR] [FATAL] ... Closed
Relates
relates to MDEV-29694 Remove the InnoDB change buffer Closed
relates to MDEV-32132 DROP INDEX followed by CREATE INDEX m... Closed
relates to MDEV-9663 InnoDB assertion failure: *cursor->in... Closed
relates to MDEV-20934 Infinite loop on innodb_fast_shutdown... Closed
relates to MDEV-22340 Server hangs on ibuf_merge_or_delete_... Closed
relates to MDEV-22986 [ERROR] InnoDB: Record in index was n... Closed
relates to MDEV-27734 Set innodb_change_buffering=none by d... Closed
relates to MDEV-29694 Remove the InnoDB change buffer Closed
relates to MDEV-29905 Change buffer operations fail to chec... Closed
relates to MDEV-29984 innodb_fast_shutdown=0 fails to repor... Closed

 Description   

It looks like MDEV-20934 was reintroduced in MariaDB 10.5, possibly due to MDEV-19514 or some changes.

It turns out that as part of fixing MDEV-23399 (MariaDB 10.5.7), I actually disarmed the regression test innodb.ibuf_not_empty by making the slow shutdown step run with innodb_force_recovery=2, which will prevent a change buffer merge. The following effectively reverts that change:

diff --git a/mysql-test/suite/innodb/t/ibuf_not_empty.test b/mysql-test/suite/innodb/t/ibuf_not_empty.test
index 207427d9ab8..f5203a337bf 100644
--- a/mysql-test/suite/innodb/t/ibuf_not_empty.test
+++ b/mysql-test/suite/innodb/t/ibuf_not_empty.test
@@ -106,7 +106,7 @@ truncate(FILE, $ps * $pages);
 close(FILE) || die "Unable to close $file";
 EOF
 
---let $restart_parameters=
+--let $restart_parameters=--innodb-force-recovery=0
 --source include/start_mysqld.inc
 SET GLOBAL innodb_fast_shutdown=0;
 --source include/restart_mysqld.inc

With that, we will observe a hang. The following includes the progress reporting of MDEV-29984:

10.6 ae6ebafd819d48c965d2615fc78f1f950e0fbf40

...
2022-11-14 16:40:19 0 [ERROR] InnoDB: Trying to read 4096 bytes at 69632 outside the bounds of the file: ./test/t1.ibd
2022-11-14 16:40:19 0 [Note] /dev/shm/10.6m/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
2022-11-14 16:40:19 0 [Note] InnoDB: FTS optimize thread exiting.
2022-11-14 16:40:19 0 [Note] InnoDB: to purge 9 transactions
2022-11-14 16:40:19 0 [Note] InnoDB: Starting shutdown...
2022-11-14 16:40:19 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6m/mysql-test/var/mysqld.1/data/ib_buffer_pool
2022-11-14 16:40:19 0 [Note] InnoDB: Buffer pool(s) dump completed at 221114 16:40:19
2022-11-14 16:40:34 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
2022-11-14 16:40:49 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
2022-11-14 16:41:04 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
2022-11-14 16:41:19 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
221114 16:41:20 [ERROR] mysqld got signal 6 ;

On 10.5, this modified test will cause a crash when the purge of transaction history is attempting to access a corrupted page, which was intentionally corrupted by the test.

I believe that the case that made me file MDEV-29905 is actually about this type of a shutdown hang.



 Comments   
Comment by Thirunarayanan Balathandayuthapani [ 2022-11-15 ]

Patch is in bb-10.6-MDEV-30009

Comment by Marko Mäkelä [ 2022-11-16 ]

Thank you, this is a step to the right direction. I think that we need to enable the logic also in release builds. Furthermore, I think that we must ensure that the purge of transaction history will not add more entries to the change buffer during a slow shutdown.

Comment by Marko Mäkelä [ 2022-11-23 ]

In an rr replay trace that was produced with innodb_change_buffering_debug=1, it turns out that we must remove the change buffer records in the if (!bitmap_bits) code path of ibuf_merge_or_delete_for_page(), to avoid causing corruption. We got an rr replay trace where changes had been buffered for a 3-field index, the index was dropped, created as a 2-field index and stored in the same page. During ibuf_read_merge_pages(), we reset the bitmap bits but skipped the call to ibuf_delete_recs(). Finally, during a change buffer merge, we attempted to merge bogus 3-field records to the 2-field index page and hit an assertion in page_validate() when misinterpreting the records of the new 2-field index page as 3-field ones:

mysqld: /data/Server/bb-10.6-MDEV-30009/storage/innobase/rem/rem0rec.cc:443: void rec_init_offsets_comp_ordinary(const rec_t*, const dict_index_t*, rec_offs*, ulint, const dict_col_t::def_t*, rec_leaf_format) [with bool mblob = false; bool redundant_temp = false; rec_t = unsigned char; rec_offs = short unsigned int; ulint = long unsigned int]: Assertion `index->is_primary()' failed.

This corruption occurred on an .ibd file. While ALTER TABLE…DROP/ADD INDEX was involved in this case, I believe that the corruption is possible even for DML, when lots of records are being deleted.

In MDEV-20934 some code to remove change buffer records was added to this code path, but only in the case when shutdown is running. It was refactored on a merge to 10.5 due to MDEV-19514. The code was later revised in MDEV-23839.

To avoid reintroducing the corruption MDEV-25783, we must thread carefully here. I think that outside shutdown, that code path should only delete the change buffer entries if (block) (in which case we are holding an exclusive latch on the index page). MDEV-25783 was actually caused by some code in ibuf_read_merge_pages() and not ibuf_merge_or_delete_for_page().

Comment by Marko Mäkelä [ 2023-02-17 ]

This bug was featured in https://fosdem.org/2023/schedule/event/innodb_change_buffer/

Generated at Thu Feb 08 10:12:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.