[MDEV-30009] InnoDB shutdown hangs when the change buffer is corrupted - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
Fix Version/s: 10.11.2, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3
Component/s: Storage Engine - InnoDB
Labels:
- regression-10.5

Description

It looks like ~~MDEV-20934~~ was reintroduced in MariaDB 10.5, possibly due to ~~MDEV-19514~~ or some changes.

It turns out that as part of fixing ~~MDEV-23399~~ (MariaDB 10.5.7), I actually disarmed the regression test innodb.ibuf_not_empty by making the slow shutdown step run with innodb_force_recovery=2, which will prevent a change buffer merge. The following effectively reverts that change:

diff --git a/mysql-test/suite/innodb/t/ibuf_not_empty.test b/mysql-test/suite/innodb/t/ibuf_not_empty.test

index 207427d9ab8..f5203a337bf 100644

--- a/mysql-test/suite/innodb/t/ibuf_not_empty.test

+++ b/mysql-test/suite/innodb/t/ibuf_not_empty.test

@@ -106,7 +106,7 @@ truncate(FILE, $ps * $pages);

 close(FILE) || die "Unable to close $file";

EOF

---let $restart_parameters=

+--let $restart_parameters=--innodb-force-recovery=0

 --source include/start_mysqld.inc

 SET GLOBAL innodb_fast_shutdown=0;

 --source include/restart_mysqld.inc

With that, we will observe a hang. The following includes the progress reporting of ~~MDEV-29984~~:

10.6 ae6ebafd819d48c965d2615fc78f1f950e0fbf40
...
2022-11-14 16:40:19 0 [ERROR] InnoDB: Trying to read 4096 bytes at 69632 outside the bounds of the file: ./test/t1.ibd
2022-11-14 16:40:19 0 [Note] /dev/shm/10.6m/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
2022-11-14 16:40:19 0 [Note] InnoDB: FTS optimize thread exiting.
2022-11-14 16:40:19 0 [Note] InnoDB: to purge 9 transactions
2022-11-14 16:40:19 0 [Note] InnoDB: Starting shutdown...
2022-11-14 16:40:19 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6m/mysql-test/var/mysqld.1/data/ib_buffer_pool
2022-11-14 16:40:19 0 [Note] InnoDB: Buffer pool(s) dump completed at 221114 16:40:19
2022-11-14 16:40:34 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
2022-11-14 16:40:49 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
2022-11-14 16:41:04 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
2022-11-14 16:41:19 0 [Note] Completing change buffer merge; 409 page reads initiated; 1 change buffer pages remain
221114 16:41:20 [ERROR] mysqld got signal 6 ;

On 10.5, this modified test will cause a crash when the purge of transaction history is attempting to access a corrupted page, which was intentionally corrupted by the test.

I believe that the case that made me file ~~MDEV-29905~~ is actually about this type of a shutdown hang.

Attachments

Issue Links

blocks

MDEV-26917 InnoDB: Clustered record for sec rec not found index

Closed

is duplicated by

MDEV-31065 MariaDB crashes with [ERROR] [FATAL] InnoDB: Page old data size 8483 new data size 7800, page old max ins size 7519 new max ins size 8202

Closed

relates to

MDEV-29694 Remove the InnoDB change buffer

Closed

MDEV-32132 DROP INDEX followed by CREATE INDEX may corrupt data

Closed

MDEV-9663 InnoDB assertion failure: *cursor->index->name == TEMP_INDEX_PREFIX, or !cursor->index->is_committed()

Closed

MDEV-20934 Infinite loop on innodb_fast_shutdown=0 with inconsistent change buffer

Closed

MDEV-22340 Server hangs on ibuf_merge_or_delete_for_page during shutdown with innodb_fast_shutdown=0

Closed

MDEV-22986 [ERROR] InnoDB: Record in index was not found on update in mysqld.log on slave

Closed

MDEV-27734 Set innodb_change_buffering=none by default

Closed

MDEV-29694 Remove the InnoDB change buffer

Closed

MDEV-29905 Change buffer operations fail to check for log file overflow

Closed

MDEV-29984 innodb_fast_shutdown=0 fails to report change buffer merge progress

Closed

(7 relates to)

Activity

Ascending order - Click to sort in descending order

Thirunarayanan Balathandayuthapani added a comment - 2022-11-15 16:42

Patch is in bb-10.6-MDEV-30009

Thirunarayanan Balathandayuthapani added a comment - 2022-11-15 16:42 Patch is in bb-10.6- MDEV-30009

Marko Mäkelä added a comment - 2022-11-16 08:33

Thank you, this is a step to the right direction. I think that we need to enable the logic also in release builds. Furthermore, I think that we must ensure that the purge of transaction history will not add more entries to the change buffer during a slow shutdown.

Marko Mäkelä added a comment - 2022-11-16 08:33 Thank you, this is a step to the right direction. I think that we need to enable the logic also in release builds. Furthermore, I think that we must ensure that the purge of transaction history will not add more entries to the change buffer during a slow shutdown.

Marko Mäkelä added a comment - 2022-11-23 08:57

In an rr replay trace that was produced with innodb_change_buffering_debug=1, it turns out that we must remove the change buffer records in the if (!bitmap_bits) code path of ibuf_merge_or_delete_for_page(), to avoid causing corruption. We got an rr replay trace where changes had been buffered for a 3-field index, the index was dropped, created as a 2-field index and stored in the same page. During ibuf_read_merge_pages(), we reset the bitmap bits but skipped the call to ibuf_delete_recs(). Finally, during a change buffer merge, we attempted to merge bogus 3-field records to the 2-field index page and hit an assertion in page_validate() when misinterpreting the records of the new 2-field index page as 3-field ones:

mysqld: /data/Server/bb-10.6-MDEV-30009/storage/innobase/rem/rem0rec.cc:443: void rec_init_offsets_comp_ordinary(const rec_t*, const dict_index_t*, rec_offs*, ulint, const dict_col_t::def_t*, rec_leaf_format) [with bool mblob = false; bool redundant_temp = false; rec_t = unsigned char; rec_offs = short unsigned int; ulint = long unsigned int]: Assertion `index->is_primary()' failed.

This corruption occurred on an .ibd file. While ALTER TABLE…DROP/ADD INDEX was involved in this case, I believe that the corruption is possible even for DML, when lots of records are being deleted.

In ~~MDEV-20934~~ some code to remove change buffer records was added to this code path, but only in the case when shutdown is running. It was refactored on a merge to 10.5 due to ~~MDEV-19514~~. The code was later revised in ~~MDEV-23839~~.

To avoid reintroducing the corruption ~~MDEV-25783~~, we must thread carefully here. I think that outside shutdown, that code path should only delete the change buffer entries if (block) (in which case we are holding an exclusive latch on the index page). ~~MDEV-25783~~ was actually caused by some code in ibuf_read_merge_pages() and not ibuf_merge_or_delete_for_page().

Marko Mäkelä added a comment - 2022-11-23 08:57 In an rr replay trace that was produced with innodb_change_buffering_debug=1 , it turns out that we must remove the change buffer records in the if (!bitmap_bits) code path of ibuf_merge_or_delete_for_page() , to avoid causing corruption. We got an rr replay trace where changes had been buffered for a 3-field index, the index was dropped, created as a 2-field index and stored in the same page. During ibuf_read_merge_pages() , we reset the bitmap bits but skipped the call to ibuf_delete_recs() . Finally, during a change buffer merge, we attempted to merge bogus 3-field records to the 2-field index page and hit an assertion in page_validate() when misinterpreting the records of the new 2-field index page as 3-field ones: mysqld: /data/Server/bb-10.6-MDEV-30009/storage/innobase/rem/rem0rec.cc:443: void rec_init_offsets_comp_ordinary(const rec_t*, const dict_index_t*, rec_offs*, ulint, const dict_col_t::def_t*, rec_leaf_format) [with bool mblob = false; bool redundant_temp = false; rec_t = unsigned char; rec_offs = short unsigned int; ulint = long unsigned int]: Assertion `index->is_primary()' failed. This corruption occurred on an .ibd file. While ALTER TABLE…DROP/ADD INDEX was involved in this case, I believe that the corruption is possible even for DML, when lots of records are being deleted. In MDEV-20934 some code to remove change buffer records was added to this code path, but only in the case when shutdown is running. It was refactored on a merge to 10.5 due to MDEV-19514 . The code was later revised in MDEV-23839 . To avoid reintroducing the corruption MDEV-25783 , we must thread carefully here. I think that outside shutdown, that code path should only delete the change buffer entries if (block) (in which case we are holding an exclusive latch on the index page). MDEV-25783 was actually caused by some code in ibuf_read_merge_pages() and not ibuf_merge_or_delete_for_page() .

Marko Mäkelä added a comment - 2023-02-17 14:00

This bug was featured in https://fosdem.org/2023/schedule/event/innodb_change_buffer/

Marko Mäkelä added a comment - 2023-02-17 14:00 This bug was featured in https://fosdem.org/2023/schedule/event/innodb_change_buffer/

MariaDB Server

InnoDB shutdown hangs when the change buffer is corrupted

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration