[MDEV-20934] Infinite loop on innodb_fast_shutdown=0 with inconsistent change buffer - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.5(EOL), 10.0(EOL), 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL)
Fix Version/s: 10.2.29, 10.3.20, 10.4.10
Component/s: Storage Engine - InnoDB, Storage Engine - XtraDB
Labels:
- corruption
- shutdown

Description

Due to a data corruption bug in the past (such as MySQL Bug #69122, "InnoDB doesn't redo-log insert buffer merge operation if it is done in-place"), it seems possible that the InnoDB change buffer ends up containing entries even though the change buffer bitmap pages in the .ibd files indicate that no buffered changes exist.

The logic on slow shutdown would proceed as follows:

1. ibuf_merge_pages() calls btr_pcur_open_at_rnd_pos(), which finds a change buffer leaf page.
2. Page numbers are read from the change buffer records.
3. Read requests for those pages are posted.
4. On read completion, ibuf_merge_or_delete_for_page() is invoked.
5. Alas, the bitmap page in the .ibd file says that there are no buffered changes, so nothing is done.
6. Because the 'orphan' records for the page were not deleted from the change buffer, this keeps looping.
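
The looping behaviour described above can be illustrated with a small toy model. This is only a sketch and not InnoDB code: ToyChangeBuffer, toy_merge_or_delete_for_page() and toy_slow_shutdown() are hypothetical names, the change buffer is reduced to a map keyed by page number, and the first page in the map is picked instead of a random position. The max_rounds cap exists only so that the demonstration terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

// Toy model only; every name here is hypothetical, not an InnoDB symbol.
struct ToyChangeBuffer {
    // page number -> count of buffered records for that page
    std::map<std::uint32_t, int> records;
    // pages whose bitmap bit claims "buffered changes exist"
    std::set<std::uint32_t> bitmap_buffered;
};

// Mimics the early return: records are merged (and removed) only when
// the bitmap bit for the page is set.
void toy_merge_or_delete_for_page(ToyChangeBuffer& ibuf, std::uint32_t page_no) {
    if (ibuf.bitmap_buffered.count(page_no) == 0) {
        return;  // "No inserts buffered for this page", says the bitmap
    }
    ibuf.records.erase(page_no);
    ibuf.bitmap_buffered.erase(page_no);
    assert(ibuf.records.count(page_no) == 0);
}

// Slow shutdown keeps merging until the change buffer is empty.
// Returns the number of rounds executed; a result equal to max_rounds
// with a non-empty buffer means that the loop made no progress.
std::size_t toy_slow_shutdown(ToyChangeBuffer& ibuf, std::size_t max_rounds) {
    std::size_t rounds = 0;
    while (!ibuf.records.empty() && rounds < max_rounds) {
        // Stand-in for the random positioning: take the first page.
        const std::uint32_t page_no = ibuf.records.begin()->first;
        toy_merge_or_delete_for_page(ibuf, page_no);
        ++rounds;
    }
    return rounds;
}
```

With one consistent page and one orphan page (records present, bitmap bit clear), the consistent page is merged in the first round and every later round revisits the orphan without removing it, so the loop only stops at the artificial cap.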

To fix this, I think that we should change the following code in ibuf_merge_or_delete_for_page():

			if (!bitmap_bits) {
				/* No inserts buffered for this page */
				fil_space_release(space);
				return;
			}

Before returning, we should check if slow shutdown is in progress. If yes, we should attempt to delete any change buffer entries for page_id. We should not try this during normal operation, because it would cause a lot of unnecessary work.
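
A minimal sketch of that idea, again on a toy model rather than the real ibuf code: the slow_shutdown flag and all Toy* names are assumptions made for illustration, and the sketch does not claim to match the fix that was eventually committed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Toy model only; every name here is hypothetical, not an InnoDB symbol.
struct ToyChangeBuffer {
    std::map<std::uint32_t, int> records;      // page number -> buffered records
    std::set<std::uint32_t> bitmap_buffered;   // pages with the bitmap bit set
};

// Returns true if any change buffer records for page_no were removed.
bool toy_merge_or_delete_for_page(ToyChangeBuffer& ibuf,
                                  std::uint32_t page_no,
                                  bool slow_shutdown) {
    if (ibuf.bitmap_buffered.count(page_no) == 0) {
        if (!slow_shutdown) {
            // Normal operation: trust the bitmap and avoid the extra
            // lookup, which would be wasted work in the common case.
            return false;
        }
        // Slow shutdown: purge any orphan records so that the
        // change buffer can actually become empty.
        return ibuf.records.erase(page_no) > 0;
    }
    // Bitmap bit is set: "merge" the records and clear the bit.
    const bool removed = ibuf.records.erase(page_no) > 0;
    ibuf.bitmap_buffered.erase(page_no);
    assert(ibuf.records.count(page_no) == 0);
    return removed;
}
```

In this model an orphan entry survives a call made during normal operation but is dropped once the same call is made with slow_shutdown set, which is what allows a shutdown loop over the change buffer to terminate.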

Attachments

stack.txt (37 kB, 2020-04-22 13:07)

Issue Links

causes

- MDEV-23839 innodb_fast_shutdown=0 hang on change buffer merge (Closed)
- MDEV-25783 CHECK TABLE harvests InnoDB: Index 'abdcef' contains 10001 entries, should be 10000 (Closed)

relates to

- MDEV-21069 Crash on DROP TABLE if the data file is corrupted (Closed)
- MDEV-21152 Bogus debug assertion btr_pcur_is_after_last_in_tree() in ibuf code (Closed)
- MDEV-24449 Corruption of system tablespace or last recovered page (Closed)
- MDEV-30009 InnoDB shutdown hangs when the change buffer is corrupted (Closed)
- MDEV-32132 DROP INDEX followed by CREATE INDEX may corrupt data (Closed)
- MDEV-20864 Introduce debug option innodb_change_buffer_dump (Closed)

(3 relates to)

Activity

Marko Mäkelä added a comment - 2020-05-07 06:26

I found a likely cause of the scenario that caused change buffer merge to hang. In ibuf_insert_low() we update the change buffer bitmap in a separate mini-transaction, ahead of writing the data to the change buffer:

	/* Set the bitmap bit denoting that the insert buffer contains
	buffered entries for this index page, if the bit is not set yet */
	old_bit_value = ibuf_bitmap_page_get_bits(bitmap_page, page_no,
					IBUF_BITMAP_BUFFERED, &bitmap_mtr);
	if (!old_bit_value) {
		ibuf_bitmap_page_set_bits(bitmap_page, page_no,
				IBUF_BITMAP_BUFFERED, TRUE, &bitmap_mtr);
	}
	mtr_commit(&bitmap_mtr);

The above was introduced with the initial commit of InnoDB into MySQL 3.23.34. If the server is killed or a backup is finished between the logical time of the commit of bitmap_mtr and the subsequent mini-transaction commit that inserts the record into the change buffer, then we will have the bitmap page indicating that there exist buffered changes for a page, although none might actually exist.

I do not think that this non-atomicity can be fixed, so the change buffer merge will have to deal with this situation.
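
The window between the two mini-transaction commits can be sketched with a toy model. Nothing below is InnoDB code: ToyPersistentState, CrashPoint, toy_insert_low() and toy_is_consistent() are hypothetical names, and a "crash" is simulated simply by returning before the second step.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Toy model only; every name here is hypothetical, not an InnoDB symbol.
enum class CrashPoint { None, AfterBitmapCommit };

struct ToyPersistentState {
    std::set<std::uint32_t> bitmap_buffered;   // pages with the bitmap bit set
    std::map<std::uint32_t, int> records;      // page number -> buffered records
};

// The bitmap bit and the change buffer records agree for page_no.
bool toy_is_consistent(const ToyPersistentState& disk, std::uint32_t page_no) {
    const bool bit_set = disk.bitmap_buffered.count(page_no) != 0;
    const bool has_records = disk.records.count(page_no) != 0;
    return bit_set == has_records;
}

// Returns true if both steps were made durable.
bool toy_insert_low(ToyPersistentState& disk, std::uint32_t page_no,
                    CrashPoint crash) {
    // First mini-transaction: set the bitmap bit and commit.
    disk.bitmap_buffered.insert(page_no);
    if (crash == CrashPoint::AfterBitmapCommit) {
        // Server killed, or backup finished, inside the window.
        return false;
    }
    // Second mini-transaction: insert the record into the change buffer.
    ++disk.records[page_no];
    assert(toy_is_consistent(disk, page_no));
    return true;
}
```

An interrupted call leaves the bitmap bit set for a page that has no change buffer record, so any merge logic that reads this state has to tolerate the bitmap and the change buffer disagreeing.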

Bernardo Perez added a comment - 2020-05-07 06:49

Thanks a lot for the update Marko.

Does that update also refer to https://jira.mariadb.org/browse/MDEV-22340 ?

Thanks

Marko Mäkelä added a comment - 2021-03-05 16:49

I think that MDEV-24449 is a rather likely cause of corrupting not only the change buffer, but also the system tablespace and any secondary index leaf page in user tables.

Marko Mäkelä added a comment - 2022-11-10 14:08

Unfortunately, an attempt to fix this corruption caused further corruption related to the change buffer in MariaDB Server 10.5 or later; see MDEV-25783.

We recently reproduced this type of scenario in house, and we are working on a better fix.

Marko Mäkelä added a comment - 2022-11-14 14:52

I filed MDEV-30009 for a 10.5 regression where the slow shutdown hangs when this type of corruption is present.

People

Assignee: Marko Mäkelä

Reporter: Marko Mäkelä

Votes: 1

Watchers: 3

Dates

Created: 2019-10-31 12:34

Updated: 2024-07-07 22:33

Resolved: 2019-11-06 13:23
