[MDEV-26631] Race condition in the test innodb.doublewrite - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6
Fix Version/s: 10.6.5
Component/s: Storage Engine - InnoDB, Tests
Labels:
- debug
- race
- recovery
- rr-profile

Description

~~MDEV-26626~~ exposed the following failure on my system, when running a 32-bit build:

./mtr --rr innodb.doublewrite,4k,strict_full_crc32

10.6 106b16a5af7a8e29889e0ef8930bd2901a9ad00e
innodb.doublewrite '4k,innodb,strict_full_crc32' [ fail ]
Test ended at 2021-09-17 08:28:10

CURRENT_TEST: innodb.doublewrite
mysqltest: At line 192: query 'select f1, f2 from t1' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'

If I revert the ~~MDEV-26626~~ fix, the failure goes away. The test does not fail in the 10.5 branch for me. The cause of the failure appears to be that a write of a page that the test is intentionally corrupting was missed before recovery:

10.6 106b16a5af7a8e29889e0ef8930bd2901a9ad00e

2021-09-17  8:28:09 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=58220,58446

2021-09-17  8:28:09 0 [ERROR] InnoDB: Cannot apply log to [page id: space=5, page number=0] of corrupted file './test/t1.ibd'

At first, I was suspecting that the debug variable innodb_buf_flush_list_now is not working reliably, but the function buf_flush_sync() does appear to write out all pages from the buffer pool before the server was killed and restarted.

The failure has not occured on any CI system yet, and it does not fail reliably for me when not using rr. Initially, it did fail without rr for the strict_full_crc32 combinations, for the 32-bit build.

Attachments

Issue Links

relates to

MDEV-27014 InnoDB fails to restore page 0 from the doublewrite buffer

Closed

MDEV-24626 Remove synchronous write of page0 and flushing file during file creation

Closed

MDEV-26626 InnoDB fails to advance the log checkpoint

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2021-09-17 06:13

During a run of

./mtr --big-test --parallel=60 --force

I reliably see all strict_full_crc32 combinations failing on my system (with nproc 40), using a 32-bit build, all in that same test line 192:

innodb.doublewrite '32k,innodb,strict_full_crc32' w18 [ fail ]

…

innodb.doublewrite '64k,innodb,strict_full_crc32' w51 [ fail ]

…

innodb.doublewrite '16k,innodb,strict_full_crc32' w21 [ fail ]

…

innodb.doublewrite '4k,innodb,strict_full_crc32' w3 [ fail ]

…

innodb.doublewrite '8k,innodb,strict_full_crc32' w27 [ fail ]

If I only run ./mtr --parallel=auto innodb.doublewrite (all 2×5 combinations in parallel), they will pass. It definitely is some kind of a race condition.

Marko Mäkelä added a comment - 2021-09-17 06:13 During a run of ./mtr --big-test --parallel=60 --force I reliably see all strict_full_crc32 combinations failing on my system (with nproc 40), using a 32-bit build, all in that same test line 192: innodb.doublewrite '32k,innodb,strict_full_crc32' w18 [ fail ] … innodb.doublewrite '64k,innodb,strict_full_crc32' w51 [ fail ] … innodb.doublewrite '16k,innodb,strict_full_crc32' w21 [ fail ] … innodb.doublewrite '4k,innodb,strict_full_crc32' w3 [ fail ] … innodb.doublewrite '8k,innodb,strict_full_crc32' w27 [ fail ] If I only run ./mtr --parallel=auto innodb.doublewrite (all 2×5 combinations in parallel), they will pass. It definitely is some kind of a race condition.

Marko Mäkelä added a comment - 2021-09-17 07:47

The recovery fails here:

868	  ib::error() << "Cannot apply log to " << first

(rr) p deferred_spaces

$1 = {defers = std::map with 1 element = {[5] = {lsn = 58446, file_name = "./test/t1.ibd", deleted = false}}}

In my initial debugging, I failed to notice that the recovered log is not for page number 0 (the page that is being corrupted by the test) but for page number 5. This code and data structure is part of the ~~MDEV-24626~~ recovery. If we notice a corrupted first page during recovery and the recovered log records will not initialize page 0 of the tablespace, we will fail.

That log record was written by the following before the server was killed:

Thread 7 hit Hardware watchpoint 1: -location log_sys.lsn._M_i

Old value = 58431

New value = 58446

mtr_t::finish_write (this=<optimized out>, this@entry=0xeebf6744, len=<optimized out>, len@entry=15) at /mariadb/10.6/storage/innobase/include/log0log.ic:290

290		return lsn;

(rr) bt

#0  mtr_t::finish_write (this=<optimized out>, this@entry=0xeebf6744, len=<optimized out>, len@entry=15) at /mariadb/10.6/storage/innobase/include/log0log.ic:290

#1  0x57425c5e in mtr_t::commit (this=0xeebf6744) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:410

#2  0x574a5202 in row_purge_reset_trx_id (node=<optimized out>, mtr=0xeebf6744) at /mariadb/10.6/storage/innobase/row/row0purge.cc:816

Before the first kill, the test did try to ensure that purge will not run until that point:

connect (stop_purge,localhost,root,,);

START TRANSACTION WITH CONSISTENT SNAPSHOT;

connection default;

Adding a wait for purge before the second kill did not fix this failure after another kill:

diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test

index cedd2c9942b..f03d16bd680 100644

--- a/mysql-test/suite/innodb/t/doublewrite.test

+++ b/mysql-test/suite/innodb/t/doublewrite.test

@@ -144,6 +144,9 @@ EOF

 --source include/start_mysqld.inc

+SET GLOBAL innodb_purge_rseg_truncate_frequency=1;

+--source include/wait_all_purged.inc

 check table t1;

 select f1, f2 from t1;

If I omit that change and run the test with -mysqld=-innodb-force-recovery=2 to disable purge, it will get past that restart and fail in a different way later:

2021-09-17  9:30:20 0 [Note] InnoDB: Starting final batch to recover 1 pages from redo log.

mariadbd: /mariadb/10.6/storage/innobase/log/log0recv.cc:274: log_phys_t::apply_status log_phys_t::apply(const buf_block_t&, uint16_t&) const: Assertion `mach_read_from_4(frame + 4U) == block.page.id().page_no()' failed.

That would be an all-zero frame of the page 1 of table t1 that was corrupted in a subsequent test step. The write of that log record originated from debug instrumentation, as expected:

Thread 9 hit Hardware watchpoint 1: -location log_sys.lsn

Old value = {<std::__atomic_base<unsigned long long>> = {_M_i = 57715}, <No data fields>}

New value = {<std::__atomic_base<unsigned long long>> = {_M_i = 57721}, <No data fields>}

mtr_t::finish_write (this=<optimized out>, this@entry=0xecfa90e0, len=<optimized out>, len@entry=6) at /mariadb/10.6/storage/innobase/include/log0log.ic:290

290		return lsn;

(rr) bt

#0  mtr_t::finish_write (this=<optimized out>, this@entry=0xecfa90e0, len=<optimized out>, len@entry=6) at /mariadb/10.6/storage/innobase/include/log0log.ic:290

#1  0x573e5c5e in mtr_t::commit (this=0xecfa90e0) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:410

#2  0x57359d73 in innodb_make_page_dirty (save=0xf2116fd8) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:17447

Before that page was written, there was a doublewrite batch that covered this page. A copy of the page was written to offset 0x104000 in the system tablespace as well as the final location. There were no other data file writes before the server was killed. We did recover a copy of the page in buf_dblwr_t::init_or_load_pages(), but recovery will not find the tablespace for t1 (tablespace ID 5) at this point:

10.6 a8b16e921e6bb73b1fff5486314b2d89b9d286aa
0x575459d4 in buf_dblwr_t::recover (this=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0dblwr.cc:364
364 fil_space_t *space= fil_space_t::get(space_id);
Value returned is $11 = (fil_space_t *) 0x0
(rr) bt
#0 0x575459d4 in buf_dblwr_t::recover (this=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0dblwr.cc:364
#1 0x57413ff2 in recv_recovery_from_checkpoint_start (flush_lsn=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:4061
#2 0x56b5df4a in srv_start (create_new_db=<optimized out>) at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1441
#3 0x5739bfad in innodb_init (p=0x5a8fdb88) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:4196

There was an earlier call to fil_ibd_load() that returned FIL_LOAD_DEFER, but that could be due to an ‘old sin’ that caused us to fail at line 192 already (if I didn’t disable purge): we would apparently fail to restore a copy of page 0 from the doublewrite buffer.

My following work-around attempt did not avoid an assertion failure for applying log to page 1 (with page number mismatch on the page):

diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test

index cedd2c9942b..fa08d0e09fc 100644

--- a/mysql-test/suite/innodb/t/doublewrite.test

+++ b/mysql-test/suite/innodb/t/doublewrite.test

@@ -209,6 +209,8 @@ insert into t1 values (6, repeat('%', 400));

 --source ../include/no_checkpoint_start.inc

 --echo # Make the 2nd page dirty for table t1

+set global innodb_saved_page_number_debug = 0;

+set global innodb_fil_make_page_dirty_debug = @space_id;

 set global innodb_saved_page_number_debug = 1;

 set global innodb_fil_make_page_dirty_debug = @space_id;

These failures are possibly related to ~~MDEV-24626~~, because I could not get this test to fail on 10.5. It looks like we are not restoring the doublewrite buffer correctly in all cases.

Marko Mäkelä added a comment - 2021-09-17 07:47 The recovery fails here: 868 ib::error() << "Cannot apply log to " << first (rr) p deferred_spaces $1 = {defers = std::map with 1 element = {[5] = {lsn = 58446, file_name = "./test/t1.ibd", deleted = false}}} In my initial debugging, I failed to notice that the recovered log is not for page number 0 (the page that is being corrupted by the test) but for page number 5. This code and data structure is part of the MDEV-24626 recovery. If we notice a corrupted first page during recovery and the recovered log records will not initialize page 0 of the tablespace, we will fail. That log record was written by the following before the server was killed: Thread 7 hit Hardware watchpoint 1: -location log_sys.lsn._M_i Old value = 58431 New value = 58446 mtr_t::finish_write (this=<optimized out>, this@entry=0xeebf6744, len=<optimized out>, len@entry=15) at /mariadb/10.6/storage/innobase/include/log0log.ic:290 290 return lsn; (rr) bt #0 mtr_t::finish_write (this=<optimized out>, this@entry=0xeebf6744, len=<optimized out>, len@entry=15) at /mariadb/10.6/storage/innobase/include/log0log.ic:290 #1 0x57425c5e in mtr_t::commit (this=0xeebf6744) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:410 #2 0x574a5202 in row_purge_reset_trx_id (node=<optimized out>, mtr=0xeebf6744) at /mariadb/10.6/storage/innobase/row/row0purge.cc:816 Before the first kill, the test did try to ensure that purge will not run until that point: connect (stop_purge,localhost,root,,); START TRANSACTION WITH CONSISTENT SNAPSHOT; connection default ; Adding a wait for purge before the second kill did not fix this failure after another kill: diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test index cedd2c9942b..f03d16bd680 100644 --- a/mysql-test/suite/innodb/t/doublewrite.test +++ b/mysql-test/suite/innodb/t/doublewrite.test @@ -144,6 +144,9 @@ EOF --source include/start_mysqld.inc +SET GLOBAL innodb_purge_rseg_truncate_frequency=1; +--source include/wait_all_purged.inc + check table t1; select f1, f2 from t1; If I omit that change and run the test with - mysqld= -innodb-force-recovery=2 to disable purge, it will get past that restart and fail in a different way later: 2021-09-17 9:30:20 0 [Note] InnoDB: Starting final batch to recover 1 pages from redo log. mariadbd: /mariadb/10.6/storage/innobase/log/log0recv.cc:274: log_phys_t::apply_status log_phys_t::apply(const buf_block_t&, uint16_t&) const: Assertion `mach_read_from_4(frame + 4U) == block.page.id().page_no()' failed. That would be an all-zero frame of the page 1 of table t1 that was corrupted in a subsequent test step. The write of that log record originated from debug instrumentation, as expected: Thread 9 hit Hardware watchpoint 1: -location log_sys.lsn Old value = {<std::__atomic_base<unsigned long long>> = {_M_i = 57715}, <No data fields>} New value = {<std::__atomic_base<unsigned long long>> = {_M_i = 57721}, <No data fields>} mtr_t::finish_write (this=<optimized out>, this@entry=0xecfa90e0, len=<optimized out>, len@entry=6) at /mariadb/10.6/storage/innobase/include/log0log.ic:290 290 return lsn; (rr) bt #0 mtr_t::finish_write (this=<optimized out>, this@entry=0xecfa90e0, len=<optimized out>, len@entry=6) at /mariadb/10.6/storage/innobase/include/log0log.ic:290 #1 0x573e5c5e in mtr_t::commit (this=0xecfa90e0) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:410 #2 0x57359d73 in innodb_make_page_dirty (save=0xf2116fd8) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:17447 Before that page was written, there was a doublewrite batch that covered this page. A copy of the page was written to offset 0x104000 in the system tablespace as well as the final location. There were no other data file writes before the server was killed. We did recover a copy of the page in buf_dblwr_t::init_or_load_pages() , but recovery will not find the tablespace for t1 (tablespace ID 5) at this point: 10.6 a8b16e921e6bb73b1fff5486314b2d89b9d286aa 0x575459d4 in buf_dblwr_t::recover (this=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0dblwr.cc:364 364 fil_space_t *space= fil_space_t::get(space_id); Value returned is $11 = (fil_space_t *) 0x0 (rr) bt #0 0x575459d4 in buf_dblwr_t::recover (this=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0dblwr.cc:364 #1 0x57413ff2 in recv_recovery_from_checkpoint_start (flush_lsn=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:4061 #2 0x56b5df4a in srv_start (create_new_db=<optimized out>) at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1441 #3 0x5739bfad in innodb_init (p=0x5a8fdb88) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:4196 There was an earlier call to fil_ibd_load() that returned FIL_LOAD_DEFER , but that could be due to an ‘old sin’ that caused us to fail at line 192 already (if I didn’t disable purge): we would apparently fail to restore a copy of page 0 from the doublewrite buffer. My following work-around attempt did not avoid an assertion failure for applying log to page 1 (with page number mismatch on the page): diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test index cedd2c9942b..fa08d0e09fc 100644 --- a/mysql-test/suite/innodb/t/doublewrite.test +++ b/mysql-test/suite/innodb/t/doublewrite.test @@ -209,6 +209,8 @@ insert into t1 values (6, repeat('%', 400)); --source ../include/no_checkpoint_start.inc --echo # Make the 2nd page dirty for table t1 +set global innodb_saved_page_number_debug = 0; +set global innodb_fil_make_page_dirty_debug = @space_id; set global innodb_saved_page_number_debug = 1; set global innodb_fil_make_page_dirty_debug = @space_id; These failures are possibly related to MDEV-24626 , because I could not get this test to fail on 10.5. It looks like we are not restoring the doublewrite buffer correctly in all cases.

Marko Mäkelä added a comment - 2021-09-17 08:10

Removing the following fault injection makes the test pass, both with ./mtr --rr and when running ./mtr --big-test --parallel=60 --force:

diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test

index cedd2c9942b..65499f51c3c 100644

--- a/mysql-test/suite/innodb/t/doublewrite.test

+++ b/mysql-test/suite/innodb/t/doublewrite.test

@@ -182,7 +182,7 @@ open(FILE, "+<", $fname) or die;

 sysread(FILE, $page, $page_size)==$page_size||die "Unable to read $name\n";

 substr($page, 28, 4) = pack("N", 1000);

 sysseek(FILE, 0, 0)||die "Unable to seek $fname\n";

-die unless syswrite(FILE, $page, $page_size) == $page_size;

+#die unless syswrite(FILE, $page, $page_size) == $page_size;

 close FILE;

EOF

The root cause of all failures could be that ~~MDEV-24626~~ has broken the recovery of page 0 from the doublewrite buffer.

Marko Mäkelä added a comment - 2021-09-17 08:10 Removing the following fault injection makes the test pass, both with ./mtr --rr and when running ./mtr --big-test --parallel=60 --force : diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test index cedd2c9942b..65499f51c3c 100644 --- a/mysql-test/suite/innodb/t/doublewrite.test +++ b/mysql-test/suite/innodb/t/doublewrite.test @@ -182,7 +182,7 @@ open(FILE, "+<", $fname) or die; sysread(FILE, $page, $page_size)==$page_size||die "Unable to read $name\n"; substr($page, 28, 4) = pack("N", 1000); sysseek(FILE, 0, 0)||die "Unable to seek $fname\n"; -die unless syswrite(FILE, $page, $page_size) == $page_size; +#die unless syswrite(FILE, $page, $page_size) == $page_size; close FILE; EOF The root cause of all failures could be that MDEV-24626 has broken the recovery of page 0 from the doublewrite buffer.

Thirunarayanan Balathandayuthapani added a comment - 2021-09-21 08:02

Patch is in bb-10.6-~~MDEV-26631~~

Thirunarayanan Balathandayuthapani added a comment - 2021-09-21 08:02 Patch is in bb-10.6- MDEV-26631

Marko Mäkelä added a comment - 2021-09-24 06:02

I think that this is a step to the right direction. But, a bit more work is needed in order to ensure that we are not giving up recovery in the case that the first page had been partially written without using the doublewrite buffer. (Doublewrite buffer would be skipped when the page was initialized by redo log records.) I suspect that the proposed fix could break the deferred recovery in this case.

Marko Mäkelä added a comment - 2021-09-24 06:02 I think that this is a step to the right direction. But, a bit more work is needed in order to ensure that we are not giving up recovery in the case that the first page had been partially written without using the doublewrite buffer. (Doublewrite buffer would be skipped when the page was initialized by redo log records.) I suspect that the proposed fix could break the deferred recovery in this case.

People

Assignee:: Thirunarayanan Balathandayuthapani

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2021-09-17 06:00

Updated:: 2021-11-10 11:36

Resolved:: 2021-09-27 05:40

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server