[MDEV-26631] Race condition in the test innodb.doublewrite Created: 2021-09-17  Updated: 2021-11-10  Resolved: 2021-09-27

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, Tests
Affects Version/s: 10.6
Fix Version/s: 10.6.5

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: debug, race, recovery, rr-profile

Issue Links:
Relates
relates to MDEV-27014 InnoDB fails to restore page 0 from t... Closed
relates to MDEV-24626 Remove synchronous write of page0 and... Closed
relates to MDEV-26626 InnoDB fails to advance the log check... Closed

 Description   

MDEV-26626 exposed the following failure on my system, when running a 32-bit build:

./mtr --rr innodb.doublewrite,4k,strict_full_crc32

10.6 106b16a5af7a8e29889e0ef8930bd2901a9ad00e

innodb.doublewrite '4k,innodb,strict_full_crc32' [ fail ]
        Test ended at 2021-09-17 08:28:10
 
CURRENT_TEST: innodb.doublewrite
mysqltest: At line 192: query 'select f1, f2 from t1' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'

If I revert the MDEV-26626 fix, the failure goes away. The test does not fail for me in the 10.5 branch. The cause of the failure appears to be that a write of the page that the test intentionally corrupts was missed before recovery:

10.6 106b16a5af7a8e29889e0ef8930bd2901a9ad00e

2021-09-17  8:28:09 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=58220,58446
2021-09-17  8:28:09 0 [ERROR] InnoDB: Cannot apply log to [page id: space=5, page number=0] of corrupted file './test/t1.ibd'

At first, I was suspecting that the debug variable innodb_buf_flush_list_now is not working reliably, but the function buf_flush_sync() does appear to write out all pages from the buffer pool before the server was killed and restarted.

The failure has not occurred on any CI system yet, and it does not fail reliably for me without rr. Initially, it did fail without rr for the strict_full_crc32 combinations of the 32-bit build.



 Comments   
Comment by Marko Mäkelä [ 2021-09-17 ]

During a run of

./mtr --big-test --parallel=60 --force

I reliably see all strict_full_crc32 combinations failing on my system (with nproc 40), using a 32-bit build, all at that same test line 192:

innodb.doublewrite '32k,innodb,strict_full_crc32' w18 [ fail ]
innodb.doublewrite '64k,innodb,strict_full_crc32' w51 [ fail ]
innodb.doublewrite '16k,innodb,strict_full_crc32' w21 [ fail ]
innodb.doublewrite '4k,innodb,strict_full_crc32' w3 [ fail ]
innodb.doublewrite '8k,innodb,strict_full_crc32' w27 [ fail ]

If I only run ./mtr --parallel=auto innodb.doublewrite (all 2×5 combinations in parallel), they will pass. It definitely is some kind of race condition.

Comment by Marko Mäkelä [ 2021-09-17 ]

The recovery fails here:

868	  ib::error() << "Cannot apply log to " << first
(rr) p deferred_spaces
$1 = {defers = std::map with 1 element = {[5] = {lsn = 58446, file_name = "./test/t1.ibd", deleted = false}}}

In my initial debugging, I failed to notice that the recovered log is not for page number 0 (the page that the test corrupts) but for page number 5. This code and data structure are part of the MDEV-24626 recovery. If we notice a corrupted first page during recovery and the recovered log records do not initialize page 0 of the tablespace, we will fail.
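Roughly, the deferred-recovery decision can be modeled as follows. This is a minimal Python sketch of the logic, not the actual C++ in log0recv.cc; the function and parameter names are illustrative:

```python
# Minimal sketch of the MDEV-24626 deferred-tablespace logic (illustrative
# Python, not the actual InnoDB code). A tablespace whose first page could
# not be read is "deferred"; recovery of it can only proceed if the
# recovered redo log itself initializes page 0 of that tablespace.

def can_recover(deferred_spaces, redo_inits_page0):
    """deferred_spaces: {space_id: {'lsn': int, 'file_name': str, 'deleted': bool}}
    redo_inits_page0: set of space_ids whose page 0 is initialized by redo log.
    Returns the deferred tablespaces for which recovery must give up."""
    failures = []
    for space_id, info in deferred_spaces.items():
        if info['deleted']:
            continue  # a dropped tablespace needs no recovery
        if space_id not in redo_inits_page0:
            # corrupted first page and no record initializing page 0: fail
            failures.append((space_id, info['file_name']))
    return failures

# The situation from the rr trace: space 5 is deferred, but the only
# recovered log record is for page 5, not page 0.
deferred = {5: {'lsn': 58446, 'file_name': './test/t1.ibd', 'deleted': False}}
print(can_recover(deferred, redo_inits_page0=set()))
```

In this model, recovery of t1.ibd fails exactly because no recovered record initializes its page 0.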

That log record was written by the following before the server was killed:

Thread 7 hit Hardware watchpoint 1: -location log_sys.lsn._M_i
 
Old value = 58431
New value = 58446
mtr_t::finish_write (this=<optimized out>, this@entry=0xeebf6744, len=<optimized out>, len@entry=15) at /mariadb/10.6/storage/innobase/include/log0log.ic:290
290		return lsn;
(rr) bt
#0  mtr_t::finish_write (this=<optimized out>, this@entry=0xeebf6744, len=<optimized out>, len@entry=15) at /mariadb/10.6/storage/innobase/include/log0log.ic:290
#1  0x57425c5e in mtr_t::commit (this=0xeebf6744) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:410
#2  0x574a5202 in row_purge_reset_trx_id (node=<optimized out>, mtr=0xeebf6744) at /mariadb/10.6/storage/innobase/row/row0purge.cc:816

Before the first kill, the test did try to ensure that purge will not run until that point:

connect (stop_purge,localhost,root,,);
START TRANSACTION WITH CONSISTENT SNAPSHOT;
connection default;

Adding a wait for purge to complete before the second kill did not prevent the failure after a subsequent kill:

diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test
index cedd2c9942b..f03d16bd680 100644
--- a/mysql-test/suite/innodb/t/doublewrite.test
+++ b/mysql-test/suite/innodb/t/doublewrite.test
@@ -144,6 +144,9 @@ EOF
 
 --source include/start_mysqld.inc
 
+SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
+--source include/wait_all_purged.inc
+
 check table t1;
 select f1, f2 from t1;
 

If I omit that change and run the test with --mysqld=--innodb-force-recovery=2 to disable purge, it will get past that restart and fail in a different way later:

2021-09-17  9:30:20 0 [Note] InnoDB: Starting final batch to recover 1 pages from redo log.
mariadbd: /mariadb/10.6/storage/innobase/log/log0recv.cc:274: log_phys_t::apply_status log_phys_t::apply(const buf_block_t&, uint16_t&) const: Assertion `mach_read_from_4(frame + 4U) == block.page.id().page_no()' failed.

That would be an all-zero frame of page 1 of table t1, which was corrupted in a subsequent test step. The write of that log record originated from debug instrumentation, as expected:

Thread 9 hit Hardware watchpoint 1: -location log_sys.lsn
 
Old value = {<std::__atomic_base<unsigned long long>> = {_M_i = 57715}, <No data fields>}
New value = {<std::__atomic_base<unsigned long long>> = {_M_i = 57721}, <No data fields>}
mtr_t::finish_write (this=<optimized out>, this@entry=0xecfa90e0, len=<optimized out>, len@entry=6) at /mariadb/10.6/storage/innobase/include/log0log.ic:290
290		return lsn;
(rr) bt
#0  mtr_t::finish_write (this=<optimized out>, this@entry=0xecfa90e0, len=<optimized out>, len@entry=6) at /mariadb/10.6/storage/innobase/include/log0log.ic:290
#1  0x573e5c5e in mtr_t::commit (this=0xecfa90e0) at /mariadb/10.6/storage/innobase/mtr/mtr0mtr.cc:410
#2  0x57359d73 in innodb_make_page_dirty (save=0xf2116fd8) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:17447
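The failing assertion checks the page number stored in the frame itself: FIL_PAGE_OFFSET is stored big-endian at byte offset 4 of every InnoDB page, so on an all-zero frame the check trivially fails for any nonzero page number. A small sketch of that check (illustrative Python, not the log0recv.cc code):

```python
# Sketch of the sanity check behind the log_phys_t::apply() assertion
# mach_read_from_4(frame + 4U) == block.page.id().page_no().
# FIL_PAGE_OFFSET (the page number) lives big-endian at byte offset 4.

def mach_read_from_4(buf, offset=0):
    return int.from_bytes(buf[offset:offset + 4], 'big')

page_size = 4096
frame = bytearray(page_size)   # an all-zero frame, as recovered here
expected_page_no = 1           # log is being applied to page 1 of t1

stored_page_no = mach_read_from_4(frame, 4)
print(stored_page_no == expected_page_no)  # False: the assertion fires
```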

Before that page was written, there was a doublewrite batch that covered this page. A copy of the page was written to offset 0x104000 in the system tablespace as well as the final location. There were no other data file writes before the server was killed. We did recover a copy of the page in buf_dblwr_t::init_or_load_pages(), but recovery will not find the tablespace for t1 (tablespace ID 5) at this point:

10.6 a8b16e921e6bb73b1fff5486314b2d89b9d286aa

0x575459d4 in buf_dblwr_t::recover (this=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0dblwr.cc:364
364	    fil_space_t *space= fil_space_t::get(space_id);
Value returned is $11 = (fil_space_t *) 0x0
(rr) bt
#0  0x575459d4 in buf_dblwr_t::recover (this=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0dblwr.cc:364
#1  0x57413ff2 in recv_recovery_from_checkpoint_start (flush_lsn=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:4061
#2  0x56b5df4a in srv_start (create_new_db=<optimized out>) at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1441
#3  0x5739bfad in innodb_init (p=0x5a8fdb88) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:4196
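The relevant part of buf_dblwr_t::recover() can be pictured like this (a simplified Python model under my reading of the trace, not the actual buf0dblwr.cc code): a page copy saved in the doublewrite buffer is only restored if its tablespace is already known to recovery at that point.

```python
# Simplified model of doublewrite recovery (illustrative Python): each
# page copy found in the doublewrite buffer is restored only when
# fil_space_t::get(space_id) finds the tablespace; otherwise the copy
# is left unused, as observed for t1.ibd (space 5) in the trace.

def dblwr_recover(dblwr_pages, known_spaces):
    """dblwr_pages: list of (space_id, page_no, data) copies from the
    doublewrite buffer; known_spaces: {space_id: tablespace object},
    standing in for fil_space_t::get(). Returns (restored, skipped)."""
    restored, skipped = [], []
    for space_id, page_no, data in dblwr_pages:
        space = known_spaces.get(space_id)
        if space is None:
            # tablespace not found (e.g. its load was deferred): skip
            skipped.append((space_id, page_no))
            continue
        restored.append((space_id, page_no))
    return restored, skipped

# Space 5 (t1.ibd) is unknown at this point, so its page copy is skipped.
print(dblwr_recover([(5, 0, b'...')], known_spaces={}))
```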

There was an earlier call to fil_ibd_load() that returned FIL_LOAD_DEFER, but that could be due to an ‘old sin’ that already caused us to fail at line 192 (when I did not disable purge): we apparently fail to restore a copy of page 0 from the doublewrite buffer.

My following work-around attempt did not avoid the assertion failure when applying log to page 1 (the page number stored on the page does not match):

diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test
index cedd2c9942b..fa08d0e09fc 100644
--- a/mysql-test/suite/innodb/t/doublewrite.test
+++ b/mysql-test/suite/innodb/t/doublewrite.test
@@ -209,6 +209,8 @@ insert into t1 values (6, repeat('%', 400));
 --source ../include/no_checkpoint_start.inc
 
 --echo # Make the 2nd page dirty for table t1
+set global innodb_saved_page_number_debug = 0;
+set global innodb_fil_make_page_dirty_debug = @space_id;
 set global innodb_saved_page_number_debug = 1;
 set global innodb_fil_make_page_dirty_debug = @space_id;
 

These failures are possibly related to MDEV-24626, because I could not get this test to fail on 10.5. It looks like we are not restoring the doublewrite buffer correctly in all cases.

Comment by Marko Mäkelä [ 2021-09-17 ]

Removing the following fault injection makes the test pass, both with ./mtr --rr and when running ./mtr --big-test --parallel=60 --force:

diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test
index cedd2c9942b..65499f51c3c 100644
--- a/mysql-test/suite/innodb/t/doublewrite.test
+++ b/mysql-test/suite/innodb/t/doublewrite.test
@@ -182,7 +182,7 @@ open(FILE, "+<", $fname) or die;
 sysread(FILE, $page, $page_size)==$page_size||die "Unable to read $name\n";
 substr($page, 28, 4) = pack("N", 1000);
 sysseek(FILE, 0, 0)||die "Unable to seek $fname\n";
-die unless syswrite(FILE, $page, $page_size) == $page_size;
+#die unless syswrite(FILE, $page, $page_size) == $page_size;
 close FILE;
 EOF
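For reference, the fault injection being disabled above overwrites 4 bytes at offset 28 of the first page with pack("N", 1000), so the page no longer passes validation. The same corruption expressed in Python (run against a scratch file standing in for t1.ibd; the Perl in the test operates on the real data file in place):

```python
# Python equivalent of the Perl fault injection in doublewrite.test
# (illustrative). Writing struct.pack('>I', 1000) -- the same bytes as
# Perl's pack("N", 1000) -- at byte offset 28 clobbers part of the page
# header of the first page.
import os
import struct
import tempfile

def corrupt_first_page(fname, page_size):
    with open(fname, 'r+b') as f:
        page = bytearray(f.read(page_size))
        assert len(page) == page_size, f'Unable to read {fname}'
        page[28:32] = struct.pack('>I', 1000)  # big-endian 1000 = 0x3e8
        f.seek(0)
        f.write(page)

# demo on a scratch file standing in for t1.ibd
fd, path = tempfile.mkstemp()
os.write(fd, bytes(4096))
os.close(fd)
corrupt_first_page(path, 4096)
with open(path, 'rb') as f:
    data = f.read()
print(data[28:32])  # b'\x00\x00\x03\xe8'
os.remove(path)
```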
 

The root cause of all failures could be that MDEV-24626 has broken the recovery of page 0 from the doublewrite buffer.

Comment by Thirunarayanan Balathandayuthapani [ 2021-09-21 ]

Patch is in bb-10.6-MDEV-26631

Comment by Marko Mäkelä [ 2021-09-24 ]

I think that this is a step in the right direction. But a bit more work is needed to ensure that we do not give up recovery in the case where the first page had been partially written without using the doublewrite buffer. (The doublewrite buffer would be skipped when the page was initialized by redo log records.) I suspect that the proposed fix could break the deferred recovery in this case.

Generated at Thu Feb 08 09:46:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.