[MDEV-22889] InnoDB occasionally breaks the isolation of a recovered transaction - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.5
Component/s: Storage Engine - InnoDB
Labels:

Description

The test case innodb.innodb_force_recovery_rollback that was added for ~~MDEV-21217~~ should have only two possible outcomes for the locking SELECT statement:

The statement is blocked if the fix of ~~MDEV-21217~~ is missing, and the test will eventually fail with a lock wait timeout
The lock conflict will ensure that the statement will execute after the rollback has completed, and an empty table will be observed.

What occasionally happens is that the locking SELECT will return nonempty contents for the table. I was not able to repeat it in my local environment, but it did occur on buildbot for several branches and builders.

Such breakage of ACID is serious, and we must find and fix the reason.

I believe that we must execute something like

./mtr innodb.innodb_force_recovery_rollback

so that the server after the restart runs under rr record, so that we will get an execution trace of the failure for debugging.

Here is an example of a failure: http://buildbot.askmonty.org/buildbot/builders/kvm-zyp-opensuse150-amd64/builds/3589/steps/mtr/logs/stdio

10.5-ish 70a3e3ef552c0c65248151daaa45a2e978cfe86c
CURRENT_TEST: innodb.innodb_force_recovery_rollback
--- /usr/share/mysql-test/suite/innodb/r/innodb_force_recovery_rollback.result 2020-06-13 16:37:19.000000000 +0000
+++ /dev/shm/var/4/log/innodb_force_recovery_rollback.reject 2020-06-13 20:35:22.444004488 +0000
@@ -15,4 +15,1004 @@
connection default;
SELECT * FROM t0 LOCK IN SHARE MODE;
a
+1
+2
+3
+4
…
+997
+998
+999
+1000
DROP TABLE t0,t1;

mysqltest: Result length mismatch

Apparently, only the Isolation is being being violated: the locking read will return all 1000 records that the incomplete transaction had inserted.

It is also theoretically possible that the incomplete transaction that inserted those 1,000 records was unexpectedly committed when the server was killed.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

MDEV-22889.test
2020-07-28 16:17
1 kB
Matthias Leich

Issue Links

relates to

MDEV-21217 innodb_force_recovery=2 may wrongly abort the rollback of recovered transactions

Closed

MDEV-23339 innodb_force_recovery=2 may still abort the rollback of recovered transactions

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2020-07-30 04:44

mleich, sorry, but it seems to me that MDEV-22889.test is not testing the scenario that I reported (the query would return nonempty table contents). It is only reporting a lock wait timeout.

10.2 fe39d02f51b96536dccca7ff89faf05e13548877ish
Thread 2 hit Breakpoint 1, lock_wait_suspend_thread (
thr=thr@entry=0x4d9f68132c18)
at /home/mleich/bb-10.2-MDEV-14711/storage/innobase/lock/lock0wait.cc:231
231 {
(rr) p thr.graph.trx.mysql_thd.query_string
$1 = {string = {str = 0x4d9f68013950 "SELECT * FROM t0 LOCK IN SHARE MODE",
length = 35}, cs = 0x559b2746d840 <my_charset_latin1>}
(rr) when
Current event: 98001

The reason for this conflict is that in trx_sys->rw_trx_list there is exactly one transaction, and trx_sys->rw_trx_list->start.recovered holds. Notably, trx_roll_crash_recv_trx=NULL, that is, the transaction is not being rolled back. An attempt to roll back all transactions was made earlier:

Thread 10 hit Breakpoint 2, trx_rollback_or_clean_recovered (all=all@entry=1) at /home/mleich/bb-10.2-MDEV-14711/storage/innobase/trx/trx0roll.cc:799

799	{

(rr) when

Current event: 81974

The recovery reached the following code, whose intention is to abort the rollback of recovered transactions if shutdown has been initiated:

	case TRX_STATE_ACTIVE:

		if (!srv_is_being_started

		    && !srv_undo_sources && srv_fast_shutdown) {

fake_prepared:

			trx->state = TRX_STATE_PREPARED;

			*all = FALSE;

			goto func_exit;

But, at this point we are still starting up, not shutting down! This means that the fix of ~~MDEV-21217~~ is incomplete and we should have replaced the !srv_is_being_started with srv_shutdown_state != SRV_SHUTDOWN_NONE also in this function.

I filed this separate bug as ~~MDEV-23339~~. This reported bug is still open and an rr replay trace will be needed for the case where the SELECT is returning nonempty table contents.

Marko Mäkelä added a comment - 2020-07-30 04:44 mleich , sorry, but it seems to me that MDEV-22889.test is not testing the scenario that I reported (the query would return nonempty table contents). It is only reporting a lock wait timeout. 10.2 fe39d02f51b96536dccca7ff89faf05e13548877ish Thread 2 hit Breakpoint 1, lock_wait_suspend_thread ( thr=thr@entry=0x4d9f68132c18) at /home/mleich/bb-10.2-MDEV-14711/storage/innobase/lock/lock0wait.cc:231 231 { (rr) p thr.graph.trx.mysql_thd.query_string $1 = {string = {str = 0x4d9f68013950 "SELECT * FROM t0 LOCK IN SHARE MODE", length = 35}, cs = 0x559b2746d840 <my_charset_latin1>} (rr) when Current event: 98001 The reason for this conflict is that in trx_sys->rw_trx_list there is exactly one transaction, and trx_sys->rw_trx_list->start.recovered holds. Notably, trx_roll_crash_recv_trx=NULL , that is, the transaction is not being rolled back. An attempt to roll back all transactions was made earlier: Thread 10 hit Breakpoint 2, trx_rollback_or_clean_recovered (all=all@entry=1) at /home/mleich/bb-10.2-MDEV-14711/storage/innobase/trx/trx0roll.cc:799 799 { (rr) when Current event: 81974 The recovery reached the following code, whose intention is to abort the rollback of recovered transactions if shutdown has been initiated: case TRX_STATE_ACTIVE: if (!srv_is_being_started && !srv_undo_sources && srv_fast_shutdown) { fake_prepared: trx->state = TRX_STATE_PREPARED; *all = FALSE; goto func_exit; } But, at this point we are still starting up, not shutting down! This means that the fix of MDEV-21217 is incomplete and we should have replaced the !srv_is_being_started with srv_shutdown_state != SRV_SHUTDOWN_NONE also in this function. I filed this separate bug as MDEV-23339 . This reported bug is still open and an rr replay trace will be needed for the case where the SELECT is returning nonempty table contents.

Marko Mäkelä added a comment - 2020-07-30 06:47

mleich, in ~~MDEV-23339~~ I enabled the test innodb.innodb_force_recovery_rollback, adding a LIMIT 0 clause to work around this bug. I think that you should revise MDEV-22889.test to do something like the following:

--disable_abort_on_error

SELECT COUNT(*) INTO @count FROM t0 LOCK IN SHARE MODE;

let $ml_errno= $mysql_errno;

let $count=`SELECT @count`;

If either let assigns a nonzero value, that is an error. A nonzero $count is what we are looking for in this ticket.

Marko Mäkelä added a comment - 2020-07-30 06:47 mleich , in MDEV-23339 I enabled the test innodb.innodb_force_recovery_rollback , adding a LIMIT 0 clause to work around this bug. I think that you should revise MDEV-22889.test to do something like the following: --disable_abort_on_error SELECT COUNT (*) INTO @ count FROM t0 LOCK IN SHARE MODE; let $ml_errno= $mysql_errno; let $ count =` SELECT @ count `; If either let assigns a nonzero value, that is an error. A nonzero $count is what we are looking for in this ticket.

Matthias Leich added a comment - 2020-07-30 07:55

Ok, I was so overwhelmed by the fact that the first SQL statement after a DB server restart could fail because of not getting a lock and
assumed that the base reason might be the same.

Matthias Leich added a comment - 2020-07-30 07:55 Ok, I was so overwhelmed by the fact that the first SQL statement after a DB server restart could fail because of not getting a lock and assumed that the base reason might be the same.

MariaDB Server

InnoDB occasionally breaks the isolation of a recovered transaction

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration