MariaDB Server / MDEV-14589

InnoDB should not lock a delete-marked record

Details

    Description

      When a transaction is able to delete-mark a record and commit the change, no active transaction may possibly have seen (and locked) the record. The record can be removed by the purge thread at any time, and it is logically equivalent to a gap (non-existing record).

      However, currently InnoDB transactions will lock committed delete-marked records.

      How to repeat:
      Run at least 3 identical DELETE statements in parallel, and possibly some INSERT statements with a key value slightly smaller than that of the deleted record. InnoDB will report a bogus deadlock between two of the DELETE statements:

      trx (1) has obtained an X-lock on the delete-marked committed record
      trx (2) is waiting for an X-lock on the record and the preceding gap
      trx (1) is waiting for an X-lock on the record and the preceding gap
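
      The scenario could be sketched as follows (an illustrative fragment, not an actual test case from this ticket; the table name and values are made up):

```sql
-- Set up a committed delete-marked record: after COMMIT, the deleted row
-- remains in the index as delete-marked until the purge thread removes it.
CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t VALUES(10);
DELETE FROM t WHERE a=10;
COMMIT;

-- Then, before purge removes the record, in concurrent connections:
-- connection 1: BEGIN; DELETE FROM t WHERE a=10;
--   (acquires an X-lock on the committed delete-marked record)
-- connection 2: BEGIN; DELETE FROM t WHERE a=10;
--   (waits for an X-lock on the record and the preceding gap)
-- connection 1: INSERT INTO t VALUES(9);
--   (conflicts with the gap lock requested by connection 2: bogus deadlock)
```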

      Suggested fix:
      Transactions should ignore committed delete-marked records when they come across them.

      The solution we will try to implement: at the READ COMMITTED isolation level, DELETE will ignore any delete-marked rows.

      This is the same as https://bugs.mysql.com/bug.php?id=19762

    Activity

            It would be nice if the test included a case where another transaction tries to insert into that gap at an isolation level above READ COMMITTED. But the actual fix looks correct to me and is OK to push.

            jplindst Jan Lindström (Inactive) added a comment

            The added test case innodb.lock_deleted now covers the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ isolation levels.

            At the REPEATABLE READ isolation level, the committed delete-marked secondary index record will be locked and will prevent an INSERT of the same record. The gap preceding the delete-marked record will not be locked. This seems wrong at the SERIALIZABLE isolation level (we should lock all the visited records and prevent updates or inserts).
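
            The REPEATABLE READ behaviour described above could be sketched like this (illustrative schema; not the committed innodb.lock_deleted test):

```sql
-- Assume t(a INT PRIMARY KEY, b INT, KEY(b)), where the row (1,1) was
-- deleted and committed but not yet purged, so the secondary index
-- record for b=1 is still present, as delete-marked.
-- connection 1 (REPEATABLE READ):
BEGIN;
SELECT * FROM t WHERE b=1 FOR UPDATE;
-- locks the committed delete-marked secondary index record;
-- the gap preceding it remains unlocked.

-- connection 2 would then observe:
-- INSERT INTO t VALUES(1,1); -- blocks: the delete-marked record b=1 is locked
-- INSERT INTO t VALUES(0,0); -- succeeds: the preceding gap is not locked
```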

            But we already know that the SERIALIZABLE isolation level is broken by design in other ways: if records were committed by a later-started transaction, the InnoDB SERIALIZABLE level will happily lock those records and treat them as READ COMMITTED.

            marko Marko Mäkelä added a comment

            My claim on SERIALIZABLE not being quite serializable is my opinion. Microsoft SQL Server documentation seems to define SERIALIZABLE in the same way as InnoDB:

            The highest isolation level, serializable, guarantees that a transaction will retrieve exactly the same data every time it repeats a read operation, but it does this by performing a level of locking that is likely to impact other users in multi-user systems.

            Below is a copy of my comment to a blog post from February 2018:

            I discussed the SERIALIZABLE behaviour with my colleagues at MariaDB today.
            I came up with a scenario like this:

            connection rr; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
            START TRANSACTION WITH CONSISTENT SNAPSHOT; # assigns a read view immediately
            connection ser; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
            BEGIN;
            connection rr;
            send set @a=(SELECT a FROM t1 LIMIT 1);
            connection ser;
            UPDATE t1 SET a=a+1;
            connection rr;
            reap; INSERT INTO t2 SET a=@a; COMMIT;
            disconnect rr;
            connection ser;
            --error ER_SNAPSHOT_TOO_OLD # would be nice to get
            SELECT * FROM t2;
            

            The transaction of connection rr would see the old data of t1.a, independent of when the concurrent UPDATE was completed. Then, it would insert and commit this old data to t2.a, and the supposedly SERIALIZABLE transaction would see the old data in t2. This does not look serializable to me.
            This could mean that if you want SERIALIZABLE execution, all write transactions in the system must be serializable. All literature and RDBMS documentation that I was able to find seemed to imply this. So, maybe it is not a bug per se. Maybe it is just simply wrong to expect consistent results if some transactions mix writes with non-locking (MVCC) reads.

            Theoretically, we could introduce a new PARANOID_SERIALIZABLE isolation level that would return the ER_SNAPSHOT_TOO_OLD error (simply by looking up and read-locking the primary key record and checking if its DB_TRX_ID is greater than the current transaction ID). Unfortunately, I think that it could return this error also in a case where the later-started-and-earlier-committed transaction was at SERIALIZABLE isolation level and the execution would actually have been serializable.

            marko Marko Mäkelä added a comment

            On customer request, I backported this to MariaDB 10.1.34.

            marko Marko Mäkelä added a comment

            After discussing this with Alexey Gotsman, who is one of the contributors to the Jepsen tool, I tried to revisit my comment about the SERIALIZABLE isolation level. The above SQL snippet is incomplete pseudocode. I had to substantially modify it to get it to run with the mtr tool:

            --source include/have_innodb.inc
             
            CREATE TABLE t1(a INT PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 VALUES(10),(20),(30);
            CREATE TABLE t2(a INT PRIMARY KEY) ENGINE=InnoDB;
             
            connect rr,localhost,root;
            SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
            START TRANSACTION WITH CONSISTENT SNAPSHOT;
            connect ser,localhost,root;
            SET innodb_snapshot_isolation=ON;
            SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
            BEGIN;
            connection rr;
            send set @a=(SELECT a FROM t1 LIMIT 1);
            connection ser;
            send UPDATE t1 SET a=a+1 WHERE a>15;
            connection rr;
            reap; INSERT INTO t2 SET a=@a; COMMIT;
            disconnect rr;
            connection ser;
            reap;
            SELECT * FROM t1;
            SELECT * FROM t2;
            disconnect ser;
             
            connection default;
            DROP TABLE t1,t2;
            

            Note: the parameter innodb_snapshot_isolation was recently introduced in a fix of MDEV-32898, causing an error ER_CHECKREAD to be returned if an attempt is made to acquire a lock on a record that does not exist in the current read view. The result (a)=(10) for the final SELECT appears to be fine after all. For the penultimate SELECT we get the results (a)=(10),(21),(31). If I remove the condition WHERE a>15, then the test will fail:

            10.6 8dda6027019c93fb58d34095cb15908cdfa209e6

            mysqltest: At line 19: query 'reap' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction
            

            So, there does not seem to be any issue after all; alternatively, we should spend some effort to reproduce that scenario once more.

            By the way, the bug fix that was applied is addressing a different scenario, which is demonstrated by the test case that I had posted on 2017-12-11. When there are many comments in a ticket, Jira may collapse older ones, but you can expand them.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              monty Michael Widenius
              Votes: 1
              Watchers: 7

