[MDEV-14589] InnoDB should not lock a delete-marked record Created: 2017-12-05  Updated: 2023-11-30  Due: 2017-12-15  Resolved: 2017-12-11

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: None
Fix Version/s: 10.2.12, 10.3.3, 10.1.34

Type: Bug Priority: Major
Reporter: Michael Widenius Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: upstream, upstream-5.5, upstream-wontfix

Issue Links:
Relates
relates to MDEV-26475 InnoDB must not lock delete-marked re... Open
relates to MDEV-26642 Weird SELECT view when a record is mo... Confirmed
relates to MDEV-32898 Phantom rows caused by UPDATE of PRIM... Stalled
relates to MDEV-10962 Deadlock with 3 concurrent DELETEs by... Closed
relates to MDEV-16675 Unnecessary explicit lock acquisition... Closed
relates to MDEV-20605 Awaken transaction can miss inserted ... Closed

 Description   

When a transaction is able to delete-mark a record and commit the change, no active transaction can possibly have seen (and locked) the record. The record can be removed by the purge thread at any time, and it is logically equivalent to a gap (a non-existing record).

However, currently InnoDB transactions will lock committed delete-marked records.

How to repeat:
Run at least 3 identical DELETE statements in parallel, possibly together with an INSERT of a key value slightly smaller than that of the deleted record. InnoDB will report a bogus deadlock between 2 of the DELETE statements:

trx (1) has obtained an X-lock on the delete-marked committed record
trx (2) is waiting for an X-lock on the record and the preceding gap
trx (1) is waiting for an X-lock on the record and the preceding gap

Suggested fix:
Transactions should ignore committed delete-marked records when they come across them.

The solution we will try to implement is that at the READ COMMITTED isolation level, DELETE will ignore any delete-marked rows.

This is the same as https://bugs.mysql.com/bug.php?id=19762
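The suggested rule can be sketched as a toy model (Python, illustrative only; `Rec`, `must_lock`, and the active-transaction bookkeeping are hypothetical stand-ins for InnoDB internals, not the actual implementation):

```python
from dataclasses import dataclass

READ_COMMITTED = "READ COMMITTED"
REPEATABLE_READ = "REPEATABLE READ"

@dataclass
class Rec:
    key: int
    delete_marked: bool
    trx_id: int  # DB_TRX_ID of the transaction that last modified the record

def must_lock(rec, isolation, active_trx_ids):
    """Decide whether a DELETE scan has to lock this record.

    Toy model of the suggested fix: at READ COMMITTED, a committed
    delete-marked record is logically equivalent to a gap, so no
    record lock is taken on it."""
    committed = rec.trx_id not in active_trx_ids
    if isolation == READ_COMMITTED and rec.delete_marked and committed:
        return False  # skip: logically a non-existing record
    return True       # otherwise lock as before

# A record delete-marked by an already-committed transaction:
rec = Rec(key=1, delete_marked=True, trx_id=42)
print(must_lock(rec, READ_COMMITTED, active_trx_ids=set()))   # False
print(must_lock(rec, REPEATABLE_READ, active_trx_ids=set()))  # True
```

Under this model, the bogus deadlock cannot arise at READ COMMITTED, because no DELETE ever waits for a lock on the committed delete-marked record.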



 Comments   
Comment by Marko Mäkelä [ 2017-12-08 ]

I think that we can easily fix one special case of this problem: At READ COMMITTED isolation level, skip delete-marked records.

It may be the case that at READ UNCOMMITTED, we are already skipping delete-marked records.

Fixing the full problem is challenging. It might be easier when we have a per-table partitioned undo log format that would allow out-of-order purge.

Comment by Marko Mäkelä [ 2017-12-11 ]

The following test case demonstrates the problem.

--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_debug_sync.inc
 
--source include/count_sessions.inc
 
CREATE TABLE t1(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB;
 
INSERT INTO t1 VALUES(1,1);
COMMIT;
BEGIN;
DELETE FROM t1;
 
connect(stop_purge, localhost, root,,);
START TRANSACTION WITH CONSISTENT SNAPSHOT;
 
connection default;
COMMIT;
 
SET DEBUG_SYNC='row_ins_sec_index_unique SIGNAL inserted WAIT_FOR locked';
send INSERT INTO t1 VALUES(1,1);
 
connect(delete, localhost, root,,);
SET DEBUG_SYNC='now WAIT_FOR inserted';
SET DEBUG_SYNC='innodb_row_search_for_mysql_exit SIGNAL locked';
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;
send DELETE FROM t1 WHERE b=1;
 
connection default;
reap;
connection delete;
reap;
COMMIT;
disconnect delete;
disconnect stop_purge;
 
connection default;
SET DEBUG_SYNC='RESET';
DROP TABLE t1;
--source include/wait_until_count_sessions.inc

The test will not hang if the BEGIN before the DELETE is omitted.
The test will also hang if the transaction isolation level is set to READ UNCOMMITTED.

I think that both at READ COMMITTED and READ UNCOMMITTED, it is safe to ignore committed delete-marked records without placing locks on them.
With REPEATABLE READ (the default) and SERIALIZABLE, we must continue to acquire locks for now. The generic solution (see MySQL Bug #19762 for some discussion) is hard.

Comment by Marko Mäkelä [ 2017-12-11 ]

This turns out to be an omission in MySQL Bug#3300: The function row_unlock_for_mysql() is only effective for clustered index records.

It would seem to be more efficient to avoid the lock acquisition on delete-marked records in row_search_mvcc() née row_search_for_mysql() in the first place. For clustered index records, this is easy, because we can determine whether the DB_TRX_ID in the delete-marked record refers to an active transaction. For secondary index records, we look up the matching clustered index record and then skip the lock if the clustered index record is not found or if it no longer refers to an active transaction.
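The decision described above might be sketched roughly like this (Python toy model; all names are illustrative, not the actual InnoDB functions; a secondary index record carries no DB_TRX_ID, hence the clustered-index lookup):

```python
from dataclasses import dataclass

@dataclass
class Rec:
    key: int
    delete_marked: bool
    trx_id: int  # DB_TRX_ID (only meaningful for clustered index records)

def can_skip_lock(rec, index_is_clustered, active_trx_ids, clust_lookup=None):
    """Return True if the lock on a delete-marked record can be avoided.

    Clustered index: the record itself carries DB_TRX_ID, so we can
    tell directly whether the delete-mark was committed.
    Secondary index: look up the matching clustered index record first;
    if it is gone (purged) or its modification is committed, skip."""
    if not rec.delete_marked:
        return False  # a live record must be locked as usual
    if index_is_clustered:
        return rec.trx_id not in active_trx_ids
    clust_rec = clust_lookup(rec.key) if clust_lookup else None
    if clust_rec is None:
        return True   # clustered record already purged: logically a gap
    return clust_rec.trx_id not in active_trx_ids
```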

Comment by Marko Mäkelä [ 2017-12-11 ]

On a related note, the ‘do not lock non-matching rows’ logic of MySQL Bug#3300 (also known as the ‘semi-consistent read’) only works on the clustered index. Lock conflicts will not be avoided on secondary index records.

	if (UNIV_LIKELY(prebuilt->row_read_type
			!= ROW_READ_TRY_SEMI_CONSISTENT)
	    || unique_search
	    || index != clust_index) {

		goto lock_wait_or_error;
	}

	/* The following call returns 'offsets'
	associated with 'old_vers' */
	row_sel_build_committed_vers_for_mysql(
		clust_index, prebuilt, rec,
		&offsets, &heap, &old_vers,
		need_vrow ? &vrow : NULL, &mtr);

Comment by Marko Mäkelä [ 2017-12-11 ]

bb-10.2-marko

Comment by Jan Lindström (Inactive) [ 2017-12-11 ]

It would be nice if the test included a case where another transaction tries to insert into that gap at an isolation level above READ COMMITTED. But the actual fix looks correct to me and is OK to push.

Comment by Marko Mäkelä [ 2017-12-11 ]

The added test case innodb.lock_deleted now covers the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ isolation levels.

At the REPEATABLE READ isolation level, the committed delete-marked secondary index record will be locked and will prevent an INSERT of the same record. The gap preceding the delete-marked record will not be locked. This seems wrong at the SERIALIZABLE isolation level (we should lock all the visited records and prevent updates or inserts).

But we already know that the SERIALIZABLE isolation level is broken by design in other ways: if records were committed by a later-started transaction, the InnoDB SERIALIZABLE level will happily lock those records and treat them as READ COMMITTED.

Comment by Marko Mäkelä [ 2018-04-30 ]

My claim on SERIALIZABLE not being quite serializable is my opinion. Microsoft SQL Server documentation seems to define SERIALIZABLE in the same way as InnoDB:

The highest isolation level, serializable, guarantees that a transaction will retrieve exactly the same data every time it repeats a read operation, but it does this by performing a level of locking that is likely to impact other users in multi-user systems.

Below is a copy of my comment to a blog post from February 2018:

I discussed the SERIALIZABLE behaviour with my colleagues at MariaDB today.
I came up with a scenario like this:

connection rr; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT; # assigns a read view immediately
connection ser; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
connection rr;
send set @a=(SELECT a FROM t1 LIMIT 1);
connection ser;
UPDATE t1 SET a=a+1;
connection rr;
reap; INSERT INTO t2 SET a=@a; COMMIT;
disconnect rr;
connection ser;
--error ER_SNAPSHOT_TOO_OLD # would be nice to get
SELECT * FROM t2;

The transaction of connection rr would see the old value of t1.a, independent of when the concurrent UPDATE was completed. Then, it would insert and commit this old data to t2.a, and the supposedly SERIALIZABLE transaction would see the old data in t2. This does not look serializable to me.
This could mean that if you want SERIALIZABLE execution, all write transactions in the system must be serializable. All literature and RDBMS documentation that I was able to find seemed to imply this. So, maybe it is not a bug per se. Maybe it is just simply wrong to expect consistent results if some transactions mix writes with non-locking (MVCC) reads.

Theoretically, we could introduce a new PARANOID_SERIALIZABLE isolation level that would return the ER_SNAPSHOT_TOO_OLD error (simply by looking up and read-locking the primary key record and checking if its DB_TRX_ID is greater than the current transaction ID). Unfortunately, I think that it could return this error also in a case where the later-started-and-earlier-committed transaction was at SERIALIZABLE isolation level and the execution would actually have been serializable.
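The hypothetical PARANOID_SERIALIZABLE check could be sketched as follows (Python, illustrative only; `SnapshotTooOld` stands in for the proposed ER_SNAPSHOT_TOO_OLD error, and transaction IDs are assumed to be assigned in start order):

```python
class SnapshotTooOld(Exception):
    """Stand-in for the proposed ER_SNAPSHOT_TOO_OLD error."""

def paranoid_serializable_check(rec_trx_id, reader_trx_id):
    """Hypothetical PARANOID_SERIALIZABLE rule: after read-locking the
    primary key record, fail if its DB_TRX_ID belongs to a transaction
    that started later than the reader (i.e. the reader's snapshot may
    be stale)."""
    if rec_trx_id > reader_trx_id:
        raise SnapshotTooOld(
            "record written by trx %d, reader is trx %d"
            % (rec_trx_id, reader_trx_id))
```

As noted above, this check would be overly pessimistic: it can fire even when the later-started-and-earlier-committed writer was itself SERIALIZABLE and the execution would actually have been serializable.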

Comment by Marko Mäkelä [ 2018-05-29 ]

On customer request, I backported this to MariaDB 10.1.34.

Generated at Thu Feb 08 08:14:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.