MariaDB Server / MDEV-14589

InnoDB should not lock a delete-marked record

Details

    Description

      When a transaction is able to delete-mark a record and commit the change, no active transaction may possibly have seen (and locked) the record. The record can be removed by the purge thread at any time, and it is logically equivalent to a gap (non-existing record).

      However, currently InnoDB transactions will lock committed delete-marked records.

      How to repeat:
      Run at least 3 identical DELETE statements in parallel, and possibly some INSERT statements with a key value slightly smaller than that of the deleted record. InnoDB will report a bogus deadlock between two of the DELETE statements:

      trx (1) has obtained an X-lock on the delete-marked committed record
      trx (2) is waiting for an X-lock on the record and the preceding gap
      trx (1) is waiting for an X-lock on the record and the preceding gap
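
      The scenario could be sketched as follows (an illustrative fragment, not an actual test case from this ticket; the table name and values are made up):

```sql
-- Set up a committed delete-marked record: after COMMIT, the deleted row
-- remains in the index as delete-marked until the purge thread removes it.
CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t VALUES(10);
DELETE FROM t WHERE a=10;
COMMIT;

-- Then, before purge removes the record, in concurrent connections:
-- connection 1: BEGIN; DELETE FROM t WHERE a=10;
--   (acquires an X-lock on the committed delete-marked record)
-- connection 2: BEGIN; DELETE FROM t WHERE a=10;
--   (waits for an X-lock on the record and the preceding gap)
-- connection 1: INSERT INTO t VALUES(9);
--   (conflicts with the gap lock requested by connection 2: bogus deadlock)
```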

      Suggested fix:
      Transactions should ignore committed delete-marked records when they come across them.

      The solution we will try to implement: at the READ COMMITTED isolation level, DELETE will ignore any delete-marked rows.

      This is the same as https://bugs.mysql.com/bug.php?id=19762

    Activity

            It would be nice if the test included a case where another transaction tries to insert into that gap at an isolation level above READ COMMITTED. But the actual fix looks correct to me and is OK to push.

            jplindst Jan Lindström (Inactive) added a comment

            The added test case innodb.lock_deleted now covers the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ isolation levels.

            At the REPEATABLE READ isolation level, the committed delete-marked secondary index record will be locked and will prevent an INSERT of the same record. The gap preceding the delete-marked record will not be locked. This seems wrong at the SERIALIZABLE isolation level (we should lock all the visited records and prevent updates or inserts).
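
            The REPEATABLE READ behaviour described above could be sketched like this (illustrative schema; not the committed innodb.lock_deleted test):

```sql
-- Assume t(a INT PRIMARY KEY, b INT, KEY(b)), where the row (1,1) was
-- deleted and committed but not yet purged, so the secondary index
-- record for b=1 is still present, as delete-marked.
-- connection 1 (REPEATABLE READ):
BEGIN;
SELECT * FROM t WHERE b=1 FOR UPDATE;
-- locks the committed delete-marked secondary index record;
-- the gap preceding it remains unlocked.

-- connection 2 would then observe:
-- INSERT INTO t VALUES(1,1); -- blocks: the delete-marked record b=1 is locked
-- INSERT INTO t VALUES(0,0); -- succeeds: the preceding gap is not locked
```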

            But we already know that the SERIALIZABLE isolation level is broken by design in other ways: if records were committed by a later-started transaction, the InnoDB SERIALIZABLE level will happily lock those records and treat them as READ COMMITTED.

            marko Marko Mäkelä added a comment

            My claim on SERIALIZABLE not being quite serializable is my opinion. Microsoft SQL Server documentation seems to define SERIALIZABLE in the same way as InnoDB:

            The highest isolation level, serializable, guarantees that a transaction will retrieve exactly the same data every time it repeats a read operation, but it does this by performing a level of locking that is likely to impact other users in multi-user systems.

            Below is a copy of my comment to a blog post from February 2018:

            I discussed the SERIALIZABLE behaviour with my colleagues at MariaDB today.
            I came up with a scenario like this:

            connection rr; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
            START TRANSACTION WITH CONSISTENT SNAPSHOT; # assigns a read view immediately
            connection ser; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
            BEGIN;
            connection rr;
            send set @a=(SELECT a FROM t1 LIMIT 1);
            connection ser;
            UPDATE t1 SET a=a+1;
            connection rr;
            reap; INSERT INTO t2 SET a=@a; COMMIT;
            disconnect rr;
            connection ser;
            --error ER_SNAPSHOT_TOO_OLD # would be nice to get
            SELECT * FROM t2;
            

            The transaction of connection rr would see the old data of t1.a, independent of when the concurrent UPDATE was completed. Then, it would insert and commit this old data to t2.a, and the supposedly SERIALIZABLE transaction would see the old data in t2. This does not look serializable to me.
            This could mean that if you want SERIALIZABLE execution, all write transactions in the system must be serializable. All literature and RDBMS documentation that I was able to find seemed to imply this. So, maybe it is not a bug per se. Maybe it is just simply wrong to expect consistent results if some transactions mix writes with non-locking (MVCC) reads.

            Theoretically, we could introduce a new PARANOID_SERIALIZABLE isolation level that would return the ER_SNAPSHOT_TOO_OLD error (simply by looking up and read-locking the primary key record and checking if its DB_TRX_ID is greater than the current transaction ID). Unfortunately, I think that it could return this error also in a case where the later-started-and-earlier-committed transaction was at SERIALIZABLE isolation level and the execution would actually have been serializable.

            marko Marko Mäkelä added a comment

            On customer request, I backported this to MariaDB 10.1.34.

            marko Marko Mäkelä added a comment

            After discussing this with Alexey Gotsman, who is one of the contributors to the Jepsen tool, I tried to revisit my comment about the SERIALIZABLE isolation level. The above SQL snippet is incomplete pseudocode. I had to substantially modify it to get it to run with the mtr tool:

            --source include/have_innodb.inc
             
            CREATE TABLE t1(a INT PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t1 VALUES(10),(20),(30);
            CREATE TABLE t2(a INT PRIMARY KEY) ENGINE=InnoDB;
             
            connect rr,localhost,root;
            SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
            START TRANSACTION WITH CONSISTENT SNAPSHOT;
            connect ser,localhost,root;
            SET innodb_snapshot_isolation=ON;
            SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
            BEGIN;
            connection rr;
            send set @a=(SELECT a FROM t1 LIMIT 1);
            connection ser;
            send UPDATE t1 SET a=a+1 WHERE a>15;
            connection rr;
            reap; INSERT INTO t2 SET a=@a; COMMIT;
            disconnect rr;
            connection ser;
            reap;
            SELECT * FROM t1;
            SELECT * FROM t2;
            disconnect ser;
             
            connection default;
            DROP TABLE t1,t2;
            

            Note: the parameter innodb_snapshot_isolation was recently introduced in a fix of MDEV-32898, causing an error ER_CHECKREAD to be returned if an attempt is made to acquire a lock on a record that does not exist in the current read view. The result (a)=(10) for the final SELECT appears to be fine after all. For the penultimate SELECT we get the results (a)=(10),(21),(31). If I remove the condition WHERE a>15, then the test will fail:

            10.6 8dda6027019c93fb58d34095cb15908cdfa209e6

            mysqltest: At line 19: query 'reap' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction
            

            So, there does not seem to be any issue after all; alternatively, we should spend some effort to reproduce that scenario once more.

            By the way, the bug fix that was applied is addressing a different scenario, which is demonstrated by the test case that I had posted on 2017-12-11. When there are many comments in a ticket, Jira may collapse older ones, but you can expand them.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              monty Michael Widenius
              Votes: 1
              Watchers: 7

