[MDEV-20483] trx_lock_t::table_locks is not a subset of trx_lock_t::trx_locks Created: 2019-09-03  Updated: 2020-08-07  Resolved: 2019-09-17

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.2.28, 10.3.19, 10.4.9

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: corruption, transactions

Attachments: File MDEV-20483-B.test     File MDEV-20483-C.test    
Issue Links:
Relates
relates to MDEV-13426 Assertion failure ib_vector_is_empty ... Closed
relates to MDEV-15326 InnoDB: Failing assertion: !other_loc... Closed

 Description   

Testing a fix of MDEV-15326 revealed that if the call to trx->lock.table_locks.clear(); is skipped at transaction commit when trx->lock.trx_locks is empty, occasionally a transaction would be left with nonempty trx->lock.table_locks list. This proves that the table_locks is not a subset of trx_locks, like it is expected to be.

We will need a test case for this, so that the bug can be found and fixed. To repeat the assertion failures, revert the following part of the MDEV-15326 fix:

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 1dcca7c1f72..b06e15d4f37 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -561,8 +561,6 @@ inline void trx_t::release_locks()
 
   if (UT_LIST_GET_LEN(lock.trx_locks))
     lock_trx_release_locks(this);
-  else
-    lock.table_locks.clear(); /* Work around MDEV-20483 */
 }
 
 /********************************************************************//**

This should lead to occasional assertion failures on trx->lock.table_locks.empty() on transaction commit.



 Comments   
Comment by Matthias Leich [ 2019-09-03 ]

Assertion `(trx)->lock.table_locks.empty()' failed

Comment by Matthias Leich [ 2019-09-05 ]

Please try the uploaded tests.
MDEV-20483-C.test  replays the assert.
MDEV-20483-B.test
              Is too some huge extent like MDEV-20483-C.test.
              It tries in addition first to insert illegal values via INSERT ... VALUES ...
              This gets denied (ERROR 22003: Out of range value for column 'pk' at row 1).
              Than it tries to insert the same ollegal values via INSERT ...  SELECT and harvests
              again ERROR 22003: Out of range value for column 'pk' at row 1
              whereas the same INSERT ... SELECT has success in MDEV-20483-C.test.
 
That MDEV-20483-C.test shows finally the assert is one thing.
But that MDEV-20483-B.test has just a denied statement in addition and does not
harvest finally the same assert is IMHO quite suspicious.
 

Comment by Marko Mäkelä [ 2019-09-06 ]

This appears to be intentional, after all. While the trx_lock_t::table_locks is not empty, it is filled with NULL pointers. The test MDEV-20483-C.test does not fail with 10.3 2842c369851a8afc2a944ce6f4f60fa052f20969 and the following patch:

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index f612be4d177..59a27ee112e 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1363,6 +1363,13 @@ trx_commit_in_memory(
 		DBUG_LOG("trx", "Autocommit in memory: " << trx);
 		trx->state = TRX_STATE_NOT_STARTED;
 	} else {
+#ifdef UNIV_DEBUG
+		if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) {
+			for (const lock_t* l : trx->lock.table_locks) {
+				ut_ad(!l);
+			}
+		}
+#endif
 		trx->commit_state();
 
 		if (trx->id) {

This patch checks that when trx_locks is empty and table_locks is not, then table_locks must consist of only NULL pointers.
The NULL pointers are being written at least by lock_trx_table_locks_remove().

mleich, can you please retest the latest 10.3 with the above patch applied? If the assertion in this patch does not fail, then we can conclude that there is no bug after all.

I am asking to retest, because only some callers of lock_table_dequeue() invoke also lock_trx_table_locks_remove(). So, I cannot rule out a bug yet.

Comment by Matthias Leich [ 2019-09-09 ]

The patch direct above avoids to hit the assertion Assertion `(trx)->lock.table_locks.empty()' failed
but certain tests hit
    mysqld: storage/innobase/trx/trx0trx.cc:1369: void trx_commit_in_memory(trx_t*, const mtr_t*): Assertion `!l' failed.
 
MTR Based replay test:
--source include/have_innodb.inc
--connect (con15,localhost,root,,)
CREATE TABLE t1 ( col1 INT, pk INTEGER AUTO_INCREMENT, PRIMARY KEY (pk)) ENGINE = InnoDB  ;
INSERT INTO t1 VALUES (NULL, NULL)  ;
 
--connect (con16,localhost,root,,)
--connection con16
# Without AUTOCOMMIT = OFF no assert!
SET AUTOCOMMIT = OFF  ;
# With LOCK TABLES t1 WRITE (no CONCURRENT) no assert! Just the MDL timout kicks in.
# LOCK TABLES t1 WRITE ;
  LOCK TABLES t1 WRITE CONCURRENT ;
 
--connection con15
SET SESSION lock_wait_timeout = 2;
SET SESSION innodb_lock_wait_timeout = 1;
 
# Variant 1 which replays
# SELECT * FROM t1 INTO OUTFILE 'load_t1' ; LOAD DATA INFILE 'load_t1' REPLACE INTO TABLE t1 ;
 
# Variant 2 which replays
  REPLACE INTO t1 (col1) SELECT col1 FROM (SELECT col1 FROM t1) AS load_tab ;
 
DROP TABLE t1;

Comment by Thirunarayanan Balathandayuthapani [ 2019-09-12 ]

The following patch removes the lock from table_locks vector. Previously lock_wait_timeout_thread doesn't remove
the table lock from vector. The following patch address the code path.

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index b7e17097546..9fdc8feee90 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -6480,6 +6480,8 @@ lock_cancel_waiting_and_release(
                }
 
                lock_table_dequeue(lock);
+               /* Remove the lock from table lock vector too. */
+               lock_trx_table_locks_remove(lock);
        }
 
        /* Reset the wait flag and the back pointer to lock in trx. */
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 1dcca7c1f72..200335d605e 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1695,6 +1695,13 @@ trx_commit_in_memory(
                DBUG_LOG("trx", "Autocommit in memory: " << trx);
                trx->state = TRX_STATE_NOT_STARTED;
        } else {
+#ifdef UNIV_DEBUG
+               if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) {
+                       for (const lock_t* l : trx->lock.table_locks) {
+                               ut_ad(!l);
+                       }
+               }
+#endif
                trx_mutex_enter(trx);
                trx->commit_state();
                trx_mutex_exit(trx);

Above patch based on 10.2 (df4dee4b84ddc34799fa3a9648c142f12564597f) version.
marko still have doubt that there is another place where the call is missing.
So mleich, can you please run the rqg test with the above patch to find out any other issue ?

Comment by Matthias Leich [ 2019-09-17 ]

When testing 10.2 + the patch above with RQG I got the usual fraction of known+open bugs.

Comment by Marko Mäkelä [ 2019-09-17 ]

Thanks, thiru! You forgot to remove the work-around from trx_t::release_locks() as part of your patch. Please remove that as well.

Indeed, you added the only missing call. The one in lock_release() is intentionally missing, as the only caller lock_trx_release_locks() is invoking trx->lock.table_locks.clear().

As far as I can tell, based on a comment in lock_table(), the primary motivation for table_locks to exist is that while trx_locks can be modified by other threads (which can be converting implicit locks to explicit ones), table_locks is only accessed by the thread where the transaction is running. Despite this, lock_table_create() is holding trx->mutex (and lock_sys->mutex) while appending to table_locks.

Given that the only extra locks in table_locks were cancelled wait locks, and that the only non-debug read access of table_locks is lock_table_has(), which ignores lock waits, there does not seem to be any correctness impact of this bug.

Generated at Thu Feb 08 08:59:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.