[MDEV-20483] trx_lock_t::table_locks is not a subset of trx_lock_t::trx_locks - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.2.28, 10.3.19, 10.4.9
Component/s: Storage Engine - InnoDB
Labels:
- corruption
- transactions

Description

Testing a fix of ~~MDEV-15326~~ revealed that if the call to trx->lock.table_locks.clear(); is skipped at transaction commit when trx->lock.trx_locks is empty, occasionally a transaction would be left with nonempty trx->lock.table_locks list. This proves that the table_locks is not a subset of trx_locks, like it is expected to be.

We will need a test case for this, so that the bug can be found and fixed. To repeat the assertion failures, revert the following part of the ~~MDEV-15326~~ fix:

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc

index 1dcca7c1f72..b06e15d4f37 100644

--- a/storage/innobase/trx/trx0trx.cc

+++ b/storage/innobase/trx/trx0trx.cc

@@ -561,8 +561,6 @@ inline void trx_t::release_locks()

   if (UT_LIST_GET_LEN(lock.trx_locks))

     lock_trx_release_locks(this);

-  else

-    lock.table_locks.clear(); /* Work around MDEV-20483 */

 /********************************************************************//**

This should lead to occasional assertion failures on trx->lock.table_locks.empty() on transaction commit.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

MDEV-20483-B.test
2019-09-05 18:29
0.9 kB
Matthias Leich
MDEV-20483-C.test
2019-09-05 18:29
1.0 kB
Matthias Leich

Issue Links

relates to

MDEV-13426 Assertion failure ib_vector_is_empty at trx0trx.cc line 1078

Closed

MDEV-15326 InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF(lock->trx->mysql_thd, FALSE) || wsrep_thd_is_BF(other_lock->trx->mysql_thd, FALSE)

Closed

Activity

Ascending order - Click to sort in descending order

Matthias Leich added a comment - 2019-09-03 18:16

Assertion `(trx)->lock.table_locks.empty()' failed

Matthias Leich added a comment - 2019-09-03 18:16 Assertion `(trx)->lock.table_locks.empty()' failed

Matthias Leich added a comment - 2019-09-05 18:38 - edited

Please try the uploaded tests.

MDEV-20483-C.test  replays the assert.

MDEV-20483-B.test

              Is too some huge extent like MDEV-20483-C.test.

              It tries in addition first to insert illegal values via INSERT ... VALUES ...

              This gets denied (ERROR 22003: Out of range value for column 'pk' at row 1).

              Than it tries to insert the same ollegal values via INSERT ...  SELECT and harvests

              again ERROR 22003: Out of range value for column 'pk' at row 1

              whereas the same INSERT ... SELECT has success in MDEV-20483-C.test.

That MDEV-20483-C.test shows finally the assert is one thing.

But that MDEV-20483-B.test has just a denied statement in addition and does not

harvest finally the same assert is IMHO quite suspicious.

Matthias Leich added a comment - 2019-09-05 18:38 - edited Please try the uploaded tests. MDEV-20483-C.test replays the assert. MDEV-20483-B.test Is too some huge extent like MDEV-20483-C.test. It tries in addition first to insert illegal values via INSERT ... VALUES ... This gets denied (ERROR 22003: Out of range value for column 'pk' at row 1). Than it tries to insert the same ollegal values via INSERT ... SELECT and harvests again ERROR 22003: Out of range value for column 'pk' at row 1 whereas the same INSERT ... SELECT has success in MDEV-20483-C.test. That MDEV-20483-C.test shows finally the assert is one thing. But that MDEV-20483-B.test has just a denied statement in addition and does not harvest finally the same assert is IMHO quite suspicious.

Marko Mäkelä added a comment - 2019-09-06 15:09

This appears to be intentional, after all. While the trx_lock_t::table_locks is not empty, it is filled with NULL pointers. The test MDEV-20483-C.test does not fail with 10.3 2842c369851a8afc2a944ce6f4f60fa052f20969 and the following patch:

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc

index f612be4d177..59a27ee112e 100644

--- a/storage/innobase/trx/trx0trx.cc

+++ b/storage/innobase/trx/trx0trx.cc

@@ -1363,6 +1363,13 @@ trx_commit_in_memory(

 		DBUG_LOG("trx", "Autocommit in memory: " << trx);

 		trx->state = TRX_STATE_NOT_STARTED;

 	} else {

+#ifdef UNIV_DEBUG

+		if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) {

+			for (const lock_t* l : trx->lock.table_locks) {

+				ut_ad(!l);

+			}

+		}

+#endif

 		trx->commit_state();

 		if (trx->id) {

This patch checks that when trx_locks is empty and table_locks is not, then table_locks must consist of only NULL pointers.
The NULL pointers are being written at least by lock_trx_table_locks_remove().

mleich, can you please retest the latest 10.3 with the above patch applied? If the assertion in this patch does not fail, then we can conclude that there is no bug after all.

I am asking to retest, because only some callers of lock_table_dequeue() invoke also lock_trx_table_locks_remove(). So, I cannot rule out a bug yet.

Marko Mäkelä added a comment - 2019-09-06 15:09 This appears to be intentional, after all. While the trx_lock_t::table_locks is not empty, it is filled with NULL pointers. The test MDEV-20483-C.test does not fail with 10.3 2842c369851a8afc2a944ce6f4f60fa052f20969 and the following patch: diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index f612be4d177..59a27ee112e 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1363,6 +1363,13 @@ trx_commit_in_memory( DBUG_LOG("trx", "Autocommit in memory: " << trx); trx->state = TRX_STATE_NOT_STARTED; } else { +#ifdef UNIV_DEBUG + if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) { + for (const lock_t* l : trx->lock.table_locks) { + ut_ad(!l); + } + } +#endif trx->commit_state(); if (trx->id) { This patch checks that when trx_locks is empty and table_locks is not, then table_locks must consist of only NULL pointers. The NULL pointers are being written at least by lock_trx_table_locks_remove() . mleich , can you please retest the latest 10.3 with the above patch applied? If the assertion in this patch does not fail, then we can conclude that there is no bug after all. I am asking to retest, because only some callers of lock_table_dequeue() invoke also lock_trx_table_locks_remove() . So, I cannot rule out a bug yet.

Matthias Leich added a comment - 2019-09-09 15:46

The patch direct above avoids to hit the assertion Assertion `(trx)->lock.table_locks.empty()' failed

but certain tests hit

    mysqld: storage/innobase/trx/trx0trx.cc:1369: void trx_commit_in_memory(trx_t*, const mtr_t*): Assertion `!l' failed.

MTR Based replay test:

--source include/have_innodb.inc

--connect (con15,localhost,root,,)

CREATE TABLE t1 ( col1 INT, pk INTEGER AUTO_INCREMENT, PRIMARY KEY (pk)) ENGINE = InnoDB  ;

INSERT INTO t1 VALUES (NULL, NULL)  ;

--connect (con16,localhost,root,,)

--connection con16

# Without AUTOCOMMIT = OFF no assert!

SET AUTOCOMMIT = OFF  ;

# With LOCK TABLES t1 WRITE (no CONCURRENT) no assert! Just the MDL timout kicks in.

# LOCK TABLES t1 WRITE ;

  LOCK TABLES t1 WRITE CONCURRENT ;

--connection con15

SET SESSION lock_wait_timeout = 2;

SET SESSION innodb_lock_wait_timeout = 1;

# Variant 1 which replays

# SELECT * FROM t1 INTO OUTFILE 'load_t1' ; LOAD DATA INFILE 'load_t1' REPLACE INTO TABLE t1 ;

# Variant 2 which replays

  REPLACE INTO t1 (col1) SELECT col1 FROM (SELECT col1 FROM t1) AS load_tab ;

DROP TABLE t1;

Matthias Leich added a comment - 2019-09-09 15:46 The patch direct above avoids to hit the assertion Assertion `(trx)->lock.table_locks.empty()' failed but certain tests hit mysqld: storage/innobase/trx/trx0trx.cc:1369: void trx_commit_in_memory(trx_t*, const mtr_t*): Assertion `!l' failed. MTR Based replay test: --source include/have_innodb.inc --connect (con15,localhost,root,,) CREATE TABLE t1 ( col1 INT, pk INTEGER AUTO_INCREMENT, PRIMARY KEY (pk)) ENGINE = InnoDB ; INSERT INTO t1 VALUES (NULL, NULL) ; --connect (con16,localhost,root,,) --connection con16 # Without AUTOCOMMIT = OFF no assert! SET AUTOCOMMIT = OFF ; # With LOCK TABLES t1 WRITE (no CONCURRENT) no assert! Just the MDL timout kicks in. # LOCK TABLES t1 WRITE ; LOCK TABLES t1 WRITE CONCURRENT ; --connection con15 SET SESSION lock_wait_timeout = 2; SET SESSION innodb_lock_wait_timeout = 1; # Variant 1 which replays # SELECT * FROM t1 INTO OUTFILE 'load_t1' ; LOAD DATA INFILE 'load_t1' REPLACE INTO TABLE t1 ; # Variant 2 which replays REPLACE INTO t1 (col1) SELECT col1 FROM (SELECT col1 FROM t1) AS load_tab ; DROP TABLE t1;

Thirunarayanan Balathandayuthapani added a comment - 2019-09-12 09:11

The following patch removes the lock from table_locks vector. Previously lock_wait_timeout_thread doesn't remove
the table lock from vector. The following patch address the code path.

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc

index b7e17097546..9fdc8feee90 100644

--- a/storage/innobase/lock/lock0lock.cc

+++ b/storage/innobase/lock/lock0lock.cc

@@ -6480,6 +6480,8 @@ lock_cancel_waiting_and_release(

                lock_table_dequeue(lock);

+               /* Remove the lock from table lock vector too. */

+               lock_trx_table_locks_remove(lock);

        /* Reset the wait flag and the back pointer to lock in trx. */

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc

index 1dcca7c1f72..200335d605e 100644

--- a/storage/innobase/trx/trx0trx.cc

+++ b/storage/innobase/trx/trx0trx.cc

@@ -1695,6 +1695,13 @@ trx_commit_in_memory(

                DBUG_LOG("trx", "Autocommit in memory: " << trx);

                trx->state = TRX_STATE_NOT_STARTED;

        } else {

+#ifdef UNIV_DEBUG

+               if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) {

+                       for (const lock_t* l : trx->lock.table_locks) {

+                               ut_ad(!l);

+                       }

+               }

+#endif

                trx_mutex_enter(trx);

                trx->commit_state();

                trx_mutex_exit(trx);

Above patch based on 10.2 (df4dee4b84ddc34799fa3a9648c142f12564597f) version.
marko still have doubt that there is another place where the call is missing.
So mleich, can you please run the rqg test with the above patch to find out any other issue ?

Thirunarayanan Balathandayuthapani added a comment - 2019-09-12 09:11 The following patch removes the lock from table_locks vector. Previously lock_wait_timeout_thread doesn't remove the table lock from vector. The following patch address the code path. diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index b7e17097546..9fdc8feee90 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -6480,6 +6480,8 @@ lock_cancel_waiting_and_release( } lock_table_dequeue(lock); + /* Remove the lock from table lock vector too. */ + lock_trx_table_locks_remove(lock); } /* Reset the wait flag and the back pointer to lock in trx. */ diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 1dcca7c1f72..200335d605e 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1695,6 +1695,13 @@ trx_commit_in_memory( DBUG_LOG("trx", "Autocommit in memory: " << trx); trx->state = TRX_STATE_NOT_STARTED; } else { +#ifdef UNIV_DEBUG + if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) { + for (const lock_t* l : trx->lock.table_locks) { + ut_ad(!l); + } + } +#endif trx_mutex_enter(trx); trx->commit_state(); trx_mutex_exit(trx); Above patch based on 10.2 (df4dee4b84ddc34799fa3a9648c142f12564597f) version. marko still have doubt that there is another place where the call is missing. So mleich , can you please run the rqg test with the above patch to find out any other issue ?

Matthias Leich added a comment - 2019-09-17 09:52

When testing 10.2 + the patch above with RQG I got the usual fraction of known+open bugs.

Matthias Leich added a comment - 2019-09-17 09:52 When testing 10.2 + the patch above with RQG I got the usual fraction of known+open bugs.

Marko Mäkelä added a comment - 2019-09-17 12:12

Thanks, thiru! You forgot to remove the work-around from trx_t::release_locks() as part of your patch. Please remove that as well.

Indeed, you added the only missing call. The one in lock_release() is intentionally missing, as the only caller lock_trx_release_locks() is invoking trx->lock.table_locks.clear().

As far as I can tell, based on a comment in lock_table(), the primary motivation for table_locks to exist is that while trx_locks can be modified by other threads (which can be converting implicit locks to explicit ones), table_locks is only accessed by the thread where the transaction is running. Despite this, lock_table_create() is holding trx->mutex (and lock_sys->mutex) while appending to table_locks.

Given that the only extra locks in table_locks were cancelled wait locks, and that the only non-debug read access of table_locks is lock_table_has(), which ignores lock waits, there does not seem to be any correctness impact of this bug.

Marko Mäkelä added a comment - 2019-09-17 12:12 Thanks, thiru ! You forgot to remove the work-around from trx_t::release_locks() as part of your patch. Please remove that as well. Indeed, you added the only missing call. The one in lock_release() is intentionally missing, as the only caller lock_trx_release_locks() is invoking trx->lock.table_locks.clear() . As far as I can tell, based on a comment in lock_table() , the primary motivation for table_locks to exist is that while trx_locks can be modified by other threads (which can be converting implicit locks to explicit ones), table_locks is only accessed by the thread where the transaction is running. Despite this, lock_table_create() is holding trx->mutex (and lock_sys->mutex ) while appending to table_locks . Given that the only extra locks in table_locks were cancelled wait locks, and that the only non-debug read access of table_locks is lock_table_has() , which ignores lock waits, there does not seem to be any correctness impact of this bug.

People

Assignee:: Thirunarayanan Balathandayuthapani

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2019-09-03 18:10

Updated:: 2020-08-07 12:26

Resolved:: 2019-09-17 14:36

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration