Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-20483

trx_lock_t::table_locks is not a subset of trx_lock_t::trx_locks

Details

    Description

      Testing a fix of MDEV-15326 revealed that if the call to trx->lock.table_locks.clear(); is skipped at transaction commit when trx->lock.trx_locks is empty, occasionally a transaction would be left with nonempty trx->lock.table_locks list. This proves that the table_locks is not a subset of trx_locks, like it is expected to be.

      We will need a test case for this, so that the bug can be found and fixed. To repeat the assertion failures, revert the following part of the MDEV-15326 fix:

      diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
      index 1dcca7c1f72..b06e15d4f37 100644
      --- a/storage/innobase/trx/trx0trx.cc
      +++ b/storage/innobase/trx/trx0trx.cc
      @@ -561,8 +561,6 @@ inline void trx_t::release_locks()
       
         if (UT_LIST_GET_LEN(lock.trx_locks))
           lock_trx_release_locks(this);
      -  else
      -    lock.table_locks.clear(); /* Work around MDEV-20483 */
       }
       
       /********************************************************************//**
      

      This should lead to occasional assertion failures on trx->lock.table_locks.empty() on transaction commit.

      Attachments

        1. MDEV-20483-B.test
          0.9 kB
          Matthias Leich
        2. MDEV-20483-C.test
          1.0 kB
          Matthias Leich

        Issue Links

          Activity

            Assertion `(trx)->lock.table_locks.empty()' failed

            mleich Matthias Leich added a comment - Assertion `(trx)->lock.table_locks.empty()' failed
            mleich Matthias Leich added a comment - - edited

            Please try the uploaded tests.
            MDEV-20483-C.test  replays the assert.
            MDEV-20483-B.test
                          Is too some huge extent like MDEV-20483-C.test.
                          It tries in addition first to insert illegal values via INSERT ... VALUES ...
                          This gets denied (ERROR 22003: Out of range value for column 'pk' at row 1).
                          Than it tries to insert the same ollegal values via INSERT ...  SELECT and harvests
                          again ERROR 22003: Out of range value for column 'pk' at row 1
                          whereas the same INSERT ... SELECT has success in MDEV-20483-C.test.
             
            That MDEV-20483-C.test shows finally the assert is one thing.
            But that MDEV-20483-B.test has just a denied statement in addition and does not
            harvest finally the same assert is IMHO quite suspicious.
             
            
            

            mleich Matthias Leich added a comment - - edited Please try the uploaded tests. MDEV-20483-C.test replays the assert. MDEV-20483-B.test Is too some huge extent like MDEV-20483-C.test. It tries in addition first to insert illegal values via INSERT ... VALUES ... This gets denied (ERROR 22003: Out of range value for column 'pk' at row 1). Than it tries to insert the same ollegal values via INSERT ... SELECT and harvests again ERROR 22003: Out of range value for column 'pk' at row 1 whereas the same INSERT ... SELECT has success in MDEV-20483-C.test.   That MDEV-20483-C.test shows finally the assert is one thing. But that MDEV-20483-B.test has just a denied statement in addition and does not harvest finally the same assert is IMHO quite suspicious.  

            This appears to be intentional, after all. While the trx_lock_t::table_locks is not empty, it is filled with NULL pointers. The test MDEV-20483-C.test does not fail with 10.3 2842c369851a8afc2a944ce6f4f60fa052f20969 and the following patch:

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index f612be4d177..59a27ee112e 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1363,6 +1363,13 @@ trx_commit_in_memory(
             		DBUG_LOG("trx", "Autocommit in memory: " << trx);
             		trx->state = TRX_STATE_NOT_STARTED;
             	} else {
            +#ifdef UNIV_DEBUG
            +		if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) {
            +			for (const lock_t* l : trx->lock.table_locks) {
            +				ut_ad(!l);
            +			}
            +		}
            +#endif
             		trx->commit_state();
             
             		if (trx->id) {
            

            This patch checks that when trx_locks is empty and table_locks is not, then table_locks must consist of only NULL pointers.
            The NULL pointers are being written at least by lock_trx_table_locks_remove().

            mleich, can you please retest the latest 10.3 with the above patch applied? If the assertion in this patch does not fail, then we can conclude that there is no bug after all.

            I am asking to retest, because only some callers of lock_table_dequeue() invoke also lock_trx_table_locks_remove(). So, I cannot rule out a bug yet.

            marko Marko Mäkelä added a comment - This appears to be intentional, after all. While the trx_lock_t::table_locks is not empty, it is filled with NULL pointers. The test MDEV-20483-C.test does not fail with 10.3 2842c369851a8afc2a944ce6f4f60fa052f20969 and the following patch: diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index f612be4d177..59a27ee112e 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1363,6 +1363,13 @@ trx_commit_in_memory( DBUG_LOG("trx", "Autocommit in memory: " << trx); trx->state = TRX_STATE_NOT_STARTED; } else { +#ifdef UNIV_DEBUG + if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) { + for (const lock_t* l : trx->lock.table_locks) { + ut_ad(!l); + } + } +#endif trx->commit_state(); if (trx->id) { This patch checks that when trx_locks is empty and table_locks  is not, then table_locks must consist of only NULL pointers. The NULL pointers are being written at least by lock_trx_table_locks_remove() . mleich , can you please retest the latest 10.3 with the above patch applied? If the assertion in this patch does not fail, then we can conclude that there is no bug after all. I am asking to retest, because only some callers of lock_table_dequeue() invoke also lock_trx_table_locks_remove() . So, I cannot rule out a bug yet.

            The patch direct above avoids to hit the assertion Assertion `(trx)->lock.table_locks.empty()' failed
            but certain tests hit
                mysqld: storage/innobase/trx/trx0trx.cc:1369: void trx_commit_in_memory(trx_t*, const mtr_t*): Assertion `!l' failed.
             
            MTR Based replay test:
            --source include/have_innodb.inc
            --connect (con15,localhost,root,,)
            CREATE TABLE t1 ( col1 INT, pk INTEGER AUTO_INCREMENT, PRIMARY KEY (pk)) ENGINE = InnoDB  ;
            INSERT INTO t1 VALUES (NULL, NULL)  ;
             
            --connect (con16,localhost,root,,)
            --connection con16
            # Without AUTOCOMMIT = OFF no assert!
            SET AUTOCOMMIT = OFF  ;
            # With LOCK TABLES t1 WRITE (no CONCURRENT) no assert! Just the MDL timout kicks in.
            # LOCK TABLES t1 WRITE ;
              LOCK TABLES t1 WRITE CONCURRENT ;
             
            --connection con15
            SET SESSION lock_wait_timeout = 2;
            SET SESSION innodb_lock_wait_timeout = 1;
             
            # Variant 1 which replays
            # SELECT * FROM t1 INTO OUTFILE 'load_t1' ; LOAD DATA INFILE 'load_t1' REPLACE INTO TABLE t1 ;
             
            # Variant 2 which replays
              REPLACE INTO t1 (col1) SELECT col1 FROM (SELECT col1 FROM t1) AS load_tab ;
             
            DROP TABLE t1;
            

            mleich Matthias Leich added a comment - The patch direct above avoids to hit the assertion Assertion `(trx)->lock.table_locks.empty()' failed but certain tests hit mysqld: storage/innobase/trx/trx0trx.cc:1369: void trx_commit_in_memory(trx_t*, const mtr_t*): Assertion `!l' failed.   MTR Based replay test: --source include/have_innodb.inc --connect (con15,localhost,root,,) CREATE TABLE t1 ( col1 INT, pk INTEGER AUTO_INCREMENT, PRIMARY KEY (pk)) ENGINE = InnoDB ; INSERT INTO t1 VALUES (NULL, NULL) ;   --connect (con16,localhost,root,,) --connection con16 # Without AUTOCOMMIT = OFF no assert! SET AUTOCOMMIT = OFF ; # With LOCK TABLES t1 WRITE (no CONCURRENT) no assert! Just the MDL timout kicks in. # LOCK TABLES t1 WRITE ; LOCK TABLES t1 WRITE CONCURRENT ;   --connection con15 SET SESSION lock_wait_timeout = 2; SET SESSION innodb_lock_wait_timeout = 1;   # Variant 1 which replays # SELECT * FROM t1 INTO OUTFILE 'load_t1' ; LOAD DATA INFILE 'load_t1' REPLACE INTO TABLE t1 ;   # Variant 2 which replays REPLACE INTO t1 (col1) SELECT col1 FROM (SELECT col1 FROM t1) AS load_tab ;   DROP TABLE t1;

            The following patch removes the lock from table_locks vector. Previously lock_wait_timeout_thread doesn't remove
            the table lock from vector. The following patch address the code path.

            diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
            index b7e17097546..9fdc8feee90 100644
            --- a/storage/innobase/lock/lock0lock.cc
            +++ b/storage/innobase/lock/lock0lock.cc
            @@ -6480,6 +6480,8 @@ lock_cancel_waiting_and_release(
                            }
             
                            lock_table_dequeue(lock);
            +               /* Remove the lock from table lock vector too. */
            +               lock_trx_table_locks_remove(lock);
                    }
             
                    /* Reset the wait flag and the back pointer to lock in trx. */
            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 1dcca7c1f72..200335d605e 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1695,6 +1695,13 @@ trx_commit_in_memory(
                            DBUG_LOG("trx", "Autocommit in memory: " << trx);
                            trx->state = TRX_STATE_NOT_STARTED;
                    } else {
            +#ifdef UNIV_DEBUG
            +               if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) {
            +                       for (const lock_t* l : trx->lock.table_locks) {
            +                               ut_ad(!l);
            +                       }
            +               }
            +#endif
                            trx_mutex_enter(trx);
                            trx->commit_state();
                            trx_mutex_exit(trx);
            

            Above patch based on 10.2 (df4dee4b84ddc34799fa3a9648c142f12564597f) version.
            marko still have doubt that there is another place where the call is missing.
            So mleich, can you please run the rqg test with the above patch to find out any other issue ?

            thiru Thirunarayanan Balathandayuthapani added a comment - The following patch removes the lock from table_locks vector. Previously lock_wait_timeout_thread doesn't remove the table lock from vector. The following patch address the code path. diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index b7e17097546..9fdc8feee90 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -6480,6 +6480,8 @@ lock_cancel_waiting_and_release( } lock_table_dequeue(lock); + /* Remove the lock from table lock vector too. */ + lock_trx_table_locks_remove(lock); } /* Reset the wait flag and the back pointer to lock in trx. */ diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 1dcca7c1f72..200335d605e 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1695,6 +1695,13 @@ trx_commit_in_memory( DBUG_LOG("trx", "Autocommit in memory: " << trx); trx->state = TRX_STATE_NOT_STARTED; } else { +#ifdef UNIV_DEBUG + if (!UT_LIST_GET_LEN(trx->lock.trx_locks)) { + for (const lock_t* l : trx->lock.table_locks) { + ut_ad(!l); + } + } +#endif trx_mutex_enter(trx); trx->commit_state(); trx_mutex_exit(trx); Above patch based on 10.2 (df4dee4b84ddc34799fa3a9648c142f12564597f) version. marko still have doubt that there is another place where the call is missing. So mleich , can you please run the rqg test with the above patch to find out any other issue ?

            When testing 10.2 + the patch above with RQG I got the usual fraction of known+open bugs.

            mleich Matthias Leich added a comment - When testing 10.2 + the patch above with RQG I got the usual fraction of known+open bugs.

            Thanks, thiru! You forgot to remove the work-around from trx_t::release_locks() as part of your patch. Please remove that as well.

            Indeed, you added the only missing call. The one in lock_release() is intentionally missing, as the only caller lock_trx_release_locks() is invoking trx->lock.table_locks.clear().

            As far as I can tell, based on a comment in lock_table(), the primary motivation for table_locks to exist is that while trx_locks can be modified by other threads (which can be converting implicit locks to explicit ones), table_locks is only accessed by the thread where the transaction is running. Despite this, lock_table_create() is holding trx->mutex (and lock_sys->mutex) while appending to table_locks.

            Given that the only extra locks in table_locks were cancelled wait locks, and that the only non-debug read access of table_locks is lock_table_has(), which ignores lock waits, there does not seem to be any correctness impact of this bug.

            marko Marko Mäkelä added a comment - Thanks, thiru ! You forgot to remove the work-around from trx_t::release_locks() as part of your patch. Please remove that as well. Indeed, you added the only missing call. The one in lock_release() is intentionally missing, as the only caller lock_trx_release_locks() is invoking trx->lock.table_locks.clear() . As far as I can tell, based on a comment in lock_table() , the primary motivation for table_locks to exist is that while trx_locks  can be modified by other threads (which can be converting implicit locks to explicit ones), table_locks is only accessed by the thread where the transaction is running. Despite this, lock_table_create() is holding trx->mutex (and lock_sys->mutex ) while appending to table_locks . Given that the only extra locks in table_locks were cancelled wait locks, and that the only non-debug read access of table_locks is lock_table_has() , which ignores lock waits, there does not seem to be any correctness impact of this bug.

            People

              thiru Thirunarayanan Balathandayuthapani
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.