Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24811

Assertion `find(table)' failed in dict_sys_t::remove upon concurrent COMMIT with innodb_evict_tables_on_commit_debug

Details

    Description

      The test case is for reproducing purposes only, don't put it to the regression suite. The loop is inside the test, so no need to run with --repeat. It currently fails very quickly for me on different machines and versions, but I guess it can still vary depending on the environment. Adjust the $run variable if necessary.

      --source include/have_innodb.inc
       
      CREATE TABLE t1 (pk INT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;
      SET GLOBAL innodb_evict_tables_on_commit_debug= ON;
       
      --connect (con1,localhost,root,,test)
      --connect (con2,localhost,root,,test)
       
      --let $run= 1000
       
      while ($run)
      {
        --connection con1
        START TRANSACTION;
        INSERT INTO t1 () VALUES ();
        --connection con2
        START TRANSACTION;
        INSERT INTO t1 () VALUES ();
        --connection default
        FLUSH TABLES;
        --connection con2
        --send
          COMMIT;
        --connection con1
        COMMIT;
        --connection con2
        --reap
        --dec $run
      }
       
      --disconnect con1
      --disconnect con2
      --connection default
      DROP TABLE t1;
      SET GLOBAL innodb_evict_tables_on_commit_debug= DEFAULT;
      

      10.5 eacefbca

      mariadbd: /data/src/10.5/storage/innobase/dict/dict0dict.cc:1920: void dict_sys_t::remove(dict_table_t*, bool, bool): Assertion `find(table)' failed.
      210209  1:44:52 [ERROR] mysqld got signal 6 ;
       
      #7  0x00007fbc6e9faf36 in __GI___assert_fail (assertion=0x56495b1386d1 "find(table)", file=0x56495b139990 "/data/src/10.5/storage/innobase/dict/dict0dict.cc", line=1920, function=0x56495b13a5d0 "void dict_sys_t::remove(dict_table_t*, bool, bool)") at assert.c:101
      #8  0x000056495aa98594 in dict_sys_t::remove (this=0x56495b7233e0 <dict_sys>, table=0x7fbc5400ed08, lru=true, keep=false) at /data/src/10.5/storage/innobase/dict/dict0dict.cc:1920
      #9  0x000056495a9c527a in trx_update_mod_tables_timestamp (trx=0x7fbc6904e360) at /data/src/10.5/storage/innobase/trx/trx0trx.cc:1277
      #10 0x000056495a9ca477 in trx_t::commit_in_memory (this=0x7fbc6904e360, mtr=0x7fbc63d1bc70) at /data/src/10.5/storage/innobase/trx/trx0trx.cc:1383
      #11 0x000056495a9c5d21 in trx_t::commit_low (this=0x7fbc6904e360, mtr=0x7fbc63d1bc70) at /data/src/10.5/storage/innobase/trx/trx0trx.cc:1550
      #12 0x000056495a9c5db3 in trx_t::commit (this=0x7fbc6904e360) at /data/src/10.5/storage/innobase/trx/trx0trx.cc:1564
      #13 0x000056495a9c62e5 in trx_commit_for_mysql (trx=0x7fbc6904e360) at /data/src/10.5/storage/innobase/trx/trx0trx.cc:1696
      #14 0x000056495a739b71 in innobase_commit_low (trx=0x7fbc6904e360) at /data/src/10.5/storage/innobase/handler/ha_innodb.cc:4035
      #15 0x000056495a739eaf in innobase_commit_ordered_2 (trx=0x7fbc6904e360, thd=0x7fbc30000db8) at /data/src/10.5/storage/innobase/handler/ha_innodb.cc:4141
      #16 0x000056495a73a42a in innobase_commit (hton=0x56495c8dfe18, thd=0x7fbc30000db8, commit_trx=true) at /data/src/10.5/storage/innobase/handler/ha_innodb.cc:4245
      #17 0x000056495a2c351a in commit_one_phase_2 (thd=0x7fbc30000db8, all=true, trans=0x7fbc30004530, is_real_trans=true) at /data/src/10.5/sql/handler.cc:1941
      #18 0x000056495a2c3409 in ha_commit_one_phase (thd=0x7fbc30000db8, all=true) at /data/src/10.5/sql/handler.cc:1920
      #19 0x000056495a2c2545 in ha_commit_trans (thd=0x7fbc30000db8, all=true) at /data/src/10.5/sql/handler.cc:1714
      #20 0x000056495a117d0c in trans_commit (thd=0x7fbc30000db8) at /data/src/10.5/sql/transaction.cc:266
      #21 0x0000564959f5c966 in mysql_execute_command (thd=0x7fbc30000db8) at /data/src/10.5/sql/sql_parse.cc:5650
      #22 0x0000564959f642f2 in mysql_parse (thd=0x7fbc30000db8, rawbuf=0x7fbc30012ae0 "COMMIT", length=6, parser_state=0x7fbc63d1d510, is_com_multi=false, is_next_command=false) at /data/src/10.5/sql/sql_parse.cc:8062
      #23 0x0000564959f50275 in dispatch_command (command=COM_QUERY, thd=0x7fbc30000db8, packet=0x7fbc300090b9 "COMMIT", packet_length=6, is_com_multi=false, is_next_command=false) at /data/src/10.5/sql/sql_parse.cc:1889
      #24 0x0000564959f4ea69 in do_command (thd=0x7fbc30000db8) at /data/src/10.5/sql/sql_parse.cc:1370
      #25 0x000056495a0fc8eb in do_handle_one_connection (connect=0x56495ce26578, put_in_cache=true) at /data/src/10.5/sql/sql_connect.cc:1410
      #26 0x000056495a0fc64e in handle_one_connection (arg=0x56495ce26578) at /data/src/10.5/sql/sql_connect.cc:1312
      #27 0x000056495a65c62b in pfs_spawn_thread (arg=0x56495ce26658) at /data/src/10.5/storage/perfschema/pfs.cc:2201
      #28 0x00007fbc6ef12609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #29 0x00007fbc6eae6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      Reproducible on 10.4, 10.5.
      10.6 also fails, but it needs an extra INSERT after CREATE TABLE, otherwise con2 hangs, apparently as an effect of MDEV-515 changes.
      The test case is not applicable to non-debug builds due to the use of the debug variable.
      Does not fail for me with rr, ever.

      10.3 fails on a different assertion instead:

      10.3 59eda73e

      mysqld: /data/src/10.3/storage/innobase/include/ut0rnd.ic:74: ulint ut_fold_string(const char*): Assertion `str' failed.
      210209  1:54:05 [ERROR] mysqld got signal 6 ;
       
      #7  0x00007f527c812f36 in __GI___assert_fail (assertion=0x563bf09fd782 "str", file=0x563bf09fd728 "/data/src/10.3/storage/innobase/include/ut0rnd.ic", line=74, function=0x563bf09fd788 "ulint ut_fold_string(const char*)") at assert.c:101
      #8  0x0000563bf039373b in ut_fold_string (str=0x0) at /data/src/10.3/storage/innobase/include/ut0rnd.ic:74
      #9  0x0000563bf039f5bf in dict_table_remove_from_cache_low (table=0x7f5244012420, lru_evict=1) at /data/src/10.3/storage/innobase/dict/dict0dict.cc:1873
      #10 0x0000563bf02b2e25 in trx_update_mod_tables_timestamp (trx=0x7f52764d0188) at /data/src/10.3/storage/innobase/trx/trx0trx.cc:1321
      #11 0x0000563bf02b35bd in trx_commit_in_memory (trx=0x7f52764d0188, mtr=0x7f5275981cc0) at /data/src/10.3/storage/innobase/trx/trx0trx.cc:1406
      #12 0x0000563bf02b4aa3 in trx_commit_low (trx=0x7f52764d0188, mtr=0x7f5275981cc0) at /data/src/10.3/storage/innobase/trx/trx0trx.cc:1601
      #13 0x0000563bf02b4b5f in trx_commit (trx=0x7f52764d0188) at /data/src/10.3/storage/innobase/trx/trx0trx.cc:1625
      #14 0x0000563bf02b5045 in trx_commit_for_mysql (trx=0x7f52764d0188) at /data/src/10.3/storage/innobase/trx/trx0trx.cc:1761
      #15 0x0000563bf00313df in innobase_commit_low (trx=0x7f52764d0188) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:4342
      #16 0x0000563bf00317b8 in innobase_commit_ordered_2 (trx=0x7f52764d0188, thd=0x7f5224000d90) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:4463
      #17 0x0000563bf0031df7 in innobase_commit (hton=0x563bf38a0250, thd=0x7f5224000d90, commit_trx=true) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:4578
      #18 0x0000563befdfd7d2 in commit_one_phase_2 (thd=0x7f5224000d90, all=true, trans=0x7f52240042a8, is_real_trans=true) at /data/src/10.3/sql/handler.cc:1642
      #19 0x0000563befdfd6c1 in ha_commit_one_phase (thd=0x7f5224000d90, all=true) at /data/src/10.3/sql/handler.cc:1622
      #20 0x0000563befdfcd5a in ha_commit_trans (thd=0x7f5224000d90, all=true) at /data/src/10.3/sql/handler.cc:1484
      #21 0x0000563befc5f40d in trans_commit (thd=0x7f5224000d90) at /data/src/10.3/sql/transaction.cc:292
      #22 0x0000563befad9639 in mysql_execute_command (thd=0x7f5224000d90) at /data/src/10.3/sql/sql_parse.cc:5553
      #23 0x0000563befae0b88 in mysql_parse (thd=0x7f5224000d90, rawbuf=0x7f52240116f8 "COMMIT", length=6, parser_state=0x7f52759835c0, is_com_multi=false, is_next_command=false) at /data/src/10.3/sql/sql_parse.cc:7840
      #24 0x0000563befacd2ec in dispatch_command (command=COM_QUERY, thd=0x7f5224000d90, packet=0x7f5224008f11 "COMMIT", packet_length=6, is_com_multi=false, is_next_command=false) at /data/src/10.3/sql/sql_parse.cc:1852
      #25 0x0000563befacbc8c in do_command (thd=0x7f5224000d90) at /data/src/10.3/sql/sql_parse.cc:1398
      #26 0x0000563befc4a295 in do_handle_one_connection (connect=0x563bf3bafff0) at /data/src/10.3/sql/sql_connect.cc:1403
      #27 0x0000563befc49ff1 in handle_one_connection (arg=0x563bf3bafff0) at /data/src/10.3/sql/sql_connect.cc:1308
      #28 0x0000563bf0616049 in pfs_spawn_thread (arg=0x563bf3bf1280) at /data/src/10.3/storage/perfschema/pfs.cc:1869
      #29 0x00007f527cd22609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #30 0x00007f527c8fe293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      This second assertion failure is mentioned in MDEV-20810, but I'm not sure whether it's the same problem – MDEV-20810 seems to have something to do with virtual columns, and this one doesn't; anyway, the suggested patch in MDEV-20810 doesn't do anything for the test case here – it doesn't make it fail without the debug variable.

      Attachments

        Issue Links

          Activity

            On 10.6, the test requires a little tweak to work around MDEV-24818:

            --connect (con1,localhost,root,,test)
            SET DEBUG_DBUG='+d,row_ins_row_level';
            --connect (con2,localhost,root,,test)
            SET DEBUG_DBUG='+d,row_ins_row_level';
            

            The crash repeats easily without rr record. With it, it takes more time but is repeatable for me.

            marko Marko Mäkelä added a comment - On 10.6, the test requires a little tweak to work around MDEV-24818 : --connect (con1,localhost,root,,test) SET DEBUG_DBUG= '+d,row_ins_row_level' ; --connect (con2,localhost,root,,test) SET DEBUG_DBUG= '+d,row_ins_row_level' ; The crash repeats easily without rr record . With it, it takes more time but is repeatable for me.

            This looks like a race condition in innodb_evict_tables_on_commit_debug. The first eviction happened here:

            10.6

            #0  dict_sys_t::remove (this=0x55be4ec4fc40 <dict_sys>, table=0x7f30fc027f98, 
                lru=true, keep=false)
                at /mariadb/10.6m/storage/innobase/dict/dict0dict.cc:1968
            #1  0x000055be4e11d373 in trx_t::commit_tables (this=this@entry=0x67f52cb8d3d8)
                at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1266
            #2  0x000055be4e11c1c8 in trx_t::commit_in_memory (
                this=this@entry=0x67f52cb8d3d8, mtr=mtr@entry=0x5b311bfb21b8)
                at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1369
            #3  0x000055be4e116f0c in trx_t::commit_low (this=this@entry=0x67f52cb8d3d8, 
                mtr=mtr@entry=0x5b311bfb21b8)
                at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1538
            #4  0x000055be4e117a9a in trx_t::commit (this=this@entry=0x67f52cb8d3d8)
                at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1552
            

            At this point of time, the thread where the assertion find(table) would fail would be executing an earlier step in a transaction commit:

            #5  0x000055be4d989223 in debug_sync (thd=0x64001b48, sync_point_name=0x55be4e62d98f "after_trx_committed_in_memory", name_len=29) at /mariadb/10.6m/sql/debug_sync.cc:1566
            #6  0x000055be4e11c169 in trx_t::commit_in_memory (this=this@entry=0x67f52cb8c2a0, mtr=mtr@entry=0x491834d211b8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1361
            #7  0x000055be4e116f0c in trx_t::commit_low (this=this@entry=0x67f52cb8c2a0, mtr=mtr@entry=0x491834d211b8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1538
            #8  0x000055be4e117a9a in trx_t::commit (this=this@entry=0x67f52cb8c2a0) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1552
            

            In 10.6, this code is somewhat different from earlier versions, because MDEV-20612 moved the query cache invalidation from lock_release() to trx_t::commit_tables().

            All other calls to dict_sys.remove() are from DDL, except this one and the call in dict_make_room_in_cache(). The check in dict_table_can_be_evicted() seems to be vulnerable to this same race condition: at the point of the above execution, the reference count of the table will be 0 and there will be no locks on the table, because lock_release() had already completed execution.

            It seems that we must process the trx_t::mod_tables before releasing the locks on the tables.

            marko Marko Mäkelä added a comment - This looks like a race condition in innodb_evict_tables_on_commit_debug . The first eviction happened here: 10.6 #0 dict_sys_t::remove (this=0x55be4ec4fc40 <dict_sys>, table=0x7f30fc027f98, lru=true, keep=false) at /mariadb/10.6m/storage/innobase/dict/dict0dict.cc:1968 #1 0x000055be4e11d373 in trx_t::commit_tables (this=this@entry=0x67f52cb8d3d8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1266 #2 0x000055be4e11c1c8 in trx_t::commit_in_memory ( this=this@entry=0x67f52cb8d3d8, mtr=mtr@entry=0x5b311bfb21b8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1369 #3 0x000055be4e116f0c in trx_t::commit_low (this=this@entry=0x67f52cb8d3d8, mtr=mtr@entry=0x5b311bfb21b8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1538 #4 0x000055be4e117a9a in trx_t::commit (this=this@entry=0x67f52cb8d3d8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1552 At this point of time, the thread where the assertion find(table) would fail would be executing an earlier step in a transaction commit: #5 0x000055be4d989223 in debug_sync (thd=0x64001b48, sync_point_name=0x55be4e62d98f "after_trx_committed_in_memory", name_len=29) at /mariadb/10.6m/sql/debug_sync.cc:1566 #6 0x000055be4e11c169 in trx_t::commit_in_memory (this=this@entry=0x67f52cb8c2a0, mtr=mtr@entry=0x491834d211b8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1361 #7 0x000055be4e116f0c in trx_t::commit_low (this=this@entry=0x67f52cb8c2a0, mtr=mtr@entry=0x491834d211b8) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1538 #8 0x000055be4e117a9a in trx_t::commit (this=this@entry=0x67f52cb8c2a0) at /mariadb/10.6m/storage/innobase/trx/trx0trx.cc:1552 In 10.6, this code is somewhat different from earlier versions, because MDEV-20612 moved the query cache invalidation from lock_release() to trx_t::commit_tables() . All other calls to dict_sys.remove() are from DDL, except this one and the call in dict_make_room_in_cache() . The check in dict_table_can_be_evicted() seems to be vulnerable to this same race condition: at the point of the above execution, the reference count of the table will be 0 and there will be no locks on the table, because lock_release() had already completed execution. It seems that we must process the trx_t::mod_tables before releasing the locks on the tables.

            While the parameter innodb_evict_tables_on_commit_debug exists in debug builds for debugging purposes only, there turns out to be a real bug that MDEV-14477 introduced in MariaDB Server 10.3.3.

            Before that change, all tables in trx_t::mod_tables were protected against eviction by the locks that the transaction is holding. Because MDEV-14477 would only invoke trx_update_mod_tables_timestamp() after lock_release(), it is possible that some tables are evicted before trx_update_mod_tables_timestamp() is being executed.

            The fix is to process the trx_t::mod_tables before lock_release(). The logic of innodb_evict_tables_on_commit_debug must be moved to lock_release(): First, collect the table IDs of tables that might be evicted, and at the end, look up the tables (if they exist) and try to evict them if they are eligible.

            marko Marko Mäkelä added a comment - While the parameter innodb_evict_tables_on_commit_debug exists in debug builds for debugging purposes only, there turns out to be a real bug that MDEV-14477 introduced in MariaDB Server 10.3.3. Before that change, all tables in trx_t::mod_tables were protected against eviction by the locks that the transaction is holding. Because MDEV-14477 would only invoke trx_update_mod_tables_timestamp() after lock_release() , it is possible that some tables are evicted before trx_update_mod_tables_timestamp() is being executed. The fix is to process the trx_t::mod_tables before lock_release() . The logic of innodb_evict_tables_on_commit_debug must be moved to lock_release() : First, collect the table IDs of tables that might be evicted, and at the end, look up the tables (if they exist) and try to evict them if they are eligible.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.