Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35647

Possible hang during CREATE TABLE…SELECT error handling

Details

    Description

      The test case is non-deterministic, run with --repeat=N. It usually fails for me within ~10 attempts on a plain debug build, but it can vary on different machines and builds. The test case is rr-able, although it may require some patience, hundreds of attempts may be needed (again, depending on the machine and build).
      I could only reproduce it on 11.7 and main.

      --source include/have_innodb.inc
       
      CREATE TABLE t1 (a INT) ENGINE=InnoDB;
       
      --source include/restart_mysqld.inc
      SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
       
      --connect (con1,localhost,root,,)
      CREATE TABLE t2 (b BLOB) ENGINE=InnoDB;
      INSERT INTO t2 VALUES (1),('2025-01-21 00:00:00');
      --send
        CREATE TABLE t3 ENGINE=InnoDB AS SELECT * FROM t1;
       
      --connection default
      --error ER_TRUNCATED_WRONG_VALUE
        CREATE TABLE t4 (b BLOB CHECK (b)) ENGINE=InnoDB AS SELECT b FROM t2;
      --connection con1
      --reap
       
      DROP TABLE t1, t2, t3;
      

      11.7 d60efa269e798537efa77b2b915288425f6e73f1

      mariadbd: /data/MDEV-35647/11.7/storage/innobase/sync/srw_lock.cc:696: void srw_lock_debug::rd_lock(): Assertion `!have_any()' failed.
       
      (rr) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #1  0x0000153cd6d71859 in __GI_abort () at abort.c:79
      #2  0x0000153cd6d71729 in __assert_fail_base (fmt=0x153cd6f07588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x564a3b7b6a34 "!have_any()", 
          file=0x564a3b7b67d0 "/data/MDEV-35647/11.7/storage/innobase/sync/srw_lock.cc", line=696, function=<optimized out>) at assert.c:92
      #3  0x0000153cd6d82fd6 in __GI___assert_fail (assertion=assertion@entry=0x564a3b7b6a34 "!have_any()", 
          file=file@entry=0x564a3b7b67d0 "/data/MDEV-35647/11.7/storage/innobase/sync/srw_lock.cc", line=line@entry=696, 
          function=function@entry=0x564a3b7b6bd8 "void srw_lock_debug::rd_lock()") at assert.c:101
      #4  0x0000564a3a9784fb in srw_lock_debug::rd_lock (this=0x564a3bf84140 <dict_sys+64>) at /data/MDEV-35647/11.7/storage/innobase/sync/srw_lock.cc:696
      #5  0x0000564a3b25f49c in dict_sys_t::freeze (this=0x564a3bf84100 <dict_sys>) at /data/MDEV-35647/11.7/storage/innobase/include/dict0dict.h:1476
      #6  dict_table_open_on_id (table_id=2, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=thd@entry=0x0, mdl=mdl@entry=0x0)
          at /data/MDEV-35647/11.7/storage/innobase/dict/dict0dict.cc:871
      #7  0x0000564a3b1cd7e5 in row_undo_mod_parse_undo_rec (dict_locked=false, node=0x153cb803fb40) at /data/MDEV-35647/11.7/storage/innobase/row/row0umod.cc:1210
      #8  row_undo_mod (node=node@entry=0x153cb803fb40, thr=thr@entry=0x153cb8038cb0) at /data/MDEV-35647/11.7/storage/innobase/row/row0umod.cc:1327
      #9  0x0000564a3b1cf47f in row_undo (thr=0x153cb8038cb0, node=0x153cb803fb40) at /data/MDEV-35647/11.7/storage/innobase/include/trx0undo.inl:82
      #10 row_undo_step (thr=thr@entry=0x153cb8038cb0) at /data/MDEV-35647/11.7/storage/innobase/row/row0undo.cc:442
      #11 0x0000564a3b171b90 in que_thr_step (thr=0x153cb8038cb0) at /data/MDEV-35647/11.7/storage/innobase/que/que0que.cc:552
      #12 que_run_threads_low (thr=0x153cb8038cb0) at /data/MDEV-35647/11.7/storage/innobase/que/que0que.cc:610
      #13 que_run_threads (thr=0x153cb8038cb0) at /data/MDEV-35647/11.7/storage/innobase/que/que0que.cc:630
      #14 0x0000564a3b1f3ff0 in trx_t::rollback_low (this=this@entry=0x153cd5739680, savept=savept@entry=0x0)
          at /data/MDEV-35647/11.7/storage/innobase/trx/trx0roll.cc:129
      #15 0x0000564a3b1f256c in trx_t::rollback (this=this@entry=0x153cd5739680, savept=savept@entry=0x0)
          at /data/MDEV-35647/11.7/storage/innobase/trx/trx0roll.cc:183
      #16 0x0000564a3b0cc9f4 in ha_innobase::delete_table (this=<optimized out>, name=<optimized out>)
          at /data/MDEV-35647/11.7/storage/innobase/handler/ha_innodb.cc:13644
      #17 0x0000564a3ae4725c in hton_drop_table (hton=0x564a3dcda578, path=<optimized out>) at /data/MDEV-35647/11.7/sql/handler.cc:578
      #18 0x0000564a3ae4de15 in ha_delete_table (thd=thd@entry=0x153cb8012418, hton=hton@entry=0x564a3dcda578, path=path@entry=0x153ccd2a5650 "./test/t4", 
          db=db@entry=0x153cb8024d08, alias=alias@entry=0x153cb8024d18, generate_warning=generate_warning@entry=false) at /data/MDEV-35647/11.7/sql/handler.cc:3303
      #19 0x0000564a3ac64576 in quick_rm_table (thd=0x153cb8012418, base=0x564a3dcda578, db=0x153cb8024d08, table_name=0x153cb8024d18, flags=flags@entry=20, 
          table_path=table_path@entry=0x0) at /data/MDEV-35647/11.7/sql/sql_table.cc:2152
      #20 0x0000564a3ab30948 in drop_open_table (thd=<optimized out>, table=<optimized out>, db_name=<optimized out>, table_name=<optimized out>)
          at /data/MDEV-35647/11.7/sql/sql_base.cc:1462
      #21 0x0000564a3ab68eee in select_create::abort_result_set (this=0x153cb8026f80) at /data/MDEV-35647/11.7/sql/sql_insert.cc:5411
      #22 select_create::abort_result_set (this=0x153cb8026f80) at /data/MDEV-35647/11.7/sql/sql_insert.cc:5346
      #23 0x0000564a3ac2f812 in handle_select (thd=thd@entry=0x153cb8012418, lex=lex@entry=0x153cb8016660, result=result@entry=0x153cb8026f80, 
          setup_tables_done_option=setup_tables_done_option@entry=0) at /data/MDEV-35647/11.7/sql/sql_select.cc:651
      #24 0x0000564a3ac71e8f in Sql_cmd_create_table_like::execute (this=<optimized out>, thd=0x153cb8012418) at /data/MDEV-35647/11.7/sql/sql_table.cc:13377
      #25 0x0000564a3abaff4f in mysql_execute_command (thd=0x153cb8012418, is_called_from_prepared_stmt=<optimized out>)
          at /data/MDEV-35647/11.7/sql/sql_parse.cc:5872
      #26 0x0000564a3ab9e784 in mysql_parse (thd=0x153cb8012418, rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>)
          at /data/MDEV-35647/11.7/sql/sql_parse.cc:7901
      #27 0x0000564a3abac3a5 in dispatch_command (command=COM_QUERY, thd=0x153cb8012418, packet=<optimized out>, packet_length=<optimized out>, 
          blocking=<optimized out>) at /data/MDEV-35647/11.7/sql/sql_class.h:1656
      #28 0x0000564a3abae84e in do_command (thd=thd@entry=0x153cb8012418, blocking=blocking@entry=true) at /data/MDEV-35647/11.7/sql/sql_parse.cc:1416
      #29 0x0000564a3ace4dad in do_handle_one_connection (connect=<optimized out>, put_in_cache=true) at /data/MDEV-35647/11.7/sql/sql_connect.cc:1415
      #30 0x0000564a3ace50cd in handle_one_connection (arg=0x564a3e0e6818) at /data/MDEV-35647/11.7/sql/sql_connect.cc:1327
      #31 0x0000153cd729d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #32 0x0000153cd6e6e353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      Attachments

        Issue Links

          Activity

            The patch looks functionally correct. I have one improvement suggestion. I think the root cause of the issue is setting dict_operation_lock_mode much earlier to releasing the dict sys lock.

            1. Move trx->dict_operation_lock_mode= false; down while unlocking the dictionary.

            2. Improve [1] by calling row_mysql_unlock_data_dictionary instead of dict_sys.unlock(); to unset dict_operation_lock_mode together.

            debarun Debarun Banerjee added a comment - The patch looks functionally correct. I have one improvement suggestion. I think the root cause of the issue is setting dict_operation_lock_mode much earlier to releasing the dict sys lock. 1. Move trx->dict_operation_lock_mode= false; down while unlocking the dictionary. 2. Improve [1] by calling row_mysql_unlock_data_dictionary instead of dict_sys.unlock(); to unset dict_operation_lock_mode together.

            In the fix of MDEV-25919, an assignment had been moved across a call to trx->rollback(), causing an incorrect value of dict_locked=false to be passed during the rollback. Restoring the order fixes this:

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 447d1533b6c..fe0c0ea177c 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -13798,8 +13798,8 @@ int ha_innobase::delete_table(const char *name)
               if (err != DB_SUCCESS)
               {
             err_exit:
            -    trx->dict_operation_lock_mode= false;
                 trx->rollback();
            +    trx->dict_operation_lock_mode= false;
                 switch (err) {
                 case DB_CANNOT_DROP_CONSTRAINT:
                 case DB_LOCK_WAIT_TIMEOUT:
            

            marko Marko Mäkelä added a comment - In the fix of MDEV-25919 , an assignment had been moved across a call to trx->rollback() , causing an incorrect value of dict_locked=false to be passed during the rollback. Restoring the order fixes this: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 447d1533b6c..fe0c0ea177c 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -13798,8 +13798,8 @@ int ha_innobase::delete_table(const char *name) if (err != DB_SUCCESS) { err_exit: - trx->dict_operation_lock_mode= false; trx->rollback(); + trx->dict_operation_lock_mode= false; switch (err) { case DB_CANNOT_DROP_CONSTRAINT: case DB_LOCK_WAIT_TIMEOUT:

            I can reproduce this on 10.6 when using innodb_snapshot_isolation=ON.

            The problem seems to be that the CREATE TABLE t4…SELECT needs to be rolled back because of a fatal error (in this case, ER_CHECKREAD, but I think that also a lock wait timeout would be applicable).

            The bug is in the error handling. The transaction rollback assumes that the caller is not holding dict_sys.latch in any mode.

            We must ensure that no DDL operation in 10.6 may hold dict_sys.latch while executing transaction rollback. CREATE…SELECT is quite different from everything else, and the logic was rewritten in 10.6 in order to achieve Atomic DDL.

            marko Marko Mäkelä added a comment - I can reproduce this on 10.6 when using innodb_snapshot_isolation=ON . The problem seems to be that the CREATE TABLE t4…SELECT needs to be rolled back because of a fatal error (in this case, ER_CHECKREAD , but I think that also a lock wait timeout would be applicable). The bug is in the error handling. The transaction rollback assumes that the caller is not holding dict_sys.latch in any mode. We must ensure that no DDL operation in 10.6 may hold dict_sys.latch while executing transaction rollback. CREATE…SELECT is quite different from everything else, and the logic was rewritten in 10.6 in order to achieve Atomic DDL.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.