Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22733

XA PREPARE breaks MDL in pseudo_slave_mode=1

Details

    Description

      USE test;
      CREATE TABLE t(a INT);
      XA START '0';
      SET pseudo_slave_mode=1;
      INSERT INTO t VALUES(7050+0.75);
      XA PREPARE '0';
      XA END '0';
      XA PREPARE '0';
      TRUNCATE TABLE t;
      # Shutdown to observe hang (mysqladmin shutdown will hang)
      

      Will cause 10.5.4 8569dac1ec9f6853a0b2f3ea9bcbda67644ead24 (dbg+opt) to hang. Earlier releases not affected.

      Note the difference on the last command:

      10.5.4 8569dac1ec9f6853a0b2f3ea9bcbda67644ead24

      10.5.4>TRUNCATE TABLE t;
      Query OK, 0 rows affected (0.013 sec)
      

      10.4.14 ea7830eef48333e28f98a9b91f05a95735b465a3

      10.4.14>TRUNCATE TABLE t;
      ERROR 1399 (XAE07): XAER_RMFAIL: The command cannot be executed when global transaction is in the  PREPARED state
      

      Attachments

        Issue Links

          Activity

            As per discussion with Elena, likely not restricted to `SET pseudo_slave_mode=1;` only.

            Roel Roel Van de Paar added a comment - As per discussion with Elena, likely not restricted to `SET pseudo_slave_mode=1;` only.
            elenst Elena Stepanova added a comment - - edited

            wlad, marko, FYI. It seems to be the same as the one we discussed recently, given XA and row_drop_table_for_mysql_in_background upon PROCEDURE DELETE_FROM_***_STATS. Maybe either of you want to take over or anyhow participate.

            10.5 c2a92918, stack trace from Roel's test case

            #0  hash_get_nth_cell (table=0x55c66f975ca8, n=14926) at /data/src/10.5/storage/innobase/include/hash0hash.ic:38
            #1  0x000055c66b352604 in dict_table_check_if_in_cache_low (table_name=0x55c66ba1b5bb "mysql/innodb_table_stats") at /data/src/10.5/storage/innobase/include/dict0priv.ic:87
            #2  0x000055c66b352200 in dict_table_get_low (table_name=0x55c66ba1b5bb "mysql/innodb_table_stats") at /data/src/10.5/storage/innobase/include/dict0priv.ic:44
            #3  0x000055c66b35ffde in dict_table_schema_check (req_schema=0x7f4eefffdb40, errstr=0x7f4eefffdee0 "\200", errstr_sz=512) at /data/src/10.5/storage/innobase/dict/dict0dict.cc:4619
            #4  0x000055c66b3836c4 in dict_stats_persistent_storage_check (caller_has_dict_sys_mutex=true) at /data/src/10.5/storage/innobase/dict/dict0stats.cc:246
            #5  0x000055c66b3837fe in dict_stats_exec_sql (pinfo=0x7f4ee800b2d8, sql=0x55c66ba1cdf0 "PROCEDURE DELETE_FROM_INDEX_STATS () IS\nBEGIN\nDELETE FROM \"mysql/innodb_index_stats\" WHERE\ndatabase_name = :database_name AND\ntable_name = :table_name;\nEND;\n", trx=0x0) at /data/src/10.5/storage/innobase/dict/dict0stats.cc:289
            #6  0x000055c66b38b15f in dict_stats_delete_from_index_stats (database_name=0x7f4eefffe1b0 "test", table_name=0x7f4eefffe280 "#mysql50##sql-ib20") at /data/src/10.5/storage/innobase/dict/dict0stats.cc:3494
            #7  0x000055c66b38b272 in dict_stats_drop_table (db_and_table=0x7f4ee8000b88 "test/#sql-ib20", errstr=0x7f4eefffe760 "\200\347\377\357N\177", errstr_sz=1024) at /data/src/10.5/storage/innobase/dict/dict0stats.cc:3538
            #8  0x000055c66b1a9698 in row_drop_table_for_mysql (name=0x7f4ee8000b88 "test/#sql-ib20", trx=0x7f4f180db1e8, sqlcom=SQLCOM_TRUNCATE, create_failed=false, nonatomic=true) at /data/src/10.5/storage/innobase/row/row0mysql.cc:3427
            #9  0x000055c66b1a7527 in row_drop_table_for_mysql_in_background (name=0x7f4ee8000b88 "test/#sql-ib20") at /data/src/10.5/storage/innobase/row/row0mysql.cc:2608
            #10 0x000055c66b1a7823 in row_drop_tables_for_mysql_in_background () at /data/src/10.5/storage/innobase/row/row0mysql.cc:2677
            #11 0x000055c66b214def in srv_master_do_active_tasks () at /data/src/10.5/storage/innobase/srv/srv0srv.cc:1812
            #12 0x000055c66b21579f in srv_master_callback () at /data/src/10.5/storage/innobase/srv/srv0srv.cc:2000
            #13 0x000055c66b4467a4 in tpool::thread_pool_generic::timer_generic::run (this=0x55c66faa9d50) at /data/src/10.5/tpool/tpool_generic.cc:299
            #14 0x000055c66b446892 in tpool::thread_pool_generic::timer_generic::execute (arg=0x55c66faa9d50) at /data/src/10.5/tpool/tpool_generic.cc:317
            #15 0x000055c66b44b7c3 in tpool::task::execute (this=0x55c66faa9d90) at /data/src/10.5/tpool/task.cc:52
            #16 0x000055c66b444f94 in tpool::thread_pool_generic::worker_main (this=0x55c66edd7ec0, thread_var=0x55c66ede7a50) at /data/src/10.5/tpool/tpool_generic.cc:518
            #17 0x000055c66b44b207 in std::__invoke_impl<void, void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&) (__f=@0x7f4f14000aa8: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x55c66b444efc <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __t=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x8083363>, __args#0=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x808336f>) at /usr/include/c++/6/functional:227
            #18 0x000055c66b44b179 in std::__invoke<void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&) (__fn=@0x7f4f14000aa8: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x55c66b444efc <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __args#0=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x8083363>, __args#1=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x808336f>) at /usr/include/c++/6/functional:251
            #19 0x000055c66b44b10d in std::_Mem_fn_base<void (tpool::thread_pool_generic::*)(tpool::worker_data*), true>::operator()<tpool::thread_pool_generic*, tpool::worker_data*>(tpool::thread_pool_generic*&&, tpool::worker_data*&&) const (this=0x7f4f14000aa8, __args#0=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x8083363>, __args#1=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x808336f>) at /usr/include/c++/6/functional:604
            #20 0x000055c66b44b0bf in std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)>::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (this=0x7f4f14000a98) at /usr/include/c++/6/functional:1391
            #21 0x000055c66b44afbd in std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)>::operator()() (this=0x7f4f14000a98) at /usr/include/c++/6/functional:1380
            #22 0x000055c66b44af9c in std::thread::_State_impl<std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)> >::_M_run() (this=0x7f4f14000a90) at /usr/include/c++/6/thread:197
            #23 0x00007f4f19ecde6f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
            #24 0x00007f4f1b50e4a4 in start_thread (arg=0x7f4eeffff700) at pthread_create.c:456
            #25 0x00007f4f19642d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
            

            All threads are attached. threads1

            elenst Elena Stepanova added a comment - - edited wlad , marko , FYI. It seems to be the same as the one we discussed recently, given XA and row_drop_table_for_mysql_in_background upon PROCEDURE DELETE_FROM_***_STATS . Maybe either of you want to take over or anyhow participate. 10.5 c2a92918, stack trace from Roel's test case #0 hash_get_nth_cell (table=0x55c66f975ca8, n=14926) at /data/src/10.5/storage/innobase/include/hash0hash.ic:38 #1 0x000055c66b352604 in dict_table_check_if_in_cache_low (table_name=0x55c66ba1b5bb "mysql/innodb_table_stats") at /data/src/10.5/storage/innobase/include/dict0priv.ic:87 #2 0x000055c66b352200 in dict_table_get_low (table_name=0x55c66ba1b5bb "mysql/innodb_table_stats") at /data/src/10.5/storage/innobase/include/dict0priv.ic:44 #3 0x000055c66b35ffde in dict_table_schema_check (req_schema=0x7f4eefffdb40, errstr=0x7f4eefffdee0 "\200", errstr_sz=512) at /data/src/10.5/storage/innobase/dict/dict0dict.cc:4619 #4 0x000055c66b3836c4 in dict_stats_persistent_storage_check (caller_has_dict_sys_mutex=true) at /data/src/10.5/storage/innobase/dict/dict0stats.cc:246 #5 0x000055c66b3837fe in dict_stats_exec_sql (pinfo=0x7f4ee800b2d8, sql=0x55c66ba1cdf0 "PROCEDURE DELETE_FROM_INDEX_STATS () IS\nBEGIN\nDELETE FROM \"mysql/innodb_index_stats\" WHERE\ndatabase_name = :database_name AND\ntable_name = :table_name;\nEND;\n", trx=0x0) at /data/src/10.5/storage/innobase/dict/dict0stats.cc:289 #6 0x000055c66b38b15f in dict_stats_delete_from_index_stats (database_name=0x7f4eefffe1b0 "test", table_name=0x7f4eefffe280 "#mysql50##sql-ib20") at /data/src/10.5/storage/innobase/dict/dict0stats.cc:3494 #7 0x000055c66b38b272 in dict_stats_drop_table (db_and_table=0x7f4ee8000b88 "test/#sql-ib20", errstr=0x7f4eefffe760 "\200\347\377\357N\177", errstr_sz=1024) at /data/src/10.5/storage/innobase/dict/dict0stats.cc:3538 #8 0x000055c66b1a9698 in row_drop_table_for_mysql (name=0x7f4ee8000b88 "test/#sql-ib20", trx=0x7f4f180db1e8, sqlcom=SQLCOM_TRUNCATE, create_failed=false, nonatomic=true) at /data/src/10.5/storage/innobase/row/row0mysql.cc:3427 #9 0x000055c66b1a7527 in row_drop_table_for_mysql_in_background (name=0x7f4ee8000b88 "test/#sql-ib20") at /data/src/10.5/storage/innobase/row/row0mysql.cc:2608 #10 0x000055c66b1a7823 in row_drop_tables_for_mysql_in_background () at /data/src/10.5/storage/innobase/row/row0mysql.cc:2677 #11 0x000055c66b214def in srv_master_do_active_tasks () at /data/src/10.5/storage/innobase/srv/srv0srv.cc:1812 #12 0x000055c66b21579f in srv_master_callback () at /data/src/10.5/storage/innobase/srv/srv0srv.cc:2000 #13 0x000055c66b4467a4 in tpool::thread_pool_generic::timer_generic::run (this=0x55c66faa9d50) at /data/src/10.5/tpool/tpool_generic.cc:299 #14 0x000055c66b446892 in tpool::thread_pool_generic::timer_generic::execute (arg=0x55c66faa9d50) at /data/src/10.5/tpool/tpool_generic.cc:317 #15 0x000055c66b44b7c3 in tpool::task::execute (this=0x55c66faa9d90) at /data/src/10.5/tpool/task.cc:52 #16 0x000055c66b444f94 in tpool::thread_pool_generic::worker_main (this=0x55c66edd7ec0, thread_var=0x55c66ede7a50) at /data/src/10.5/tpool/tpool_generic.cc:518 #17 0x000055c66b44b207 in std::__invoke_impl<void, void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&) (__f=@0x7f4f14000aa8: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x55c66b444efc <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __t=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x8083363>, __args#0=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x808336f>) at /usr/include/c++/6/functional:227 #18 0x000055c66b44b179 in std::__invoke<void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(void (tpool::thread_pool_generic::* const&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&) (__fn=@0x7f4f14000aa8: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x55c66b444efc <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __args#0=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x8083363>, __args#1=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x808336f>) at /usr/include/c++/6/functional:251 #19 0x000055c66b44b10d in std::_Mem_fn_base<void (tpool::thread_pool_generic::*)(tpool::worker_data*), true>::operator()<tpool::thread_pool_generic*, tpool::worker_data*>(tpool::thread_pool_generic*&&, tpool::worker_data*&&) const (this=0x7f4f14000aa8, __args#0=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x8083363>, __args#1=<unknown type in /data/bld/10.5-debug-nightly/bin/mariadbd, CU 0x8071beb, DIE 0x808336f>) at /usr/include/c++/6/functional:604 #20 0x000055c66b44b0bf in std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)>::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (this=0x7f4f14000a98) at /usr/include/c++/6/functional:1391 #21 0x000055c66b44afbd in std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)>::operator()() (this=0x7f4f14000a98) at /usr/include/c++/6/functional:1380 #22 0x000055c66b44af9c in std::thread::_State_impl<std::_Bind_simple<std::_Mem_fn<void (tpool::thread_pool_generic::*)(tpool::worker_data*)> (tpool::thread_pool_generic*, tpool::worker_data*)> >::_M_run() (this=0x7f4f14000a90) at /usr/include/c++/6/thread:197 #23 0x00007f4f19ecde6f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #24 0x00007f4f1b50e4a4 in start_thread (arg=0x7f4eeffff700) at pthread_create.c:456 #25 0x00007f4f19642d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 All threads are attached. threads1
            wlad Vladislav Vaintroub added a comment - - edited

            srv_shutdown_bg_undo_sources spins waiting for some list's length to be 0. marko would know why.

            note that SHUTDOWN command needs to be invoked from a different connection that the one that was executing the script.

            wlad Vladislav Vaintroub added a comment - - edited srv_shutdown_bg_undo_sources spins waiting for some list's length to be 0. marko would know why. note that SHUTDOWN command needs to be invoked from a different connection that the one that was executing the script.
            Elkin Andrei Elkin added a comment -

            marko, (wlad): the bug looks to be at least 'relevant' to MDEV-21602. In the latter the table is allowed to be DROPed. This time it must have been drop-ed, just internally, and something like

            ...
            TRUNCATE TABLE t;
            --error 0 /* no timeout */
            DELETE FROM t;
            

            proves that to me.
            Notice as well, that

            --connection xa
            ... /* the description XA body  */
            XA PREPARE '0';
             
            --connection "truncate"
            --error 0 /* no timeout (!) */
            DELETE FROM t;
            

            a regular DML senses locks of the XA trx, unlike TRUCATE that must respect them too.
            I am reassigning it to Marko to make him to look and decide how we all proceed.

            Elkin Andrei Elkin added a comment - marko , ( wlad ): the bug looks to be at least 'relevant' to MDEV-21602 . In the latter the table is allowed to be DROPed. This time it must have been drop-ed, just internally, and something like ... TRUNCATE TABLE t; --error 0 /* no timeout */ DELETE FROM t; proves that to me. Notice as well, that --connection xa ... /* the description XA body */ XA PREPARE '0' ;   --connection "truncate" --error 0 /* no timeout (!) */ DELETE FROM t; a regular DML senses locks of the XA trx, unlike TRUCATE that must respect them too. I am reassigning it to Marko to make him to look and decide how we all proceed.

            I think that this must be fixed in 10.2. The call to delete InnoDB persistent statistics for a #sql-ib table became more of a problem with the MDEV-13564 TRUNCATE rewrite.

            marko Marko Mäkelä added a comment - I think that this must be fixed in 10.2. The call to delete InnoDB persistent statistics for a #sql-ib table became more of a problem with the MDEV-13564 TRUNCATE rewrite.
            Elkin Andrei Elkin added a comment - - edited

            To follow up, 10.4 does not have this issue thanks to MDL locks which now in 10.5 seem to disappear from the XA connection. We could try to restore them. If that's done the case
            would arrive at the stand very similar to MDEV-21602, with the difference that the latter sql-DROPs (so explicitly, in foreground) and that does not prevent shutdown.

            Elkin Andrei Elkin added a comment - - edited To follow up, 10.4 does not have this issue thanks to MDL locks which now in 10.5 seem to disappear from the XA connection. We could try to restore them. If that's done the case would arrive at the stand very similar to MDEV-21602 , with the difference that the latter sql-DROPs (so explicitly, in foreground) and that does not prevent shutdown.

            Because wlad pointed out something strange, I attempted the following fix. But, obviously it will not fix the hang:

            diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
            index 640db10907a..282e63fa0d2 100644
            --- a/storage/innobase/row/row0mysql.cc
            +++ b/storage/innobase/row/row0mysql.cc
            @@ -3419,15 +3419,15 @@ row_drop_table_for_mysql(
             			btr_defragment_remove_table(table);
             		}
             
            -		/* Remove stats for this table and all of its indexes from the
            -		persistent storage if it exists and if there are stats for this
            -		table in there. This function creates its own trx and commits
            -		it. */
            -		char	errstr[1024];
            -		err = dict_stats_drop_table(name, errstr, sizeof(errstr));
            -
            -		if (err != DB_SUCCESS) {
            -			ib::warn() << errstr;
            +		if (UNIV_LIKELY(!strstr(name, "/" TEMP_FILE_PREFIX_INNODB))) {
            +			/* Remove any persistent statistics for this table,
            +			in a separate transaction. */
            +			char errstr[1024];
            +			err = dict_stats_drop_table(name, errstr,
            +						    sizeof errstr);
            +			if (err != DB_SUCCESS) {
            +				ib::warn() << errstr;
            +			}
             		}
             	}
             
            

            This cleanup could be applied to 10.2 separately, after properly testing it of course. (If the workaround that is described in MDEV-21602 kicks in, we would have to ensure that the InnoDB persistent statistics will be dropped.)

            With the above change, I still observe a hang, because row_mysql_drop_list will contain one table, namely the one that is locked by the XA PREPARE transaction. Even after restart, we cannot drop the table.

            I believe that the root cause of this failure is that MDEV-742 is breaking locks. The MDL on the table should be held until XA ROLLBACK or XA COMMIT, and the TRUNCATE must not be allowed to proceed in the first place.

            I tested with the following:

            --source include/have_innodb.inc
            CREATE TABLE t(a INT) ENGINE=InnoDB;
            XA START '0';
            SET pseudo_slave_mode=1;
            INSERT INTO t VALUES(1);
            XA END '0';
            XA PREPARE '0';
            # This should fail with MDL timeout!
            TRUNCATE TABLE t;
            # Instead, we get a hang here
            --source include/restart_mysqld.inc
            DROP TABLE t;
            

            marko Marko Mäkelä added a comment - Because wlad pointed out something strange, I attempted the following fix. But, obviously it will not fix the hang: diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc index 640db10907a..282e63fa0d2 100644 --- a/storage/innobase/row/row0mysql.cc +++ b/storage/innobase/row/row0mysql.cc @@ -3419,15 +3419,15 @@ row_drop_table_for_mysql( btr_defragment_remove_table(table); } - /* Remove stats for this table and all of its indexes from the - persistent storage if it exists and if there are stats for this - table in there. This function creates its own trx and commits - it. */ - char errstr[1024]; - err = dict_stats_drop_table(name, errstr, sizeof(errstr)); - - if (err != DB_SUCCESS) { - ib::warn() << errstr; + if (UNIV_LIKELY(!strstr(name, "/" TEMP_FILE_PREFIX_INNODB))) { + /* Remove any persistent statistics for this table, + in a separate transaction. */ + char errstr[1024]; + err = dict_stats_drop_table(name, errstr, + sizeof errstr); + if (err != DB_SUCCESS) { + ib::warn() << errstr; + } } } This cleanup could be applied to 10.2 separately, after properly testing it of course. (If the workaround that is described in MDEV-21602 kicks in, we would have to ensure that the InnoDB persistent statistics will be dropped.) With the above change, I still observe a hang, because row_mysql_drop_list will contain one table, namely the one that is locked by the XA PREPARE transaction. Even after restart, we cannot drop the table. I believe that the root cause of this failure is that MDEV-742 is breaking locks. The MDL on the table should be held until XA ROLLBACK or XA COMMIT , and the TRUNCATE must not be allowed to proceed in the first place. I tested with the following: --source include/have_innodb.inc CREATE TABLE t(a INT ) ENGINE=InnoDB; XA START '0' ; SET pseudo_slave_mode=1; INSERT INTO t VALUES (1); XA END '0' ; XA PREPARE '0' ; # This should fail with MDL timeout! TRUNCATE TABLE t; # Instead , we get a hang here --source include/restart_mysqld.inc DROP TABLE t;
            Elkin Andrei Elkin added a comment - - edited

            marko: Stricktly speaking it's disconnected prepared transaction that this bug required. They were possible before MDEV-742 too. Yet it's still unclear to me why MDEV-21602 behavior of letting DROP to run through despite X locked records does not lead to any shutdown issue, but TRUNCATE does? Until we have not implemented MDL locks detached from THD, it may not be feasible to fully fix both bugs. But would it be feasible to refine the TRUNCATE handling to be friendly to shutdown as DROP's one is?

            Elkin Andrei Elkin added a comment - - edited marko : Stricktly speaking it's disconnected prepared transaction that this bug required. They were possible before MDEV-742 too. Yet it's still unclear to me why MDEV-21602 behavior of letting DROP to run through despite X locked records does not lead to any shutdown issue, but TRUNCATE does? Until we have not implemented MDL locks detached from THD, it may not be feasible to fully fix both bugs. But would it be feasible to refine the TRUNCATE handling to be friendly to shutdown as DROP's one is?
            Elkin Andrei Elkin added a comment -

            Upon further exchange with Marko on slack we agree he'd attempt to fix the shutdown hang.
            The big part of the bug - to allow a DDL to run through prepared XA - remains and duplicates MDEV-21602.

            Elkin Andrei Elkin added a comment - Upon further exchange with Marko on slack we agree he'd attempt to fix the shutdown hang. The big part of the bug - to allow a DDL to run through prepared XA - remains and duplicates MDEV-21602 .

            I think that this must have been broken for quite a while, in 10.2 already. I will have to revise the test to reproduce that.

            The hang can be fixed by introducing a new srv_shutdown_state value and setting it in srv_shutdown_bg_undo_sources(). However, a slow shutdown would still hang, because row_drop_table_for_mysql_in_background() fails to skip tables on which XA PREPARE transactions hold locks:

            --source include/have_innodb.inc
            CREATE TABLE t(a INT) ENGINE=InnoDB;
            XA START '0';
            SET pseudo_slave_mode=1;
            INSERT INTO t VALUES(1);
            XA END '0';
            XA PREPARE '0';
            --echo # MDEV-21602 FIXME: This should fail with MDL timeout!
            TRUNCATE TABLE t;
            --echo # FIXME: Slow shutdown would still hang
            SET GLOBAL innodb_fast_shutdown=0;
            --source include/restart_mysqld.inc
            XA ROLLBACK '0';
            DROP TABLE t;
            

            marko Marko Mäkelä added a comment - I think that this must have been broken for quite a while, in 10.2 already. I will have to revise the test to reproduce that. The hang can be fixed by introducing a new srv_shutdown_state value and setting it in srv_shutdown_bg_undo_sources() . However, a slow shutdown would still hang, because row_drop_table_for_mysql_in_background() fails to skip tables on which XA PREPARE transactions hold locks: --source include/have_innodb.inc CREATE TABLE t(a INT ) ENGINE=InnoDB; XA START '0' ; SET pseudo_slave_mode=1; INSERT INTO t VALUES (1); XA END '0' ; XA PREPARE '0' ; --echo # MDEV-21602 FIXME: This should fail with MDL timeout! TRUNCATE TABLE t; --echo # FIXME: Slow shutdown would still hang SET GLOBAL innodb_fast_shutdown=0; --source include/restart_mysqld.inc XA ROLLBACK '0' ; DROP TABLE t;

            The broken locking (failure to prevent the TRUNCATE statement from running) is a regression that was introduced in MDEV-742.

            The InnoDB hang on shutdown due to the existence of XA transactions preventing the background drop table queue from being emptied affects 10.2 and has been filed as MDEV-22769.

            marko Marko Mäkelä added a comment - The broken locking (failure to prevent the TRUNCATE statement from running) is a regression that was introduced in MDEV-742 . The InnoDB hang on shutdown due to the existence of XA transactions preventing the background drop table queue from being emptied affects 10.2 and has been filed as MDEV-22769 .
            Elkin Andrei Elkin added a comment -

            Well, MDEV-742 did not really create this issue, as unguarded by MDL locks user prepared xa:s had existed before it. It just made this issue much more to the surface.

            Elkin Andrei Elkin added a comment - Well, MDEV-742 did not really create this issue, as unguarded by MDL locks user prepared xa:s had existed before it. It just made this issue much more to the surface.

            MDEV-742 caused this issue. The regression in behaviour is mentioned in the description: The expected error

            ERROR 1399 (XAE07): XAER_RMFAIL: The command cannot be executed when global transaction is in the  PREPARED state
            

            is no longer being issued, because the MDL are being wrongly released.

            Independently of that, MDEV-22769 existed and has today been fixed it in 10.2, 10.3, 10.4, and 10.5. The symptoms of this bug have changed accordingly. There is no more hang on shutdown.

            marko Marko Mäkelä added a comment - MDEV-742 caused this issue. The regression in behaviour is mentioned in the description: The expected error ERROR 1399 (XAE07): XAER_RMFAIL: The command cannot be executed when global transaction is in the PREPARED state is no longer being issued, because the MDL are being wrongly released. Independently of that, MDEV-22769 existed and has today been fixed it in 10.2, 10.3, 10.4, and 10.5. The symptoms of this bug have changed accordingly. There is no more hang on shutdown.

            Confirmed the hang on shutdown is gone.

            The TRUNCATE succeeds.

            10.5.4 6877ef9a7c9c7ee55d67e4baaf4e8f7b874c9f89

            10.5.4>TRUNCATE TABLE t;
            Query OK, 0 rows affected (5.813 sec)
            #shutdown
            2020-06-06 14:26:54 0 [Note] InnoDB: Shutdown completed; log sequence number 50612; transaction id 521167
            

            Roel Roel Van de Paar added a comment - Confirmed the hang on shutdown is gone. The TRUNCATE succeeds. 10.5.4 6877ef9a7c9c7ee55d67e4baaf4e8f7b874c9f89 10.5.4>TRUNCATE TABLE t; Query OK, 0 rows affected (5.813 sec) #shutdown 2020-06-06 14:26:54 0 [Note] InnoDB: Shutdown completed; log sequence number 50612; transaction id 521167
            Elkin Andrei Elkin added a comment -

            Thanks, Roel! Then the remained issue - of TRUNCATE to run through - is duplicated by MDEV-21602.

            Elkin Andrei Elkin added a comment - Thanks, Roel ! Then the remained issue - of TRUNCATE to run through - is duplicated by MDEV-21602 .
            Elkin Andrei Elkin added a comment -

            MDEV-21602 hosts a non-hang issue of the two ones that this bug represents. The hang is fixed.

            Elkin Andrei Elkin added a comment - MDEV-21602 hosts a non-hang issue of the two ones that this bug represents. The hang is fixed.

            I believe that MDEV-742 introduced the regression that XA PREPARE is breaking metadata locks. This bug is not exactly a duplicate of MDEV-21602. It is more correct to say that this bug cannot be fixed before MDEV-21602 has been fixed.

            marko Marko Mäkelä added a comment - I believe that MDEV-742 introduced the regression that XA PREPARE is breaking metadata locks. This bug is not exactly a duplicate of MDEV-21602 . It is more correct to say that this bug cannot be fixed before MDEV-21602 has been fixed.

            After MDEV-21602 is fixed and the fix is merged up to 10.5, the original test case from this issue needs to be re-run on the presumably fixed branch to verify that the reported issue is gone. Or not.

            elenst Elena Stepanova added a comment - After MDEV-21602 is fixed and the fix is merged up to 10.5, the original test case from this issue needs to be re-run on the presumably fixed branch to verify that the reported issue is gone. Or not.

            Tested 10.4,10.5,10.6 and all results are the same, EXCEPT that there no longer is a hang. May indeed be good to re-check after MDEV-21602 as per elenst. Thank you

            Roel Roel Van de Paar added a comment - Tested 10.4,10.5,10.6 and all results are the same, EXCEPT that there no longer is a hang. May indeed be good to re-check after MDEV-21602 as per elenst . Thank you

            The MDEV-21602 fix cannot be trivially ported to 10.5. The following test seems to work correctly in 10.6:

            --source include/have_innodb.inc
            CREATE TABLE t(a INT) ENGINE=InnoDB;
            XA START '0';
            SET pseudo_slave_mode=1;
            INSERT INTO t VALUES(7050+0.75);
            XA END '0';
            XA PREPARE '0';
            TRUNCATE TABLE t;
            --source include/restart_mysqld.inc
            XA COMMIT '0';
            DROP TABLE t;
            

            10.6 ee39757f3c91e04a0ccbb5424fba7dd56167ad93

            mysqltest: At line 8: query 'TRUNCATE TABLE t' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction
            

            The lock wait looked like this:

            10.6 ee39757f3c91e04a0ccbb5424fba7dd56167ad93

            #3  0x000055ec3a54af50 in lock_wait (thr=thr@entry=0x7f478c0eab30) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1804
            #4  0x000055ec3a5a33b8 in row_mysql_handle_errors (new_err=new_err@entry=0x7f47c8eb49ec, trx=trx@entry=0x7f47ca16b330, thr=thr@entry=0x7f478c0eab30, savept=savept@entry=0x0) at /mariadb/10.6/storage/innobase/row/row0mysql.cc:686
            #5  0x000055ec3a54b950 in lock_table_for_trx (table=table@entry=0x7f478c0f2ea0, trx=trx@entry=0x7f47ca16b330, mode=mode@entry=LOCK_X) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:3616
            #6  0x000055ec3a5132ac in ha_innobase::truncate (this=0x7f478c0ea300) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13789
            

            The TRUNCATE cannot acquire an exclusive table lock, because the INSERT in the XA PREPARE transaction is holding a conflicting lock on the table (in IX mode).

            marko Marko Mäkelä added a comment - The MDEV-21602 fix cannot be trivially ported to 10.5. The following test seems to work correctly in 10.6: --source include/have_innodb.inc CREATE TABLE t(a INT ) ENGINE=InnoDB; XA START '0' ; SET pseudo_slave_mode=1; INSERT INTO t VALUES (7050+0.75); XA END '0' ; XA PREPARE '0' ; TRUNCATE TABLE t; --source include/restart_mysqld.inc XA COMMIT '0' ; DROP TABLE t; 10.6 ee39757f3c91e04a0ccbb5424fba7dd56167ad93 mysqltest: At line 8: query 'TRUNCATE TABLE t' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction The lock wait looked like this: 10.6 ee39757f3c91e04a0ccbb5424fba7dd56167ad93 #3 0x000055ec3a54af50 in lock_wait (thr=thr@entry=0x7f478c0eab30) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1804 #4 0x000055ec3a5a33b8 in row_mysql_handle_errors (new_err=new_err@entry=0x7f47c8eb49ec, trx=trx@entry=0x7f47ca16b330, thr=thr@entry=0x7f478c0eab30, savept=savept@entry=0x0) at /mariadb/10.6/storage/innobase/row/row0mysql.cc:686 #5 0x000055ec3a54b950 in lock_table_for_trx (table=table@entry=0x7f478c0f2ea0, trx=trx@entry=0x7f47ca16b330, mode=mode@entry=LOCK_X) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:3616 #6 0x000055ec3a5132ac in ha_innobase::truncate (this=0x7f478c0ea300) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:13789 The TRUNCATE cannot acquire an exclusive table lock, because the INSERT in the XA PREPARE transaction is holding a conflicting lock on the table (in IX mode).

            The MDL breakage (as claimed in the title of this ticket) is still affecting 10.6. The test case that I posted on 2020-06-04 would trigger an InnoDB table lock wait timeout, instead of MDL conflict:

            --source include/have_innodb.inc
            CREATE TABLE t(a INT) ENGINE=InnoDB;
            XA START '0';
            SET pseudo_slave_mode=1;
            INSERT INTO t VALUES(1);
            XA END '0';
            XA PREPARE '0';
            --echo # FIXME: This should fail with MDL timeout, not InnoDB table lock wait timeout!
            TRUNCATE TABLE t;
            

            It should be possible to flag the conflict instantly without any timeout, because the current thread should be holding MDL for the XA PREPARE transaction.

            For some reason, that MDL is apparently being released prematurely and the TRUNCATE is wrongly allowed to acquire MDL_EXCLUSIVE.

            A simple fix could be to return an error if XA PREPARE is followed by any other statement than XA ROLLBACK, XA COMMIT or some ‘harmless’ statement such as SHOW, which should not access any transactions. I do not see why we would ever have to allow DDL statements while an XA transaction is open.

            marko Marko Mäkelä added a comment - The MDL breakage (as claimed in the title of this ticket) is still affecting 10.6. The test case that I posted on 2020-06-04 would trigger an InnoDB table lock wait timeout, instead of MDL conflict: --source include/have_innodb.inc CREATE TABLE t(a INT ) ENGINE=InnoDB; XA START '0' ; SET pseudo_slave_mode=1; INSERT INTO t VALUES (1); XA END '0' ; XA PREPARE '0' ; --echo # FIXME: This should fail with MDL timeout, not InnoDB table lock wait timeout! TRUNCATE TABLE t; It should be possible to flag the conflict instantly without any timeout, because the current thread should be holding MDL for the XA PREPARE transaction. For some reason, that MDL is apparently being released prematurely and the TRUNCATE is wrongly allowed to acquire MDL_EXCLUSIVE . A simple fix could be to return an error if XA PREPARE is followed by any other statement than XA ROLLBACK , XA COMMIT or some ‘harmless’ statement such as SHOW , which should not access any transactions. I do not see why we would ever have to allow DDL statements while an XA transaction is open.
            Roel Roel Van de Paar added a comment - - edited

            When everything in the testcase is executed from one client thread, and then TRUNCATE TABLE t; is executed from a secondary thread, we get a lock timeout without any UBSAN/ASAN error being observed.

            ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
            

            This being in connection with the testing requested in MDEV-30941, here.

            Roel Roel Van de Paar added a comment - - edited When everything in the testcase is executed from one client thread, and then TRUNCATE TABLE t; is executed from a secondary thread, we get a lock timeout without any UBSAN/ASAN error being observed. ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction This being in connection with the testing requested in MDEV-30941 , here .

            People

              Elkin Andrei Elkin
              Roel Roel Van de Paar
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.