MariaDB Server / MDEV-28776

rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master

Details

    Description

      https://buildbot.mariadb.org/#/builders/192/builds/10092

      The test occasionally fails with (10.5 22f935d6d):

      rpl.rpl_mark_optimize_tbl_ddl 'innodb,mix' w6 [ fail ]
              Test ended at 2022-06-02 22:23:29
       
      CURRENT_TEST: rpl.rpl_mark_optimize_tbl_ddl
      analyze: sync_with_master
      mysqltest: At line 46: sync_with_master failed: 'select master_pos_wait('master-bin.000001', 1388, 300, '')' returned -1 indicating timeout after 300 seconds
       
      The result from queries just before the failure was:
      < snip >
      connection server_1;
      FLUSH TABLES;
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
      connection server_2;
      SET @save_slave_parallel_threads= @@GLOBAL.slave_parallel_threads;
      SET @save_slave_parallel_mode= @@GLOBAL.slave_parallel_mode;
      include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=2;
      SET GLOBAL slave_parallel_mode=optimistic;
      include/start_slave.inc
      connection server_1;
      CREATE TABLE t1(a INT) ENGINE=INNODB;
      OPTIMIZE TABLE t1;
      Table	Op	Msg_type	Msg_text
      test.t1	optimize	note	Table does not support optimize, doing recreate + analyze instead
      test.t1	optimize	status	OK
      INSERT INTO t1 VALUES(1);
      INSERT INTO t1  SELECT 1+a FROM t1;
      INSERT INTO t1  SELECT 2+a FROM t1;
      connection server_2;
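
      For context, mysqltest's sync_with_master is essentially a wrapper around the MASTER_POS_WAIT() function shown in the error. A minimal SQL sketch of the failing step, using the position and the default 300-second timeout taken from the error message above:

      -- Run on the slave: wait until the SQL thread has applied events up to the
      -- given master binlog position, or give up after 300 seconds.
      SELECT MASTER_POS_WAIT('master-bin.000001', 1388, 300, '');
      -- Returns the number of events waited for on success, NULL if the slave is
      -- not running (or the arguments are invalid), and -1 on timeout; the -1 is
      -- what mysqltest reports as "sync_with_master failed" here.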
      

      Attachments

        mdev28776_reproduce.patch

          Activity

            angelique.sklavounos Angelique Sklavounos (Inactive) created issue -
            angelique.sklavounos Angelique Sklavounos (Inactive) made changes -
            Component/s: Replication [ 10100 ]
            Component/s: Tests [ 10800 ]
            angelique.sklavounos Angelique Sklavounos (Inactive) made changes -
            Description: edited to add the buildbot link (https://buildbot.mariadb.org/#/builders/192/builds/10092); the quoted failure output is unchanged from the Description above.
            serg Sergei Golubchik made changes -
            Priority: Major [ 3 ] -> Critical [ 2 ]
            serg Sergei Golubchik made changes -
            Assignee Andrei Elkin [ elkin ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]

            serg Sergei Golubchik added a comment -

            fails quite often in 10.10
            Elkin Andrei Elkin made changes -
            Assignee: Andrei Elkin [ elkin ] -> Brandon Nesterenko [ JIRAUSER48702 ]
            bnestere Brandon Nesterenko added a comment - - edited

            Initial problem analysis shows this to be a deadlock on the `gtid_slave_pos` table. When the IO thread starts up, `rpl_global_gtid_slave_state->load()` (called from `get_master_version_and_clock()`) first takes an MDL lock on the table and then tries to take the table locks. When an SQL executor thread then executes `ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB`, it first takes the table lock and then tries to take the MDL lock. If the IO thread holds the MDL lock and one of the SQL executor threads holds the table lock, the result is a deadlock and the test times out. This looks more prevalent in 10.10 because the online ALTER work may have widened the deadlock window.

            For reference, the relevant thread backtraces are pasted below:

            Thread 12 (LWP 561403 "mariadbd"):
            #0  __futex_abstimed_wait_common64 (cancel=true, private=7, abstime=0x7f0dad6cbe30, clockid=21928, expected=0, futex_word=0x7f0d740284a0) at ../sysdeps/nptl/futex-internal.c:74
            #1  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0d740284a0, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7f0dad6cbe30, private=private@entry=0) at ../sysdeps/nptl/futex-internal.c:123
            #2  0x00007f0dbba6299e in __pthread_cond_wait_common (abstime=0x7f0dad6cbe30, clockid=0, mutex=0x7f0d740176e0, cond=0x7f0d74028478) at pthread_cond_wait.c:504
            #3  __pthread_cond_timedwait (cond=0x7f0d74028478, mutex=0x7f0d740176e0, abstime=0x7f0dad6cbe30) at pthread_cond_wait.c:646
            #4  0x000055a887a93851 in safe_cond_timedwait (cond=0x7f0d74028478, mp=0x7f0d740176b8, abstime=0x7f0dad6cbe30, file=0x55a888025600 "/home/brandon/workspace/server/mysys/my_thr_init.c", line=609) at /home/brandon/workspace/server/mysys/thr_mutex.c:548
            #5  0x000055a887a8c7c5 in psi_cond_timedwait (that=0x7f0d74028478, mutex=0x7f0d740176b8, abstime=0x7f0dad6cbe30, file=0x55a888025c58 "/home/brandon/workspace/server/mysys/thr_lock.c", line=558) at /home/brandon/workspace/server/mysys/my_thr_init.c:609
            #6  0x000055a887a8fe55 in inline_mysql_cond_timedwait (that=0x7f0d74028478, mutex=0x7f0d740176b8, abstime=0x7f0dad6cbe30, src_file=0x55a888025c58 "/home/brandon/workspace/server/mysys/thr_lock.c", src_line=558) at /home/brandon/workspace/server/include/mysql/psi/mysql_thread.h:1086
            #7  0x000055a887a904ea in wait_for_lock (wait=0x7f0d74017788, data=0x7f0d7400ca88, in_wait_list=0 '\000', lock_wait_timeout=31536000) at /home/brandon/workspace/server/mysys/thr_lock.c:558
            #8  0x000055a887a90d95 in thr_lock (data=0x7f0d7400ca88, owner=0x7f0d74020558, lock_wait_timeout=31536000) at /home/brandon/workspace/server/mysys/thr_lock.c:890
            #9  0x000055a887a91698 in thr_multi_lock (data=0x7f0d7401b8c0, count=1, owner=0x7f0d74020558, lock_wait_timeout=31536000) at /home/brandon/workspace/server/mysys/thr_lock.c:1171
            #10 0x000055a887360f84 in mysql_lock_tables (thd=0x7f0d7401e7d8, sql_lock=0x7f0d7401b898, flags=0) at /home/brandon/workspace/server/sql/lock.cc:349
            #11 0x000055a887360d82 in mysql_lock_tables (thd=0x7f0d7401e7d8, tables=0x7f0d7401b890, count=1, flags=0) at /home/brandon/workspace/server/sql/lock.cc:301
            #12 0x000055a886d6207f in lock_tables (thd=0x7f0d7401e7d8, tables=0x7f0dad6cc350, count=1, flags=0) at /home/brandon/workspace/server/sql/sql_base.cc:5799
            #13 0x000055a886d61872 in open_and_lock_tables (thd=0x7f0d7401e7d8, options=..., tables=0x7f0dad6cc350, derived=false, flags=0, prelocking_strategy=0x7f0dad6cc1a8) at /home/brandon/workspace/server/sql/sql_base.cc:5559
            #14 0x000055a886d16d6b in open_and_lock_tables (thd=0x7f0d7401e7d8, tables=0x7f0dad6cc350, derived=false, flags=0) at /home/brandon/workspace/server/sql/sql_base.h:510
            #15 0x000055a887090eff in rpl_slave_state::record_gtid (this=0x55a88af06790, thd=0x7f0d7401e7d8, gtid=0x7f0dad6ccac0, sub_id=3, in_transaction=false, in_statement=false, out_hton=0x7f0dad6ccaa8) at /home/brandon/workspace/server/sql/rpl_gtid.cc:692
            #16 0x000055a887092c7b in rpl_slave_state::load (this=0x55a88af06790, thd=0x7f0d7401e7d8, state_from_master=0x7f0d74004a3d "", len=5, reset=false, in_statement=false) at /home/brandon/workspace/server/sql/rpl_gtid.cc:1411
            #17 0x000055a886ceb98d in get_master_version_and_clock (mysql=0x7f0d74018088, mi=0x55a88b743be0) at /home/brandon/workspace/server/sql/slave.cc:2566
            #18 0x000055a886cf323e in handle_slave_io (arg=0x55a88b743be0) at /home/brandon/workspace/server/sql/slave.cc:4813
            #19 0x000055a887503bfc in pfs_spawn_thread (arg=0x7f0d88120a98) at /home/brandon/workspace/server/storage/perfschema/pfs.cc:2201
            #20 0x00007f0dbba5c450 in start_thread (arg=0x7f0dad6ce640) at pthread_create.c:473
            #21 0x00007f0dbb5f3d53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

            Thread 15 (LWP 561407 "mariadbd"):
            #0  __futex_abstimed_wait_common64 (cancel=true, private=-1386027408, abstime=0x7f0dad62eb30, clockid=32525, expected=0, futex_word=0x7f0d6c001220) at ../sysdeps/nptl/futex-internal.c:74
            #1  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0d6c001220, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7f0dad62eb30, private=private@entry=0) at ../sysdeps/nptl/futex-internal.c:123
            #2  0x00007f0dbba6299e in __pthread_cond_wait_common (abstime=0x7f0dad62eb30, clockid=0, mutex=0x7f0d6c001170, cond=0x7f0d6c0011f8) at pthread_cond_wait.c:504
            #3  __pthread_cond_timedwait (cond=0x7f0d6c0011f8, mutex=0x7f0d6c001170, abstime=0x7f0dad62eb30) at pthread_cond_wait.c:646
            #4  0x000055a887a93851 in safe_cond_timedwait (cond=0x7f0d6c0011f8, mp=0x7f0d6c001148, abstime=0x7f0dad62eb30, file=0x55a888025600 "/home/brandon/workspace/server/mysys/my_thr_init.c", line=609) at /home/brandon/workspace/server/mysys/thr_mutex.c:548
            #5  0x000055a887a8c7c5 in psi_cond_timedwait (that=0x7f0d6c0011f8, mutex=0x7f0d6c001148, abstime=0x7f0dad62eb30, file=0x55a887c23ed8 "/home/brandon/workspace/server/sql/mdl.cc", line=1195) at /home/brandon/workspace/server/mysys/my_thr_init.c:609
            #6  0x000055a886ff8c93 in inline_mysql_cond_timedwait (that=0x7f0d6c0011f8, mutex=0x7f0d6c001148, abstime=0x7f0dad62eb30, src_file=0x55a887c23ed8 "/home/brandon/workspace/server/sql/mdl.cc", src_line=1195) at /home/brandon/workspace/server/include/mysql/psi/mysql_thread.h:1086
            #7  0x000055a886ffa5a0 in MDL_wait::timed_wait (this=0x7f0d6c001148, owner=0x7f0d6c0010d8, abs_timeout=0x7f0dad62eb30, set_status_on_timeout=false, wait_state_name=0x55a8885c5a50 <MDL_key::m_namespace_to_wait_state_name+48>) at /home/brandon/workspace/server/sql/mdl.cc:1195
            #8  0x000055a886ffc58f in MDL_context::acquire_lock (this=0x7f0d6c001148, mdl_request=0x7f0dad62ec10, lock_wait_timeout=31536000) at /home/brandon/workspace/server/sql/mdl.cc:2379
            #9  0x000055a886ffce01 in MDL_context::upgrade_shared_lock (this=0x7f0d6c001148, mdl_ticket=0x7f0d6c0084e0, new_type=MDL_EXCLUSIVE, lock_wait_timeout=31536000) at /home/brandon/workspace/server/sql/mdl.cc:2586
            #10 0x000055a886f3fa5c in copy_data_between_tables (thd=0x7f0d6c000ff8, from=0x7f0d7401a948, to=0x7f0d6c019698, create=..., ignore=false, order_num=0, order=0x0, copied=0x7f0dad634330, deleted=0x7f0dad634338, keys_onoff=Alter_info::LEAVE_AS_IS, alter_ctx=0x7f0dad635870, online=true) at /home/brandon/workspace/server/sql/sql_table.cc:12300
            #11 0x000055a886f3c29c in mysql_alter_table (thd=0x7f0d6c000ff8, new_db=0x7f0d6c005cb8, new_name=0x7f0d6c0060c8, create_info=0x7f0dad636680, table_list=0x7f0d6c00ba50, alter_info=0x7f0dad636590, order_num=0, order=0x0, ignore=false, if_exists=false) at /home/brandon/workspace/server/sql/sql_table.cc:11337
            #12 0x000055a886ff611a in Sql_cmd_alter_table::execute (this=0x7f0d6c00c140, thd=0x7f0d6c000ff8) at /home/brandon/workspace/server/sql/sql_alter.cc:553
            #13 0x000055a886e19977 in mysql_execute_command (thd=0x7f0d6c000ff8, is_called_from_prepared_stmt=false) at /home/brandon/workspace/server/sql/sql_parse.cc:5996
            #14 0x000055a886e1fbaf in mysql_parse (thd=0x7f0d6c000ff8, rawbuf=0x7f0d8000fef3 "ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB", length=46, parser_state=0x7f0dad637900) at /home/brandon/workspace/server/sql/sql_parse.cc:8038
            #15 0x000055a88737ae0a in Query_log_event::do_apply_event (this=0x7f0d8000fd18, rgi=0x7f0d8000efa0, query_arg=0x7f0d8000fef3 "ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB", q_len_arg=46) at /home/brandon/workspace/server/sql/log_event_server.cc:2168
            #16 0x000055a8873791a7 in Query_log_event::do_apply_event (this=0x7f0d8000fd18, rgi=0x7f0d8000efa0) at /home/brandon/workspace/server/sql/log_event_server.cc:1613
            #17 0x000055a88736f185 in Log_event::apply_event (this=0x7f0d8000fd18, rgi=0x7f0d8000efa0) at /home/brandon/workspace/server/sql/log_event.cc:4152
            #18 0x000055a886cf0c62 in apply_event_and_update_pos_apply (ev=0x7f0d8000fd18, thd=0x7f0d6c000ff8, rgi=0x7f0d8000efa0, reason=0) at /home/brandon/workspace/server/sql/slave.cc:3881
            #19 0x000055a886cf1459 in apply_event_and_update_pos_for_parallel (ev=0x7f0d8000fd18, thd=0x7f0d6c000ff8, rgi=0x7f0d8000efa0) at /home/brandon/workspace/server/sql/slave.cc:4078
            #20 0x000055a88709a3f3 in rpt_handle_event (qev=0x7f0d80026718, rpt=0x7f0d800199f0) at /home/brandon/workspace/server/sql/rpl_parallel.cc:62
            #21 0x000055a88709def6 in handle_rpl_parallel_thread (arg=0x7f0d800199f0) at /home/brandon/workspace/server/sql/rpl_parallel.cc:1409
            #22 0x000055a887503bfc in pfs_spawn_thread (arg=0x7f0d8001a308) at /home/brandon/workspace/server/storage/perfschema/pfs.cc:2201
            --Type <RET> for more, q to quit, c to continue without paging--c
            #23 0x00007f0dbba5c450 in start_thread (arg=0x7f0dad638640) at pthread_create.c:473
            #24 0x00007f0dbb5f3d53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

            bnestere Brandon Nesterenko made changes -
            Status: Open [ 1 ] -> In Progress [ 3 ]
            bnestere Brandon Nesterenko made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.7 [ 24805 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.3 [ 22126 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.8 [ 26121 ]

            knielsen Kristian Nielsen added a comment -

            Looks like this bug was fixed as part of MDEV-28930.
            The call to MDL_context::upgrade_shared_lock() from copy_data_between_tables(), as seen in the stack traces referred to above, does not exist in the main trees, only in the commit referenced from MDEV-28930 (branch preview-10.10-ddl). So it should be fixed; at least a hang with those stack traces should no longer be possible.

            So is it ok if I close this bug?

            - Kristian.


            bnestere Brandon Nesterenko added a comment -

            Hi knielsen!

            MDEV-28930 was created for an issue related to the new online ALTER work, which coincidentally caused rpl.rpl_mark_optimize_tbl_ddl to fail in a similar way to this report (and quite frequently). With that fixed, this test still fails, but only quite infrequently. I believe the most recent failure is here.

            Another observation (which aligns with MDEV-31482) is that the test only seems to fail with statement binlog format, while performing the INSERT...SELECT commands. The previous commands (OPTIMIZE and INSERT) always finish committing.
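
            For anyone re-running this by hand rather than through mtr's 'stmt'/'mix' combinations, a minimal sketch of checking or forcing the format on a test master (the SET GLOBAL is only appropriate on a throwaway server):

            -- Check which binlog format the master is currently using:
            SELECT @@GLOBAL.binlog_format;
            -- Force statement-based logging for new connections on a test master
            -- (mtr normally drives this via the test combinations instead):
            SET GLOBAL binlog_format = 'STATEMENT';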


            knielsen Kristian Nielsen added a comment -

            Right, that makes sense.
            Looking at the cross-reference, I see two kinds of failures.

            One is somewhat worrying: the slave fails with this error in the log:

            2023-06-08 7:31:36 11 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

            This is from this recent build:

            https://buildbot.mariadb.org/#/builders/318/builds/12630

            This happens in mixed and statement mode, as you say. The test does INSERT ... SELECT on an InnoDB table without a primary key; I wonder if that uses auto-increment behind the scenes and could be related to MDEV-31482. Somehow T2 would need to be selected as the deadlock victim by InnoDB 10 times in a row without T3 ever progressing, and without T3 being deadlock-killed. I am not sure this is really possible, though not reporting autoinc locks due to MDEV-31482 could perhaps avoid the deadlock kill. Maybe I can push a fix for MDEV-31482 and we can see if this goes away. I have not been successful so far in reproducing locally.
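
            (Aside: a quick way to confirm a run is actually accumulating retries rather than failing for some other reason; a generic observation sketch, not part of the test itself:)

            -- On the slave: cumulative retry counter and the last applier error.
            SHOW GLOBAL STATUS LIKE 'Slave_retried_transactions';
            SHOW SLAVE STATUS\G
            -- Last_SQL_Errno / Last_SQL_Error in the latter show error 1213 (deadlock)
            -- once the worker finally gives up, matching the log message quoted above.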

            The other kind of failure happens exclusively on FreeBSD and does not seem restricted to this particular test case. The IO thread stops with this error from the master:

            Last_IO_Error Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 614, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362.'

            Apparently there is some problem on the master with reading the binlog on FreeBSD?

            https://buildbot.mariadb.net/ci/reports/cross_reference#branch=&revision=&platform=&fail_name=rpl.rpl_mark_optimize_tbl_ddl&fail_variant=&fail_info_full=bogus+data+in+log+event&typ=&info=&dt=&limit=100&fail_info_short=

            Maybe a missing check for a short read that only occurs on FreeBSD, or something similar; it would be good to check (how does one access the FreeBSD builder host?).

            I actually see a third kind of failure, only in my local tests on a slightly modified rpl_mark_optimize_tbl_ddl.test; this warning is seen in the error log:

            2023-07-02 14:26:09 32102 [Warning] Slave: Out of memory during slave state maintenance. Some no longer necessary rows in table mysql.gtid_slave_pos may be left undeleted.

            I doubt this is an actual out-of-memory condition; I'm trying to track down whether it could be insufficient locking on the hash being updated. A bit tricky, as it happens only around 1 in 10000 runs (seen twice so far).
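
            (If that warning fires, any rows left behind should be directly visible in the table itself; a small inspection sketch:)

            -- Old rows are normally pruned as transactions commit; a steadily growing
            -- row count here would indicate the maintenance mentioned in the warning
            -- is being skipped.
            SELECT domain_id, sub_id, server_id, seq_no
              FROM mysql.gtid_slave_pos
             ORDER BY domain_id, sub_id;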


            bnestere Brandon Nesterenko added a comment -

            To the first failure: I had a similar thought about a deadlock happening 10 times in a row, so locally I set the retry count to 1 to see if I could find a single deadlock, and never got a failure locally at all (though admittedly I don't remember how many repetitions I actually tried). Another thought (which I never got around to testing) was that something may put the server into a somewhat broken state that would cause any number of retries to fail, no matter what. It seems you already have a test case for MDEV-31482; perhaps I can spend some time to see if there is a way to get rpl_mark_optimize_tbl_ddl.test into a similar state, though I'll be on vacation until Wednesday (this will be my last reply until I'm back).

            The FreeBSD failures are likely covered by another ticket, MDEV-28986. I've spent some time debugging that, but I'll put those findings in a comment on that specific ticket to keep the conversations focused.

            I haven't seen your "Out of memory during slave state maintenance" failure, though I'm interested to see what you dig up about it.

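            (For reference, a minimal sketch of the retry knob mentioned above. slave_transaction_retries is a dynamic global on the replica, default 10; the --slave-transaction-retries=1 form quoted below is the equivalent server command-line option:)

            -- On the replica: make the first unresolved deadlock fatal instead of being
            -- silently retried up to 10 times, so a single bad deadlock becomes visible.
            STOP SLAVE;
            SET GLOBAL slave_transaction_retries = 1;
            START SLAVE;
            SELECT @@GLOBAL.slave_transaction_retries;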

            knielsen Kristian Nielsen added a comment -

            I found the cause of the "third kind of failure", filed as MDEV-31602.
            It's caused by missing locking of LOCK_slave_state when the IO thread calls rpl_global_gtid_slave_state->load().
            But I think this is not related to most of the buildbot failures of rpl_mark_optimize_tbl_ddl.

            knielsen Kristian Nielsen added a comment - - edited

            Running with --slave-transaction-retries=1 is a good idea. I got a failure this way after 34 iterations (--mem --parallel=3 --retry=10000).

            But I'm not sure how this could happen 10 times in a row. Each time, InnoDB will roll back the deadlock victim and release its locks. If the victim is ever the second transaction, the retry will wait for the first transaction to commit, preventing further InnoDB deadlocks. So the first transaction would have to restart and take some lock, after which the second transaction has to take another lock and cause a deadlock. How this can happen 10 times to the second transaction is a mystery; this is a table with just one row.

            So this suggests that your thought "a broken state that would cause any number of retries to fail" could be the explanation, though I have no concrete idea what it could be...


            knielsen Kristian Nielsen added a comment -

            I actually found something. When the VATS scheduling was put into InnoDB, the call to thd_deadlock_victim_preference(), which selects the correct victim when two parallel replication worker threads deadlock against each other, was erroneously removed. This will cause unnecessary extra deadlocks and retries. This should be fixed; I'll probably file another MDEV for it.

            However, I still do not understand how it is possible to get 10 unsuccessful retries in this test; that needs to be understood.


            knielsen Kristian Nielsen added a comment -

            Still working on this, with (slow) progress.
            I managed to reproduce it twice so far, once on each of the two buildbot hosts where it has failed a couple of times. It's slow, taking several days (>100k iterations) to trigger. But as long as it's reproducible somehow, I'm hopeful I'll eventually find something.

            - Kristian.

            knielsen Kristian Nielsen made changes -
            Assignee: Brandon Nesterenko [ JIRAUSER48702 ] -> Kristian Nielsen [ knielsen ]
            knielsen Kristian Nielsen added a comment - - edited

            Ok, I believe I solved this. It's caused by MDEV-31655.
            The scenario is the following:
            1. T1 and T2 end up in a deadlock inside InnoDB.
            2. InnoDB chooses T1 as the victim (bug, MDEV-31655).
            3. Parallel replication deadlock kills T2, thread scheduling delays the kill.
            4. T1 is retried.
            5. T1 immediately deadlocks with T2 and is again chosen as deadlock victim.
            6. Thread scheduling delays T2 so that steps 4 and 5 repeat.
            7. When the deadlock kill of T2 goes through, we see T1 succeeding.
            8. But when 10 retries happen before the deadlock kill, we get the failure.
            Running --repeat=300000 with some extra printouts on the buildbot machines, I was able to reproduce a few times and confirm the above scenario. A typical printout is this:

            HULU3: deadlock kill GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU3: deadlock kill GTID 0-1-7
            HULU1: retry_event_group(GTID 0-1-7)
            

            Here we clearly see that T1 (GTID 0-1-6) is repeatedly chosen as a deadlock victim (4 times in this case) until T2 (GTID 0-1-7) gets deadlock killed, then T1 completes ok. A similar printout from a failed run shows T1 being chosen as a victim a total of 10 times (default of --slave-transaction-retries), causing the test to fail.

            I'm still trying to see if I can come up with a test case that doesn't require running 100k+ times to fail; we will see if I'm successful. But it looks like this bug is a duplicate of MDEV-31655.

            knielsen Kristian Nielsen made changes -
            knielsen Kristian Nielsen made changes -
            Attachment mdev28776_reproduce.patch [ 71022 ]

            knielsen Kristian Nielsen added a comment -

            I managed to create a (hacky, not suitable for pushing) test case, attached to this issue.
            The test case adds a few sleeps around the code to trigger the exact thread scheduling needed to reproduce the problem semi-reliably.
            This conclusively confirms that the cause of these test failures in buildbot is MDEV-31655.
            Running the attached test, I see the same printouts as when reproducing on the buildbot hosts.

            HULU3: deadlock kill GTID 0-1-5
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            2023-07-17  0:18:04 21 [ERROR] Slave worker thread retried transaction 5 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
            2023-07-17  0:18:04 21 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid 0-1-4, Internal MariaDB error code: 1213
            HULU3:     (delayed to now)
            


            knielsen Kristian Nielsen added a comment -

            Pushed to 10.4
            knielsen Kristian Nielsen made changes -
            Fix Version/s 10.4.32 [ 29300 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status: In Progress [ 3 ] -> Closed [ 6 ]
            Elkin Andrei Elkin made changes -
            Fix Version/s 10.5.23 [ 29012 ]
            Fix Version/s 10.6.16 [ 29014 ]
            Fix Version/s 10.10.7 [ 29018 ]
            Fix Version/s 10.11.6 [ 29020 ]
            Fix Version/s 11.0.4 [ 29021 ]
            Fix Version/s 10.9.8 [ 29015 ]

            People

              Assignee: knielsen Kristian Nielsen
              Reporter: angelique.sklavounos Angelique Sklavounos (Inactive)
              Votes: 0
              Watchers: 5
