Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25559

Auto-create: infinite loop after interrupted lock wait

    XMLWordPrintable

    Details

      Description

      The scenario is an extension of the timed deadlock reported as MDEV-25547. While in MDEV-25547 the problem is a delay upon DML, possibly very lengthy but still finite, here it develops into a permanent hang – or at least I have never seen it end, and I don't see in the stack trace any suggestions that it is time-limited.

      I tried to minimize the scenario, but still not completely sure everything that remains there is necessary. Also, while it currently fails every time for me, it is probably still non-deterministic by nature. I expect it should be easy enough to synchronize it properly after the initial analysis, but I couldn't do it blindly.

      The scenario involves 3 connections which compete for auto-creating a partition by running DML on the table. In this case it hits the time interval boundary, although the same applies to limit partitions. When they hit a deadlock similar to described in MDEV-25547, it gets resolved by killing queries in two of three deadlocked connections and finishing their transactions. In the test case query killing is done by the means of max_statement_time, it has the same effect as KILL QUERY.
      The remaining connection is meant to finish its own query without any obstacles.

      However, it doesn't happen. The remaining connection hangs (or rather loops, at 100% CPU) forever.

      I expect it to be a quite realistic scenario, hence the high severity. As mentioned in MDEV-25547, lock_wait_timeout which affects the waiting time in these deadlocks is usually quite lengthy in real-life instances, so it will probably happen a lot that the deadlocks end up being resolved by killing the queries.

      The test case is for reproducing purposes only, don't put it into the regression suite, create a synchronized one instead.

      --source include/have_partition.inc
      --source include/have_metadata_lock_info.inc
      --source include/have_innodb.inc
       
      call mtr.add_suppression("is out of INTERVAL, need more HISTORY partitions");
       
      --connect(con1,localhost,root,,)
      set max_statement_time= 2;
      --connect(con2,localhost,root,,)
      set lock_wait_timeout= 5;
      --connect(con3,localhost,root,,)
      set max_statement_time= 2;
       
      --connection con1
       
      SET @@timestamp = unix_timestamp('2021-04-27 00:00:01');
      create table t (pk int primary key, a int) engine=InnoDB with system versioning partition by system_time interval 1 hour auto;
      insert into t values (1,0),(2,0),(3,0);
       
      --echo #
      --echo # Jump 1 hour forward, start transaction and get partition p1 created
      --echo #
      begin;
      SET @@timestamp = unix_timestamp('2021-04-27 01:00:01');
      update t set a = 1 where pk = 1;
       
      --connection con2
      --echo #
      --echo # Jump 10 more seconds forward, start transaction and get metadata lock on the table
      --echo #
      begin;
      SET @@timestamp = unix_timestamp('2021-04-27 01:00:11');
      update t set a = 2 where pk = 2;
       
      --connection con1
      --echo #
      --echo # Jump another hour forward, try to create partition p2, get locked
      --echo #
      SET @@timestamp = unix_timestamp('2021-04-27 02:00:12');
      --send
        update t set a = 1 where pk = 1;
       
      --connection con2
      --echo #
      --echo # Jump another 10 seconds forward, also try to create partition p2 and get locked
      --echo #
      SET @@timestamp = unix_timestamp('2021-04-27 02:00:22');
      --send
        update t set a = 2 where pk = 2;
       
      --connection con3
      --echo #
      --echo # Jump another 10 seconds forward, start transaction, also try to create partition p2 and get locked
      --echo #
      SET @@timestamp = unix_timestamp('2021-04-27 02:00:33');
      begin;
      --send
        update t set a = 3 where pk = 3;
       
      --connection default
      --echo #
      --echo # If at this point for con2 we see query "update t set a = 2 where pk = 2"
      --echo # in "closing tables" or "Opening tables" state,
      --echo # then it will hang at the end. otherwise it will probably go through
      --echo #
      --query_vertical SELECT ID, COMMAND, STATE, INFO FROM INFORMATION_SCHEMA.PROCESSLIST WHERE INFO = 'update t set a = 2 where pk = 2'
       
      --echo #
      --echo # Now we just wait till the other two connections interrupt the queries
      --echo # due to max_statement_time. It has the same effect as running
      --echo # KILL QUERY for them, only without an external action.
      --echo # Then we commit transactions in them and disconnect, to be sure
      --echo # they don't interfere anymore in any way
      --echo # 
      --connection con3
      --error ER_STATEMENT_TIMEOUT
      --reap
      commit;
      --disconnect con3
       
      --connection con1
      --error ER_STATEMENT_TIMEOUT
      --reap
      commit;
      --disconnect con1
       
      --connection default
      --echo #
      --echo # Now if we got the hang the processlist shows con2 still running the query,
      --echo # everything else idle. Metadata lock info is for informational purposes only,
      --echo # it shows the same single lock whether we got the hang or not
      --echo #
      show processlist;
      select * from information_schema.metadata_lock_info;
       
      --let $pidfile= `SELECT @@pid_file`
      --echo #
      --echo # If it hangs, it hangs here.
      --echo # It doesn't have anything to deadlock with,
      --echo # and even if it did, lock_wait_timeout is only 5 seconds.
      --echo # The reproducer doesn't interrupt the hang,
      --echo # so that it can be analyzed in peace.
      --echo # The server binary: $MYSQLD
      --echo # The server PID:
      --cat_file $pidfile
      --echo #
      --connection con2
      --reap
      --disconnect con2
       
      --connection default
      drop table t;
      

      The troubled connection is con2. At the end of the test case, when other concurrent connections exited, the output is this (starting from the processlist):

      bb-10.6-midenok-MDEV-17554 b96b96f9

      show processlist;
      Id	User	Host	db	Command	Time	State	Info	Progress
      4	root	localhost	test	Query	0	starting	show processlist	0.000
      6	root	localhost	test	Query	2	closing tables	update t set a = 2 where pk = 2	0.000
      select * from information_schema.metadata_lock_info;
      THREAD_ID	LOCK_MODE	LOCK_DURATION	LOCK_TYPE	TABLE_SCHEMA	TABLE_NAME
      6	MDL_SHARED_WRITE	NULL	Table metadata lock	test	t
      #
      # If it hangs, it hangs here.
      # It doesn't have anything to deadlock with,
      # and even if it did, lock_wait_timeout is only 5 seconds.
      # The reproducer doesn't interrupt the hang,
      # so that it can be analyzed in peace.
      

      The thread runs around somewhere in open_table, so the stack trace is different every time a snapshot is taken. Here are a couple of examples:

      #0  0x0000556b9b959c81 in lf_hash_search_using_hash_value (hash=0x556b9d1c1120 <tdc_hash>, pins=0x556b9fd02f18, hashnr=1932676072, key=0x7f68ec0146d9, keylen=7) at /data/src/bb-10.6-midenok-MDEV-17554/mysys
      /lf_hash.cc:499
      #1  0x0000556b9a6b0b95 in tdc_acquire_share (thd=0x7f68ec000db8, tl=0x7f68ec014260, flags=3, out_table=0x7f69242843e8) at /data/src/bb-10.6-midenok-MDEV-17554/sql/table_cache.cc:808
      #2  0x0000556b9a08f438 in open_table (thd=0x7f68ec000db8, table_list=0x7f68ec014260, ot_ctx=0x7f69242847d0) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_base.cc:1918
      #3  0x0000556b9a09871e in open_and_process_table (thd=0x7f68ec000db8, tables=0x7f68ec014260, counter=0x7f69242848e8, flags=0, prelocking_strategy=0x7f6924284858, has_prelocking_list=false, ot_ctx=0x7f692428
      47d0) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_base.cc:3970
      #4  0x0000556b9a09ac73 in open_tables (thd=0x7f68ec000db8, options=..., start=0x7f69242848c0, counter=0x7f69242848e8, flags=0, prelocking_strategy=0x7f6924284858) at /data/src/bb-10.6-midenok-MDEV-17554/sql
      /sql_base.cc:4453
      #5  0x0000556b9a077f22 in open_tables (thd=0x7f68ec000db8, tables=0x7f69242848c0, counter=0x7f69242848e8, flags=0) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_base.h:479
      #6  0x0000556b9a40bbdb in mysql_update (thd=0x7f68ec000db8, table_list=0x7f68ec014260, fields=..., values=..., conds=0x7f68ec014d40, order_num=0, order=0x0, limit=18446744073709551615, ignore=false, found_r
      eturn=0x7f6924284f50, updated_return=0x7f6924285040) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_update.cc:409
      #7  0x0000556b9a1e41c6 in mysql_execute_command (thd=0x7f68ec000db8) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:4403
      #8  0x0000556b9a1fda04 in mysql_parse (thd=0x7f68ec000db8, rawbuf=0x7f68ec014180 "update t set a = 2 where pk = 2", length=31, parser_state=0x7f6924285490) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_pa
      rse.cc:8021
      #9  0x0000556b9a1d3d51 in dispatch_command (command=COM_QUERY, thd=0x7f68ec000db8, packet=0x7f68ec00b899 "update t set a = 2 where pk = 2", packet_length=31, blocking=true) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:1898
      #10 0x0000556b9a1d0e06 in do_command (thd=0x7f68ec000db8, blocking=true) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:1407
      #11 0x0000556b9a4f947b in do_handle_one_connection (connect=0x556ba00acbc8, put_in_cache=true) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_connect.cc:1410
      #12 0x0000556b9a4f8e3e in handle_one_connection (arg=0x556ba00acbc8) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_connect.cc:1312
      #13 0x0000556b9aeb0310 in pfs_spawn_thread (arg=0x556ba00acca8) at /data/src/bb-10.6-midenok-MDEV-17554/storage/perfschema/pfs.cc:2201
      #14 0x00007f692a5f7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #15 0x00007f692a1cb293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      #0  0x0000556b9b988bb4 in code_state () at /data/src/bb-10.6-midenok-MDEV-17554/dbug/dbug.c:379
      #1  0x0000556b9b98c548 in _db_enter_ (_func_=0x556b9bd7b946 "ha_partition::loop_partitions", _file_=0x556b9bd77fa8 "/data/src/bb-10.6-midenok-MDEV-17554/sql/ha_partition.cc", _line_=9322, _stack_frame_=0x7f
      6924284210) at /data/src/bb-10.6-midenok-MDEV-17554/dbug/dbug.c:1133
      #2  0x0000556b9ae96e3f in ha_partition::loop_partitions (this=0x7f68e802e8e0, callback=0x556b9ae9531d <extra_cb(handler*, void*)>, param=0x7f6924284254) at /data/src/bb-10.6-midenok-MDEV-17554/sql/ha_partit
      ion.cc:9322
      #3  0x0000556b9ae95e37 in ha_partition::extra (this=0x7f68e802e8e0, operation=HA_EXTRA_IS_ATTACHED_CHILDREN) at /data/src/bb-10.6-midenok-MDEV-17554/sql/ha_partition.cc:9144
      #4  0x0000556b9a6ad8ce in tc_acquire_table (thd=0x7f68ec000db8, element=0x7f68e8036f38) at /data/src/bb-10.6-midenok-MDEV-17554/sql/table_cache.cc:401
      #5  0x0000556b9a6b1297 in tdc_acquire_share (thd=0x7f68ec000db8, tl=0x7f68ec014260, flags=3, out_table=0x7f69242843e8) at /data/src/bb-10.6-midenok-MDEV-17554/sql/table_cache.cc:862
      #6  0x0000556b9a08f438 in open_table (thd=0x7f68ec000db8, table_list=0x7f68ec014260, ot_ctx=0x7f69242847d0) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_base.cc:1918
      #7  0x0000556b9a09871e in open_and_process_table (thd=0x7f68ec000db8, tables=0x7f68ec014260, counter=0x7f69242848e8, flags=0, prelocking_strategy=0x7f6924284858, has_prelocking_list=false, ot_ctx=0x7f692428
      47d0) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_base.cc:3970
      #8  0x0000556b9a09ac73 in open_tables (thd=0x7f68ec000db8, options=..., start=0x7f69242848c0, counter=0x7f69242848e8, flags=0, prelocking_strategy=0x7f6924284858) at /data/src/bb-10.6-midenok-MDEV-17554/sql
      /sql_base.cc:4453
      #9  0x0000556b9a077f22 in open_tables (thd=0x7f68ec000db8, tables=0x7f69242848c0, counter=0x7f69242848e8, flags=0) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_base.h:479
      #10 0x0000556b9a40bbdb in mysql_update (thd=0x7f68ec000db8, table_list=0x7f68ec014260, fields=..., values=..., conds=0x7f68ec014d40, order_num=0, order=0x0, limit=18446744073709551615, ignore=false, found_return=0x7f6924284f50, updated_return=0x7f6924285040) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_update.cc:409
      #11 0x0000556b9a1e41c6 in mysql_execute_command (thd=0x7f68ec000db8) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:4403
      #12 0x0000556b9a1fda04 in mysql_parse (thd=0x7f68ec000db8, rawbuf=0x7f68ec014180 "update t set a = 2 where pk = 2", length=31, parser_state=0x7f6924285490) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:8021
      #13 0x0000556b9a1d3d51 in dispatch_command (command=COM_QUERY, thd=0x7f68ec000db8, packet=0x7f68ec00b899 "update t set a = 2 where pk = 2", packet_length=31, blocking=true) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:1898
      #14 0x0000556b9a1d0e06 in do_command (thd=0x7f68ec000db8, blocking=true) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_parse.cc:1407
      #15 0x0000556b9a4f947b in do_handle_one_connection (connect=0x556ba00acbc8, put_in_cache=true) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_connect.cc:1410
      #16 0x0000556b9a4f8e3e in handle_one_connection (arg=0x556ba00acbc8) at /data/src/bb-10.6-midenok-MDEV-17554/sql/sql_connect.cc:1312
      #17 0x0000556b9aeb0310 in pfs_spawn_thread (arg=0x556ba00acca8) at /data/src/bb-10.6-midenok-MDEV-17554/storage/perfschema/pfs.cc:2201
      #18 0x00007f692a5f7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #19 0x00007f692a1cb293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              midenok Aleksey Midenkov
              Reporter:
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Git Integration