Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25547

Auto-create: Undetected deadlock lasts longer than the configured timeout

Details

    Description

      As probably suspected and even expected, concurrent connections can easily deadlock if they come to the point when they need to create a new partition. The deadlock is subject to lock_wait_timeout (which can already be quite high, e.g. the default value is 24 hours), but in fact it takes even more than that – lock waits apparently get serialized and the total duration is up to N x lock_wait_timeout where N is the number of connections participating in concurrency.

      Below is the test case for 3 connections.
      lock_wait_timeout=3 is set for each connection. Yet, the second UPDATE in con1 takes 9 seconds.
      The ranges which transactions operate with are configured so that there are no row overlaps/deadlocks, so innodb_lock_wait_timeout does not play a role here.

      --source include/have_partition.inc
      --source include/have_sequence.inc
      --source include/have_innodb.inc
       
      call mtr.add_suppression("is out of LIMIT, need more HISTORY partitions");
       
      create table t (pk int primary key, a int) engine=InnoDB
        with system versioning partition by system_time limit 100 auto;
       
      insert into t select seq, seq from seq_1_to_90;
      replace into t select seq, seq from seq_1_to_90;
       
      --connect (con1,localhost,root,,)
      set lock_wait_timeout= 3;
      begin;
      update t set a = a + 100 where pk < 20;
       
      --connect (con2,localhost,root,,)
      set lock_wait_timeout= 3;
      begin;
      --send
        update t set a = a + 10 where pk = 30;
       
      --connect (con3,localhost,root,,)
      set lock_wait_timeout= 3;
      begin;
      --send
        update t set a = a + 10 where pk = 40;
       
      --connection con1
      select now();
      update t set a = a + 100 where pk < 20;
      select now();
       
      --connection con2
      --reap
      commit;
       
      --connection con3
      --reap
      commit;
       
      --connection con1
      commit;
       
      # Cleanup
      drop table t;
      

      bb-10.6-midenok-MDEV-17554 b96b96f9f

      connection con1;
      select now();
      now()
      2021-04-28 00:43:15
      update t set a = a + 100 where pk < 20;
      Warnings:
      Error	1205	Lock wait timeout exceeded; try restarting transaction
      Warning	4114	Versioned table `test`.`t`: last HISTORY partition (`p0`) is out of LIMIT, need more HISTORY partitions
      select now();
      now()
      2021-04-28 00:43:24
      

      Attachments

        Issue Links

          Activity

            Please review bb-10.9-midenok

            midenok Aleksey Midenkov added a comment - Please review bb-10.9-midenok

            I don't know if that's a bug at all, lock_wait_timeout sets the timeout for one individual lock wait, if there are multiple waits, each one of them can take up to lock_wait_timeout seconds. It's not a total limit for all lock waits together. I'd say this behaves as expected.

            Also, the commit a4218739d92 fixes a separate problem:

            there is much serious problem. While the waiting threads wait for the creating thread to complete they run the loop continuously, acquiring and releasing MDL_SHARED_WRITE. If the creating thread is blocked for a long time waiting for MDL_EXCLUSIVE and the lock wait timeout is usually very long as MDEV-25547 states, this long time the waiting threads will hog up the CPU time.

            I don't see how this scenario could happen. A creating thread can be blocked for a long time waiting for MDL_EXCLUSIVE, indeed, by another MDL, if a running statement that holds e.g. MDL_SHARED_READ. But in this case new upcoming MDL requests will wait too, there will be no busy loop, instead the an incoming request for e.g. MDL_SHARED_READ will wait for the MDL_EXCLUSIVE, which, in turn, waits for another MDL_SHARED_READ.

            serg Sergei Golubchik added a comment - I don't know if that's a bug at all, lock_wait_timeout sets the timeout for one individual lock wait, if there are multiple waits, each one of them can take up to lock_wait_timeout seconds. It's not a total limit for all lock waits together. I'd say this behaves as expected. Also, the commit a4218739d92 fixes a separate problem: there is much serious problem. While the waiting threads wait for the creating thread to complete they run the loop continuously, acquiring and releasing MDL_SHARED_WRITE. If the creating thread is blocked for a long time waiting for MDL_EXCLUSIVE and the lock wait timeout is usually very long as MDEV-25547 states, this long time the waiting threads will hog up the CPU time. I don't see how this scenario could happen. A creating thread can be blocked for a long time waiting for MDL_EXCLUSIVE, indeed, by another MDL, if a running statement that holds e.g. MDL_SHARED_READ. But in this case new upcoming MDL requests will wait too, there will be no busy loop, instead the an incoming request for e.g. MDL_SHARED_READ will wait for the MDL_EXCLUSIVE, which, in turn, waits for another MDL_SHARED_READ.

            People

              serg Sergei Golubchik
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.