MariaDB Server / MDEV-28776

rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master

    Description

      https://buildbot.mariadb.org/#/builders/192/builds/10092

      The test occasionally fails with:

      10.5 22f935d6d

      rpl.rpl_mark_optimize_tbl_ddl 'innodb,mix' w6 [ fail ]
              Test ended at 2022-06-02 22:23:29
       
      CURRENT_TEST: rpl.rpl_mark_optimize_tbl_ddl
      analyze: sync_with_master
      mysqltest: At line 46: sync_with_master failed: 'select master_pos_wait('master-bin.000001', 1388, 300, '')' returned -1 indicating timeout after 300 seconds
       
      The result from queries just before the failure was:
      < snip >
      connection server_1;
      FLUSH TABLES;
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
      connection server_2;
      SET @save_slave_parallel_threads= @@GLOBAL.slave_parallel_threads;
      SET @save_slave_parallel_mode= @@GLOBAL.slave_parallel_mode;
      include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=2;
      SET GLOBAL slave_parallel_mode=optimistic;
      include/start_slave.inc
      connection server_1;
      CREATE TABLE t1(a INT) ENGINE=INNODB;
      OPTIMIZE TABLE t1;
      Table	Op	Msg_type	Msg_text
      test.t1	optimize	note	Table does not support optimize, doing recreate + analyze instead
      test.t1	optimize	status	OK
      INSERT INTO t1 VALUES(1);
      INSERT INTO t1  SELECT 1+a FROM t1;
      INSERT INTO t1  SELECT 2+a FROM t1;
      connection server_2;
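
      For context, the sync_with_master step that times out here is just a wait on the MASTER_POS_WAIT() call quoted in the error message above. A minimal SQL sketch of the equivalent manual wait, reusing the binlog file, position and timeout from that message purely as illustrative values:

      -- Run on the slave (server_2). Blocks until the slave SQL thread has
      -- applied events up to the given master binlog position, or until the
      -- 300-second timeout expires; returns -1 on timeout, which is exactly
      -- what the test reported.
      SELECT MASTER_POS_WAIT('master-bin.000001', 1388, 300);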
      

          Activity

            knielsen Kristian Nielsen added a comment

            I actually found something. When the VATS scheduling was put into InnoDB, the call to thd_deadlock_victim_preference() that selects the correct victim when two parallel replication worker threads deadlock against each other was erroneously removed. This causes unnecessary extra deadlocks and retries. It should be fixed; I'll probably file another MDEV for it.

            However, I still do not understand how it is possible to get 10 unsuccessful retries in this test; that needs to be understood.
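
            When reproducing such a conflict by hand, one way to see which transaction InnoDB picked as the deadlock victim is the LATEST DETECTED DEADLOCK section of the InnoDB monitor on the slave; a minimal sketch:

            -- Shows the two transactions involved in the most recent deadlock
            -- and which one InnoDB rolled back as the victim.
            SHOW ENGINE INNODB STATUS\G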

            knielsen Kristian Nielsen added a comment

            Still working on this, with (slow) progress.
            I managed to reproduce it twice so far, once on each of two buildbot hosts where it has failed a couple of times. It's slow, taking several days (>100k iterations) to trigger, but as long as it's reproducible somehow, I'm hopeful I'll eventually find something.

            - Kristian.
            knielsen Kristian Nielsen added a comment - edited

            Ok, I believe I solved this. It's caused by MDEV-31655.
            The scenario is the following:
            1. T1 and T2 end up in a deadlock inside InnoDB.
            2. InnoDB chooses T1 as the victim (bug, MDEV-31655).
            3. Parallel replication deadlock-kills T2, but thread scheduling delays the kill.
            4. T1 is retried.
            5. T1 immediately deadlocks with T2 and is again chosen as deadlock victim.
            6. Thread scheduling delays T2 so that steps 4 and 5 repeat.
            7. When the deadlock kill of T2 finally goes through, we see T1 succeeding.
            8. But when 10 retries happen before the deadlock kill, we get the failure.
            Running --repeat=300000 with some extra printouts on the buildbot machines, I was able to reproduce a few times and confirm the above scenario. A typical printout is this:

            HULU3: deadlock kill GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU2: victim GTID 0-1-6
            HULU1: retry_event_group(GTID 0-1-6)
            HULU3: deadlock kill GTID 0-1-7
            HULU1: retry_event_group(GTID 0-1-7)
            

            Here we clearly see that T1 (GTID 0-1-6) is repeatedly chosen as a deadlock victim (4 times in this case) until T2 (GTID 0-1-7) gets deadlock killed, after which T1 completes OK. A similar printout from a failed run shows T1 being chosen as a victim a total of 10 times (the default of --slave-transaction-retries), causing the test to fail.
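
            Both the retry budget and the number of retries performed so far can be inspected from SQL on the slave; a small sketch using the standard variable and status names:

            -- slave_transaction_retries is the per-transaction retry budget
            -- (10 by default, as mentioned above).
            SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';
            -- Slave_retried_transactions counts how many retries the slave's
            -- worker threads have performed in total since server start.
            SHOW GLOBAL STATUS LIKE 'Slave_retried_transactions';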

            I'm still trying to see if I can come up with a testcase that doesn't require running 100k+ times to fail; we will see if I'm successful. But it looks like this bug is a duplicate of MDEV-31655.


            knielsen Kristian Nielsen added a comment

            I managed to create a (hacky, not suitable for pushing) testcase, attached to this issue.
            This test case adds a few sleeps around the code to trigger the exact thread scheduling needed to trigger the problem semi-reliably.
            This conclusively confirms that the cause of these test failures in buildbot is MDEV-31655.
            Running the attached test, I see the same printouts as when reproducing on the buildbot hosts.

            HULU3: deadlock kill GTID 0-1-5
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            HULU1: retry_event_group(GTID 0-1-4)
            HULU2: A: victim GTID 0-1-4 (0-1-5)
            2023-07-17  0:18:04 21 [ERROR] Slave worker thread retried transaction 5 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
            2023-07-17  0:18:04 21 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid 0-1-4, Internal MariaDB error code: 1213
            HULU3:     (delayed to now)
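
            The error above suggests raising slave_transaction_retries. That is only a workaround for the repeated victim selection (the real fix is MDEV-31655), but as a sketch, with an arbitrarily chosen value:

            -- Workaround only: give worker threads a larger retry budget so a
            -- transaction that keeps being picked as deadlock victim still gets
            -- through. Stop/start the slave threads so the new value is
            -- certainly picked up.
            STOP SLAVE;
            SET GLOBAL slave_transaction_retries = 100;
            START SLAVE;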
            


            knielsen Kristian Nielsen added a comment

            Pushed to 10.4

            People

              knielsen Kristian Nielsen
              angelique.sklavounos Angelique Sklavounos (Inactive)
