mysqltest: At line 46: sync_with_master failed: 'select master_pos_wait('master-bin.000001', 1388, 300, '')' returned -1 indicating timeout after 300 seconds
The result from queries just before the failure was:
< snip >
connection server_1;
FLUSH TABLES;
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
connection server_2;
SET @save_slave_parallel_threads= @@GLOBAL.slave_parallel_threads;
SET @save_slave_parallel_mode= @@GLOBAL.slave_parallel_mode;
include/stop_slave.inc
SET GLOBAL slave_parallel_threads=2;
SET GLOBAL slave_parallel_mode=optimistic;
include/start_slave.inc
connection server_1;
CREATE TABLE t1(a INT) ENGINE=INNODB;
OPTIMIZE TABLE t1;
Table Op Msg_type Msg_text
test.t1 optimize note Table does not support optimize, doing recreate + analyze instead
Kristian Nielsen added a comment:
I actually found something. When the VATS scheduling was put into InnoDB, the call to thd_deadlock_victim_preference(), which selects the correct victim when two parallel replication worker threads deadlock against each other, was erroneously removed. This will cause unnecessary extra deadlocks and retries. This should be fixed; I'll probably file another MDEV for it.
However, I still do not understand how it is possible to get 10 unsuccessful retries in this test; this needs to be understood.
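For context on what the removed call provided, here is a small self-contained illustration (plain C++, not the actual InnoDB code): a stand-in for thd_deadlock_victim_preference() with assumed semantics (negative means the first transaction is the preferred victim, positive the second, zero no preference), showing how skipping the server-layer preference lets an engine-only heuristic pick the transaction that must commit first, exactly the wrong victim for in-order parallel replication.

#include <cstdio>

struct Trx
{
  const char *gtid;
  bool must_commit_first;   // earlier in the parallel-replication commit order
  unsigned weight;          // engine-side heuristic (locks held, undo size, ...)
};

// Stand-in for the server hook: prefer to kill the transaction that has to
// commit later, so the one earlier in the commit order is never the victim.
static int deadlock_victim_preference(const Trx *a, const Trx *b)
{
  if (a->must_commit_first != b->must_commit_first)
    return a->must_commit_first ? +1 : -1;   // kill the later one
  return 0;                                  // no preference
}

static const Trx *choose_victim(const Trx *a, const Trx *b, bool ask_server)
{
  if (ask_server)                            // what the removed call provided
  {
    int pref= deadlock_victim_preference(a, b);
    if (pref < 0) return a;
    if (pref > 0) return b;
  }
  // Engine-only fallback: kill the "lighter" transaction.
  return a->weight <= b->weight ? a : b;
}

int main()
{
  Trx t1{"0-1-6", true, 1};    // must commit first, but holds fewer locks
  Trx t2{"0-1-7", false, 5};
  std::printf("with server preference:    victim %s\n",
              choose_victim(&t1, &t2, true)->gtid);   // 0-1-7 (correct)
  std::printf("without server preference: victim %s\n",
              choose_victim(&t1, &t2, false)->gtid);  // 0-1-6 (the bug)
  return 0;
}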
Kristian Nielsen added a comment:
Still working on this, with (slow) progress.
I managed to reproduce it twice so far, once on each of two buildbot hosts where it has failed a couple of times. It's slow, taking several days (>100k iterations) to trigger, but as long as it's reproducible somehow, I'm hopeful I'll eventually find something.
- Kristian.
Kristian Nielsen added a comment (edited):
Ok, I believe I solved this. It's caused by MDEV-31655.
The scenario is the following:
1. T1 and T2 end up in a deadlock inside InnoDB.
2. InnoDB chooses T1 as the victim (bug, MDEV-31655).
3. Parallel replication deadlock-kills T2, but thread scheduling delays the kill.
4. T1 is retried.
5. T1 immediately deadlocks with T2 and is again chosen as the deadlock victim.
6. Thread scheduling keeps delaying T2, so that steps 4 and 5 repeat.
7. When the deadlock kill of T2 goes through, we see T1 succeeding.
8. But when 10 retries happen before the deadlock kill goes through, we get the failure (see the sketch after this list).
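To make the arithmetic behind step 8 concrete, here is a toy model (plain, self-contained C++, not MariaDB code; the retry budget of 10 is the default of --slave-transaction-retries mentioned below, everything else is made up for illustration). The longer thread scheduling delays the kill of T2, the more retries T1 burns; once the budget is exhausted, the worker gives up, which is the test failure seen in buildbot.

#include <cstdio>

int main()
{
  const int slave_transaction_retries= 10;   // server default retry budget
  // kill_delay: how many times T1 re-deadlocks before the delayed
  // deadlock kill of T2 finally takes effect (varies with scheduling).
  for (int kill_delay= 4; kill_delay <= 11; kill_delay++)
  {
    int retries= 0;
    bool gave_up= false;
    while (retries < kill_delay)             // T2 still holds its locks: deadlock again
    {
      if (retries >= slave_transaction_retries)
      {
        gave_up= true;                       // step 8: "giving up"
        break;
      }
      retries++;                             // another retry_event_group()
    }
    // Leaving the loop without giving up corresponds to step 7: T1 succeeds.
    std::printf("kill delayed by %2d deadlocks -> %s after %d retries\n",
                kill_delay, gave_up ? "GIVE UP" : "success", retries);
  }
  return 0;
}

Compiled with any C++ compiler, it prints success for short delays and GIVE UP once the delay exceeds the retry budget.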
Running --repeat=300000 with some extra printouts on the buildbot machines, I was able to reproduce it a few times and confirm the above scenario. A typical printout is this:
HULU3: deadlock kill GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU3: deadlock kill GTID 0-1-7
HULU1: retry_event_group(GTID 0-1-7)
Here we clearly see that T1 (GTID 0-1-6) is repeatedly chosen as the deadlock victim (4 times in this case) until T2 (GTID 0-1-7) gets deadlock-killed, after which T1 completes OK. A similar printout from a failed run shows T1 being chosen as the victim a total of 10 times (the default of --slave-transaction-retries), causing the test to fail.
I'm still trying to see if I can come up with a testcase that doesn't require running 100k+ times to fail; we will see if I'm successful. But it looks like this bug is a duplicate of MDEV-31655.
Kristian Nielsen added a comment:
I managed to create a (hacky, not suitable for pushing) testcase, attached to this issue.
This test case adds a few sleeps around the code to produce the exact thread scheduling needed to trigger the problem semi-reliably.
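For readers without the attachment, the kind of instrumentation described is roughly the following sketch (hypothetical, not the attached testcase; it only compiles inside the server tree, and the dbug keyword "delay_slave_deadlock_kill" is made up here, while DBUG_EXECUTE_IF() and my_sleep() are the usual MariaDB debug facilities for injecting such delays):

#include <my_global.h>
#include <my_sys.h>    /* my_sleep() */
#include <my_dbug.h>   /* DBUG_EXECUTE_IF() */

/* Hypothetical spot in the code path that delivers the deadlock kill to the
   other worker.  Sleeping here holds the kill of T2 back long enough for T1
   to be re-chosen as the deadlock victim several times (steps 4-6 above). */
static void deliver_deadlock_kill_delayed()
{
  DBUG_EXECUTE_IF("delay_slave_deadlock_kill", my_sleep(500000););  /* 0.5s */
  /* ... the actual kill of the other worker thread would follow here ... */
}

Such a keyword would typically be switched on from the .test file through the debug_dbug system variable.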
This conclusively confirms that the cause of these test failures in buildbot is MDEV-31655.
Running the attached test, I see the same printouts as when reproducing on the buildbot hosts:
HULU3: deadlock kill GTID 0-1-5
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
2023-07-17 0:18:04 21 [ERROR] Slave worker thread retried transaction 5 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
2023-07-17 0:18:04 21 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid 0-1-4, Internal MariaDB error code: 1213
HULU3: (delayed to now)