mysqltest: At line 46: sync_with_master failed: 'select master_pos_wait('master-bin.000001', 1388, 300, '')' returned -1 indicating timeout after 300 seconds
The result from queries just before the failure was:
< snip >
connection server_1;
FLUSH TABLES;
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
connection server_2;
SET @save_slave_parallel_threads= @@GLOBAL.slave_parallel_threads;
SET @save_slave_parallel_mode= @@GLOBAL.slave_parallel_mode;
include/stop_slave.inc
SET GLOBAL slave_parallel_threads=2;
SET GLOBAL slave_parallel_mode=optimistic;
include/start_slave.inc
connection server_1;
CREATE TABLE t1(a INT) ENGINE=INNODB;
OPTIMIZE TABLE t1;
Table Op Msg_type Msg_text
test.t1 optimize note Table does not support optimize, doing recreate + analyze instead
Kristian Nielsen added a comment:
I actually found something. When the VATS scheduling was put into InnoDB, the call to thd_deadlock_victim_preference(), which selects the correct victim when two parallel replication worker threads deadlock against each other, was erroneously removed. This will cause unnecessary extra deadlocks and retries. This should be fixed; I'll probably file another MDEV for it.
However, I still do not understand how it is possible to get 10 unsuccessful retries in this test; this needs to be understood.
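For context on what the removed call provided, here is a small self-contained illustration (plain C++, not the actual InnoDB code): a stand-in for thd_deadlock_victim_preference() with assumed semantics (negative means the first transaction is the preferred victim, positive the second, zero no preference), showing how skipping the server-layer preference lets an engine-only heuristic pick the transaction that must commit first, exactly the wrong victim for in-order parallel replication.

#include <cstdio>

struct Trx
{
  const char *gtid;
  bool must_commit_first;   // earlier in the parallel-replication commit order
  unsigned weight;          // engine-side heuristic (locks held, undo size, ...)
};

// Stand-in for the server hook: prefer to kill the transaction that has to
// commit later, so the one earlier in the commit order is never the victim.
static int deadlock_victim_preference(const Trx *a, const Trx *b)
{
  if (a->must_commit_first != b->must_commit_first)
    return a->must_commit_first ? +1 : -1;   // kill the later one
  return 0;                                  // no preference
}

static const Trx *choose_victim(const Trx *a, const Trx *b, bool ask_server)
{
  if (ask_server)                            // what the removed call provided
  {
    int pref= deadlock_victim_preference(a, b);
    if (pref < 0) return a;
    if (pref > 0) return b;
  }
  // Engine-only fallback: kill the "lighter" transaction.
  return a->weight <= b->weight ? a : b;
}

int main()
{
  Trx t1{"0-1-6", true, 1};    // must commit first, but holds fewer locks
  Trx t2{"0-1-7", false, 5};
  std::printf("with server preference:    victim %s\n",
              choose_victim(&t1, &t2, true)->gtid);   // 0-1-7 (correct)
  std::printf("without server preference: victim %s\n",
              choose_victim(&t1, &t2, false)->gtid);  // 0-1-6 (the bug)
  return 0;
}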
Kristian Nielsen added a comment:
Still working on this, with (slow) progress.
I managed to reproduce it twice so far, once on each of two buildbot hosts where it has failed a couple of times. It's slow, taking several days (>100k iterations) to trigger, but as long as it's reproducible somehow, I'm hopeful I'll eventually find something.
- Kristian.
Kristian Nielsen added a comment (edited):
Ok, I believe I solved this. It's caused by MDEV-31655.
The scenario is the following:
1. T1 and T2 end up in a deadlock inside InnoDB.
2. InnoDB chooses T1 as the victim (bug, MDEV-31655).
3. Parallel replication deadlock-kills T2, but thread scheduling delays the kill.
4. T1 is retried.
5. T1 immediately deadlocks with T2 and is again chosen as the deadlock victim.
6. Thread scheduling keeps delaying T2, so that steps 4 and 5 repeat.
7. When the deadlock kill of T2 goes through, we see T1 succeeding.
8. But when 10 retries happen before the deadlock kill goes through, we get the failure (see the sketch after this list).
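To make the arithmetic behind step 8 concrete, here is a toy model (plain, self-contained C++, not MariaDB code; the retry budget of 10 is the default of --slave-transaction-retries mentioned below, everything else is made up for illustration). The longer thread scheduling delays the kill of T2, the more retries T1 burns; once the budget is exhausted, the worker gives up, which is the test failure seen in buildbot.

#include <cstdio>

int main()
{
  const int slave_transaction_retries= 10;   // server default retry budget
  // kill_delay: how many times T1 re-deadlocks before the delayed
  // deadlock kill of T2 finally takes effect (varies with scheduling).
  for (int kill_delay= 4; kill_delay <= 11; kill_delay++)
  {
    int retries= 0;
    bool gave_up= false;
    while (retries < kill_delay)             // T2 still holds its locks: deadlock again
    {
      if (retries >= slave_transaction_retries)
      {
        gave_up= true;                       // step 8: "giving up"
        break;
      }
      retries++;                             // another retry_event_group()
    }
    // Leaving the loop without giving up corresponds to step 7: T1 succeeds.
    std::printf("kill delayed by %2d deadlocks -> %s after %d retries\n",
                kill_delay, gave_up ? "GIVE UP" : "success", retries);
  }
  return 0;
}

Compiled with any C++ compiler, it prints success for short delays and GIVE UP once the delay exceeds the retry budget.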
Running --repeat=300000 with some extra printouts on the buildbot machines, I was able to reproduce it a few times and confirm the above scenario. A typical printout is this:
HULU3: deadlock kill GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU2: victim GTID 0-1-6
HULU1: retry_event_group(GTID 0-1-6)
HULU3: deadlock kill GTID 0-1-7
HULU1: retry_event_group(GTID 0-1-7)
Here we clearly see that T1 (GTID 0-1-6) is repeatedly chosen as the deadlock victim (4 times in this case) until T2 (GTID 0-1-7) gets deadlock-killed, after which T1 completes OK. A similar printout from a failed run shows T1 being chosen as the victim a total of 10 times (the default of --slave-transaction-retries), causing the test to fail.
I'm still trying to see if I can come up with a testcase that doesn't require running 100k+ times to fail; we will see if I'm successful. But it looks like this bug is a duplicate of MDEV-31655.
Kristian Nielsen added a comment:
I managed to create a (hacky, not suitable for pushing) testcase, attached to this issue.
This test case adds a few sleeps around the code to produce the exact thread scheduling needed to trigger the problem semi-reliably.
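For readers without the attachment, the kind of instrumentation described is roughly the following sketch (hypothetical, not the attached testcase; it only compiles inside the server tree, and the dbug keyword "delay_slave_deadlock_kill" is made up here, while DBUG_EXECUTE_IF() and my_sleep() are the usual MariaDB debug facilities for injecting such delays):

#include <my_global.h>
#include <my_sys.h>    /* my_sleep() */
#include <my_dbug.h>   /* DBUG_EXECUTE_IF() */

/* Hypothetical spot in the code path that delivers the deadlock kill to the
   other worker.  Sleeping here holds the kill of T2 back long enough for T1
   to be re-chosen as the deadlock victim several times (steps 4-6 above). */
static void deliver_deadlock_kill_delayed()
{
  DBUG_EXECUTE_IF("delay_slave_deadlock_kill", my_sleep(500000););  /* 0.5s */
  /* ... the actual kill of the other worker thread would follow here ... */
}

Such a keyword would typically be switched on from the .test file through the debug_dbug system variable.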
This conclusively confirms that the cause of these test failures in buildbot is MDEV-31655.
Running the attached test, I see the same printouts as when reproducing on the buildbot hosts:
HULU3: deadlock kill GTID 0-1-5
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
HULU1: retry_event_group(GTID 0-1-4)
HULU2: A: victim GTID 0-1-4 (0-1-5)
2023-07-17 0:18:04 21 [ERROR] Slave worker thread retried transaction 5 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
2023-07-17 0:18:04 21 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid 0-1-4, Internal MariaDB error code: 1213
HULU3: (delayed to now)