Details
Bug
Status: Closed
Major
Resolution: Fixed
10.4.0
Description
Code inspection revealed the following problem with STOP SLAVE and FLUSH TABLES WITH READ LOCK (FTWRL):
1. Event groups T1 and T2 are queued but not started yet.
2. FLUSH TABLES WITH READ LOCK starts, sets rpl_parallel_entry::pause_sub_id.
3. T2 sees pause_sub_id, goes to wait for the pause to complete.
4. FTWRL completes, UNLOCK TABLES is run.
5. STOP SLAVE is run, sets rpl_parallel_entry::stop_sub_id.
6. T2 wakes up after the FTWRL pause and only now sets rpl_parallel_entry::largest_started_sub_id. This is the bug: largest_started_sub_id is set too late here.
7. T1 starts and sees stop_sub_id < T1, so T1 is skipped due to STOP SLAVE.
8. T2 continues. Its check of stop_sub_id happened before STOP SLAVE ran, so T2 is wrongly applied and transaction T1 is silently lost.
largest_started_sub_id must be set immediately after (or before) checking stop_sub_id, while holding LOCK_parallel_entry. The problem is that it is currently set only after the FTWRL wait, which can temporarily release LOCK_parallel_entry.
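The interleaving in steps 1-8 can be replayed with a small deterministic model. This is only a sketch and not the server code: Entry, stop_slave(), passes_stop_check(), mark_started() and replay() are hypothetical names, threads and LOCK_parallel_entry are replaced by a fixed step order, and it assumes that STOP SLAVE sets stop_sub_id to the current largest_started_sub_id and that a group is skipped when its sub_id is greater than stop_sub_id.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified stand-in for rpl_parallel_entry. Only the three
// fields named in this report are modelled.
struct Entry {
  uint64_t pause_sub_id = std::numeric_limits<uint64_t>::max();
  uint64_t stop_sub_id = std::numeric_limits<uint64_t>::max();
  uint64_t largest_started_sub_id = 0;
};

// Assumption: STOP SLAVE lets already-started groups finish and skips the rest.
void stop_slave(Entry &e) { e.stop_sub_id = e.largest_started_sub_id; }

// Assumption: a group may run only if its sub_id is not above stop_sub_id.
bool passes_stop_check(const Entry &e, uint64_t sub_id) {
  return sub_id <= e.stop_sub_id;
}

void mark_started(Entry &e, uint64_t sub_id) {
  if (sub_id > e.largest_started_sub_id) e.largest_started_sub_id = sub_id;
}

// Replays steps 1-8 in a fixed order and returns the sub_ids that get applied.
// mark_before_wait == false models the reported behaviour,
// mark_before_wait == true models the proposed ordering.
std::vector<uint64_t> replay(bool mark_before_wait) {
  Entry e;
  const uint64_t T1 = 1, T2 = 2;  // step 1: both groups queued, none started
  std::vector<uint64_t> applied;

  // Step 2: FTWRL starts and sets pause_sub_id (assumed value).
  e.pause_sub_id = e.largest_started_sub_id;

  // Step 3: T2 checks stop_sub_id (no stop requested yet), then goes to wait.
  const bool t2_may_run = passes_stop_check(e, T2);
  if (t2_may_run && mark_before_wait) mark_started(e, T2);  // proposed ordering

  // Step 4: FTWRL completes, UNLOCK TABLES is run.
  e.pause_sub_id = std::numeric_limits<uint64_t>::max();

  // Step 5: STOP SLAVE runs while T2 is still waking up.
  stop_slave(e);

  // Step 6: T2 wakes up; in the reported behaviour it marks itself only now.
  if (t2_may_run && !mark_before_wait) mark_started(e, T2);

  // Step 7: T1 starts and checks stop_sub_id.
  if (passes_stop_check(e, T1)) {
    mark_started(e, T1);
    applied.push_back(T1);
  }

  // Step 8: T2 continues based on its earlier check.
  if (t2_may_run) applied.push_back(T2);
  return applied;
}
```

Under these assumptions, replay(false) returns only T2 (T1 is lost), while replay(true) returns T1 followed by T2, because STOP SLAVE then sees that T2 has already started and lets the earlier T1 run as well.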
I will attach an mtr test case.