[MDEV-31509] Lost data with FTWRL and STOP SLAVE Created: 2023-06-20  Updated: 2023-08-25  Resolved: 2023-07-12

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4.0
Fix Version/s: 10.4.31

Type: Bug Priority: Major
Reporter: Kristian Nielsen Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: parallelreplication

Attachments: Text File needed_debug_sync.patch.txt     File rpl_parallel_ftwrl.result     File rpl_parallel_ftwrl.test    

 Description   

Code inspection revealed the following race between STOP SLAVE and FLUSH TABLES WITH READ LOCK:

1. Event groups T1 and T2 are queued but not started yet.
2. FLUSH TABLES WITH READ LOCK (FTWRL) starts and sets rpl_parallel_entry::pause_sub_id.
3. T2 sees pause_sub_id and waits for the pause to complete.
4. FTWRL completes, UNLOCK TABLES is run.
5. STOP SLAVE is run, sets rpl_parallel_entry::stop_sub_id.
6. T2 wakes up after the FTWRL pause and only now sets rpl_parallel_entry::largest_started_sub_id. This is the bug: largest_started_sub_id is set too late.
7. T1 starts and sees stop_sub_id < T1, so T1 is skipped due to STOP SLAVE.
8. T2 continues, because its check of stop_sub_id happened before STOP SLAVE ran. T2 is therefore wrongly applied, and transaction T1 is silently lost.

The requirement is that largest_started_sub_id be set immediately after (or before) checking stop_sub_id, while still holding LOCK_parallel_entry. Instead, it is set only after the FTWRL wait, which can temporarily release LOCK_parallel_entry.
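
The interleaving above can be replayed with a small, single-threaded toy model. This is only a sketch and not the server code: the Entry class, must_skip() and run_scenario() are hypothetical names, T1 and T2 are given sub_ids 1 and 2, and the model assumes that STOP SLAVE derives stop_sub_id from the current largest_started_sub_id, which this report does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

T1, T2 = 1, 2  # hypothetical sub_ids; T1 precedes T2 in the relay log


@dataclass
class Entry:
    """Toy stand-in for rpl_parallel_entry (only the fields from the report)."""
    pause_sub_id: Optional[int] = None
    stop_sub_id: Optional[int] = None
    largest_started_sub_id: int = 0


def must_skip(entry: Entry, sub_id: int) -> bool:
    # A group is skipped if STOP SLAVE has run and the group lies past the stop point.
    return entry.stop_sub_id is not None and sub_id > entry.stop_sub_id


def run_scenario(set_before_wait: bool) -> Tuple[List[int], List[int]]:
    """Replay steps 1-8; returns (applied, skipped) sub_ids.

    set_before_wait=False models the reported bug,
    set_before_wait=True models setting largest_started_sub_id
    right after the stop check, before the FTWRL wait.
    """
    e = Entry()                                   # step 1: T1, T2 queued, not started
    e.pause_sub_id = e.largest_started_sub_id     # step 2: FTWRL starts
    t2_skip = must_skip(e, T2)                    # step 3: T2 checks stop (none yet)
    if set_before_wait and not t2_skip:
        e.largest_started_sub_id = max(e.largest_started_sub_id, T2)
    # T2 now waits for the pause; the lock is released during the wait.
    e.pause_sub_id = None                         # step 4: UNLOCK TABLES
    e.stop_sub_id = e.largest_started_sub_id      # step 5: STOP SLAVE (assumption)
    if not set_before_wait and not t2_skip:       # step 6: T2 wakes up, sets it too late
        e.largest_started_sub_id = max(e.largest_started_sub_id, T2)
    t1_skip = must_skip(e, T1)                    # step 7: T1 starts, checks stop
    applied = [s for s, skip in ((T1, t1_skip), (T2, t2_skip)) if not skip]
    skipped = [s for s, skip in ((T1, t1_skip), (T2, t2_skip)) if skip]
    return applied, skipped                       # step 8: outcome


if __name__ == "__main__":
    print("late update :", run_scenario(set_before_wait=False))
    print("early update:", run_scenario(set_before_wait=True))
```

In this model the late update leaves T2 applied and T1 skipped, which is the lost transaction. With the early update, STOP SLAVE sees that T2 has already started, so T1 is not past the stop point and both groups are applied.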

I will attach an mtr test case.



 Comments   
Comment by Kristian Nielsen [ 2023-07-12 ]

Pushed to 10.4
