[MDEV-31509] Lost data with FTWRL and STOP SLAVE - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.4.0
Fix Version/s: 10.4.31
Component/s: Replication
Labels:
- parallelreplication

Description

Code inspection revealed the following problem with STOP SLAVE and FLUSH TABLES WITH READ LOCK:

1. Event groups T1 and T2 are queued but not started yet.
2. FLUSH TABLES WITH READ LOCK (FTWRL) starts and sets rpl_parallel_entry::pause_sub_id.
3. T2 sees pause_sub_id and goes to wait for the pause to complete.
4. FTWRL completes, UNLOCK TABLES is run.
5. STOP SLAVE is run, sets rpl_parallel_entry::stop_sub_id.
6. T2 wakes up after the FTWRL pause and only now sets rpl_parallel_entry::largest_started_sub_id. This is the bug: largest_started_sub_id is set too late here.
7. T1 starts and sees stop_sub_id < T1, so T1 is skipped due to STOP SLAVE.
8. T2 continues, because its check of stop_sub_id happened before STOP SLAVE ran. T2 is therefore wrongly applied, silently losing transaction T1.

To be correct, largest_started_sub_id must be set immediately after (or before) checking stop_sub_id, while holding LOCK_parallel_entry. The problem is that it is currently set only after the FTWRL wait, and that wait can temporarily release LOCK_parallel_entry.
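The interleaving above can be replayed deterministically in a small model. This is only a sketch, not the server code: the `Entry` class, the `replay` function and the `fixed` flag are hypothetical names, the sub_ids 1 and 2 for T1 and T2 are made up, and the model assumes that STOP SLAVE sets stop_sub_id to the current largest_started_sub_id and that an event group is skipped when its sub_id is greater than stop_sub_id. With `fixed=True` the model sets largest_started_sub_id together with the stop_sub_id check, which is one ordering consistent with the requirement stated above.

```python
NO_STOP = float("inf")  # model value for "no STOP SLAVE pending"


class Entry:
    """Hypothetical stand-in for the two rpl_parallel_entry fields involved."""

    def __init__(self):
        self.stop_sub_id = NO_STOP
        self.largest_started_sub_id = 0


def replay(fixed):
    """Replay steps 1-8 in order and return the sub_ids that get applied."""
    e = Entry()
    t1, t2 = 1, 2          # step 1: T1 and T2 queued, not yet started
    applied = []

    # Steps 2-3: FTWRL pauses the entry. T2 checks stop_sub_id before
    # going into the FTWRL wait (no stop is pending at this point).
    t2_may_run = t2 <= e.stop_sub_id
    if fixed and t2_may_run:
        # Ordering required by the report: record the start together
        # with the stop check, before the lock can be released.
        e.largest_started_sub_id = max(e.largest_started_sub_id, t2)

    # Step 4: FTWRL completes (T2's wait released the lock meanwhile).
    # Step 5: STOP SLAVE runs. Assumption: it lets already started
    # groups finish by setting stop_sub_id = largest_started_sub_id.
    e.stop_sub_id = e.largest_started_sub_id

    # Step 6: T2 wakes up; the buggy ordering records the start only now.
    if not fixed and t2_may_run:
        e.largest_started_sub_id = max(e.largest_started_sub_id, t2)

    # Step 7: T1 starts and checks stop_sub_id.
    if t1 <= e.stop_sub_id:
        e.largest_started_sub_id = max(e.largest_started_sub_id, t1)
        applied.append(t1)

    # Step 8: T2 continues, based on its earlier check.
    if t2_may_run:
        applied.append(t2)
    return applied


if __name__ == "__main__":
    print("buggy ordering applies:", replay(fixed=False))
    print("fixed ordering applies:", replay(fixed=True))
```

In the buggy ordering the model applies T2 but not T1, matching the lost transaction described in step 8. In the fixed ordering STOP SLAVE already sees T2 as started, so T1 is not skipped.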

I will attach an mtr test case.

Attachments

- needed_debug_sync.patch.txt (2 kB, 2023-06-20 20:24)
- rpl_parallel_ftwrl.result (3 kB, 2023-06-20 20:24)
- rpl_parallel_ftwrl.test (5 kB, 2023-06-20 20:24)

People

Assignee: Kristian Nielsen

Reporter: Kristian Nielsen

Votes: 0

Watchers: 3

Dates

Created: 2023-06-20 20:21

Updated: 2023-08-25 12:52

Resolved: 2023-07-12 08:07
