MariaDB Server, MDEV-31509

Lost data with FTWRL and STOP SLAVE

Details

    Description

      From code inspection, I found the following problem with STOP SLAVE and FLUSH TABLES WITH READ LOCK (FTWRL) in parallel replication:

      1. Event groups T1 and T2 are queued but not started yet.
      2. FLUSH TABLES WITH READ LOCK starts and sets rpl_parallel_entry::pause_sub_id.
      3. T2 sees pause_sub_id and goes to wait for the pause to complete.
      4. FTWRL completes and UNLOCK TABLES is run.
      5. STOP SLAVE is run and sets rpl_parallel_entry::stop_sub_id.
      6. T2 wakes up after the FTWRL pause and only now sets rpl_parallel_entry::largest_started_sub_id. This is the bug: largest_started_sub_id is set too late.
      7. T1 starts and sees that stop_sub_id is less than its own sub_id, so T1 is skipped due to STOP SLAVE.
      8. T2 continues, because its check of stop_sub_id happened before STOP SLAVE ran. So T2 is wrongly applied while T1 is skipped, silently losing transaction T1.
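
      To make the interleaving concrete, here is a minimal single-threaded replay of the steps above. It is a simplified model and not the server code. The EntryModel struct, the sub_id values 1 and 2, the rule that STOP SLAVE sets stop_sub_id to the current largest_started_sub_id, and the rule that a group is skipped when its sub_id is greater than stop_sub_id are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the fields of rpl_parallel_entry
// named in this report. UINT64_MAX means "no pause / no stop requested".
struct EntryModel {
  std::uint64_t pause_sub_id = UINT64_MAX;
  std::uint64_t stop_sub_id = UINT64_MAX;
  std::uint64_t largest_started_sub_id = 0;
};

struct BuggyOutcome {
  bool t1_applied;
  bool t2_applied;
};

// Replays steps 1-8 in order, with the stop check done BEFORE the FTWRL
// wait and largest_started_sub_id updated only AFTER it (the buggy order).
BuggyOutcome replay_buggy_interleaving() {
  EntryModel e;
  const std::uint64_t T1 = 1, T2 = 2;  // step 1: both queued, none started

  // Step 2: FTWRL pauses every group after the largest started one (assumed).
  e.pause_sub_id = e.largest_started_sub_id;

  // Step 3: T2 checks stop_sub_id (no stop yet), then sees the pause and waits.
  const bool t2_skip = T2 > e.stop_sub_id;
  assert(T2 > e.pause_sub_id);  // in this model T2 must wait for the pause

  // Step 4: FTWRL completes, UNLOCK TABLES clears the pause.
  e.pause_sub_id = UINT64_MAX;

  // Step 5: STOP SLAVE lets already-started groups finish (assumed rule).
  // T2 has not recorded itself yet, so nothing counts as started.
  e.stop_sub_id = e.largest_started_sub_id;

  // Step 6: T2 wakes up and only now records itself as started. Too late.
  if (T2 > e.largest_started_sub_id) e.largest_started_sub_id = T2;

  // Step 7: T1 starts and finds its sub_id beyond stop_sub_id, so it is skipped.
  const bool t1_skip = T1 > e.stop_sub_id;

  // Step 8: T2 proceeds on its stale check and is applied without T1.
  return {!t1_skip, !t2_skip};
}
```

      In this model the replay ends with T2 applied and T1 skipped, which is the lost-transaction outcome described in step 8.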

      The root cause is that largest_started_sub_id must be set immediately after (or before) checking stop_sub_id, while holding LOCK_parallel_entry. Instead, it is set only after the FTWRL wait, which can temporarily release LOCK_parallel_entry.
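
      A minimal sketch of the ordering this calls for, under the same kind of simplified model: the stop check and the update of largest_started_sub_id happen in one critical section, before any wait that could release the lock. The names EntrySketch, check_stop_and_mark_started, and stop_slave are hypothetical, the std::mutex only stands in for LOCK_parallel_entry, and the rule that STOP SLAVE sets stop_sub_id from largest_started_sub_id is an assumption. This is not the actual patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the relevant fields of rpl_parallel_entry.
struct EntrySketch {
  std::mutex lock_parallel_entry;  // stands in for LOCK_parallel_entry
  std::uint64_t stop_sub_id = UINT64_MAX;
  std::uint64_t largest_started_sub_id = 0;
};

// Checks stop_sub_id and records the group as started without releasing
// the lock in between. Returns true if the group should be applied.
bool check_stop_and_mark_started(EntrySketch &e, std::uint64_t sub_id) {
  assert(sub_id > 0);  // 0 is reserved for "nothing started" in this model
  std::lock_guard<std::mutex> guard(e.lock_parallel_entry);
  if (sub_id > e.stop_sub_id) return false;  // beyond the stop point: skip
  e.largest_started_sub_id = std::max(e.largest_started_sub_id, sub_id);
  return true;
}

// Assumed STOP SLAVE rule: groups up to the largest started one may finish.
void stop_slave(EntrySketch &e) {
  std::lock_guard<std::mutex> guard(e.lock_parallel_entry);
  e.stop_sub_id = std::min(e.stop_sub_id, e.largest_started_sub_id);
}

struct FixedOutcome {
  bool t1_applied;
  bool t2_applied;
};

// Same interleaving as in the report, but T2 marks itself started before
// it goes to wait for the FTWRL pause.
FixedOutcome replay_fixed_interleaving() {
  EntrySketch e;
  const bool t2 = check_stop_and_mark_started(e, 2);  // T2, before FTWRL wait
  // ... T2 waits for the FTWRL pause here; the lock may be released ...
  stop_slave(e);                                      // stop_sub_id becomes 2
  const bool t1 = check_stop_and_mark_started(e, 1);  // 1 <= 2, so T1 runs
  return {t1, t2};
}
```

      With this ordering STOP SLAVE already sees T2 as started, so T1 is no longer skipped and both groups are applied in the model.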

      I will attach an mtr test case.

      Attachments

        1. needed_debug_sync.patch.txt
          2 kB
          Kristian Nielsen
        2. rpl_parallel_ftwrl.result
          3 kB
          Kristian Nielsen
        3. rpl_parallel_ftwrl.test
          5 kB
          Kristian Nielsen

          People

            knielsen Kristian Nielsen