MariaDB Server / MDEV-15152

Optimistic parallel slave doesn't cope well with START SLAVE UNTIL

Details

    Description

      When giving a parallel optimistic slave a replication stop position with

      START SLAVE UNTIL ...file..., ...pos...;

      it may actually stop at a position earlier than the given one if the slave worker executing the transaction that spans the given stop position has to roll back due to conflicts.

      This seems to leave the slave in a state where the SQL master thread still waits for the transaction to complete, but does not hand out any more tasks to the actual worker threads. The tasks it no longer hands out seem to include re-execution of rolled-back transactions that start before the stop position.

      So replication effectively stops, but the SQL thread still shows as "Slave_SQL_Running: Yes", with Exec_Master_Log_Pos showing a log position smaller than the given stop position. UNTIL should instead end with Slave_SQL_Running: No and the Exec position at the beginning of the first transaction (or event?) after the stop position.

      Meanwhile the actual slave worker threads seem to have been terminated already, and attempts to stop the slave do not succeed.

        Activity

          Elkin Andrei Elkin added a comment -

          I've seen problems with the Aggressive mode.
          It is not flawless yet.
          For MDEV-12746, which I have not yet updated with analysis, a patch is coming, hopefully by tonight. It is going to refine the automatic retry logic. Whether or how it is relevant to the UNTIL issue is unclear for now.

          jonahgeorge Jonah George added a comment -

          I've also been experiencing sporadic issues that seem related to this (on v10.1.28). Curious whether anyone has narrowed down a reproducible test case.


          jeanfrancois.gagne Jean-François Gagné added a comment -

          I am sure this is fully reproducible. When I did the private support ticket to Hartmut related to that, I was hitting this situation many times. I would expect a tight UPDATE loop on the same row to make the problem very easy to reproduce. I, sadly, do not have much time to work on this now. Let me just add that I am very disappointed that such a bug in an important feature is not receiving more attention, which deserves a #MariaDB #BugOfTheDay: https://twitter.com/jfg956/status/1082406523282903041

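The "tight UPDATE loop on the same row" suggested above could be generated roughly as follows. This is an untested sketch of a workload, not a confirmed reproduction: the table name `t1`, its `id` and `counter` columns, and the function name are all hypothetical. The idea is that each statement, run under autocommit on the master, becomes its own transaction touching the same row, so an optimistic parallel slave would be expected to hit conflicts and rollbacks when applying them concurrently.

```python
from typing import Iterator


def same_row_updates(count: int, table: str = "t1", row_id: int = 1) -> Iterator[str]:
    """Yield `count` UPDATE statements that all modify the same row.

    The table and column names are hypothetical placeholders.
    """
    for _ in range(count):
        yield f"UPDATE {table} SET counter = counter + 1 WHERE id = {row_id};"


if __name__ == "__main__":
    # Print a small batch; pipe the output to a client connected to the master.
    for statement in same_row_updates(5):
        print(statement)
```
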
          Elkin Andrei Elkin added a comment - edited

          I can confirm that an OPTIMISTIC parallel slave started with UNTIL can reach the UNTIL condition and exit with Relay_Master_Log_File:Exec_Master_Log_Pos < Until_Log_File:Until_Log_Pos.
          This may happen when the optimistically executed transaction range spans more than one binlog file, so that the actual stop occurs in an earlier file: Relay_Master_Log_File < Until_Log_File.

          This is a clear failure, and the details discovered strongly hint at how to fix it.

          Note too that there are no traces of the hanging SQL thread now: {{ Slave_SQL_Running: No }}.
          (Perhaps removed by MDEV-12746 fixes).
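The condition described in this comment compares (file, position) pairs, not positions alone, which matters once the stop lands in an earlier binlog file. A minimal sketch of that check against the fields of SHOW SLAVE STATUS is below. It assumes binlog file names end in a numeric sequence suffix such as `.000002`; the helper names and the sample values are hypothetical, and this only illustrates the comparison, not the server's own UNTIL logic.

```python
from typing import Dict, NamedTuple, Tuple


class BinlogCoord(NamedTuple):
    """A (file, position) pair as reported by SHOW SLAVE STATUS."""
    log_file: str
    log_pos: int

    def sort_key(self) -> Tuple[int, int]:
        # Assumption: the file name ends in ".<sequence number>".
        # Order by that number first, then by the offset within the file.
        index = int(self.log_file.rsplit(".", 1)[1])
        return (index, self.log_pos)


def stopped_short(status: Dict[str, str]) -> bool:
    """True if the SQL thread has stopped before the UNTIL coordinates."""
    executed = BinlogCoord(status["Relay_Master_Log_File"],
                           int(status["Exec_Master_Log_Pos"]))
    until = BinlogCoord(status["Until_Log_File"], int(status["Until_Log_Pos"]))
    return (status["Slave_SQL_Running"] == "No"
            and executed.sort_key() < until.sort_key())


# Hypothetical sample values: the stop happened one binlog file early.
sample = {
    "Slave_SQL_Running": "No",
    "Relay_Master_Log_File": "master-bin.000001",
    "Exec_Master_Log_Pos": "5000",
    "Until_Log_File": "master-bin.000002",
    "Until_Log_Pos": "400",
}
```

With the sample above, comparing positions alone would suggest the slave went past the target (5000 > 400), while the file-aware comparison reports that it stopped short.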

          Elkin Andrei Elkin added a comment -

          Sachin, could you please prioritize the review as soon as possible? KristianN has not shown up for it (he must just be busy).


          sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

          Okay to push

          People

            Elkin Andrei Elkin
            hholzgra Hartmut Holzgraefe
            Votes: 3
            Watchers: 13
