Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-20628

If temporary error occurs with optimistic mode of parallel replication, error message has false information

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Duplicate
    • 10.2.27
    • N/A
    • Replication
    • None

    Description

      With optimistic mode of parallel replication, if a query fails, then you can end up with log messages like this:

      2019-09-17 19:27:07 140190015801088 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2019-09-17 19:27:07 140190015801088 [Warning] Slave: Duplicate entry '...' for key 'name' Error_code: 1062
      2019-09-17 19:27:07 140190015801088 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2019-09-17 19:27:07 140190015801088 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2019-09-17 19:27:07 140190015801088 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'
      2019-09-17 19:27:07 140190315325184 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2019-09-17 19:27:07 140190315325184 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2019-09-17 19:27:07 140190315325184 [Warning] Slave: Connection was killed Error_code: 1927
      2019-09-17 19:27:07 140190315325184 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2019-09-17 19:27:07 140190315325184 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2019-09-17 19:27:07 140190315325184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'
      

      These messages are printed in slave_output_error_info. See here:

      https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/slave.cc#L4756

      With optimistic mode of parallel replication, the slave_output_error_info function can be called when a transaction has failed because another transaction that it depends on has failed. See here:

      https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221

      Some errors are considered "temporary errors". The only temporary errors are currently just deadlocks and timeouts. See here:

      https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/slave.cc#L3381

      In the case of a "temporary error", the event group can actually be retried. See here:

      https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1305

      These retries happen if slave_transaction_retries is greater than 0.

      https://mariadb.com/kb/en/library/replication-and-binary-log-system-variables/#slave_transaction_retries

      In the example shown above, the errors were deadlocks, so the transactions were indeed retried. Notice that the following message was still included in the above log snippet:

      2019-09-17 19:27:07 140190315325184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

      This message is not quite correct in this case.

      • The SQL thread was not aborted.
      • The SQL thread does not need to be manually restarted.

      It seems like this error message in slave_output_error_info should be slightly different if the SQL thread encounters a temporary error. In that case, maybe it should be a Warning instead of an Error, and maybe it should say something like this instead:

      2019-09-17 19:27:07 140190315325184 [Warning] Temporary error running query. The slave SQL thread will retry the query. We encountered the temporary error at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin added a comment -

            GeoffMontee: Thanks for looking into this! When a temporary error is encountered the message should be of the [Warning] level, incl the last unsuccessful retry I think 'cos we have a run-out-of-retries conclusive error.

            Elkin Andrei Elkin added a comment - GeoffMontee : Thanks for looking into this! When a temporary error is encountered the message should be of the [Warning] level, incl the last unsuccessful retry I think 'cos we have a run-out-of-retries conclusive error.

            Hi GeoffMontee

            I tried to simulate the same error log as mentioned , But for me slave was stopped

            And one more thing if this condition is true
            https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221
            We will never retry event
            https://github.com/MariaDB/server/blob/f9f968557f9c632a083acabb3578631ce0d1c4f6/sql/rpl_parallel.cc#L1309
            Because worker_error is not zero

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - Hi GeoffMontee I tried to simulate the same error log as mentioned , But for me slave was stopped And one more thing if this condition is true https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221 We will never retry event https://github.com/MariaDB/server/blob/f9f968557f9c632a083acabb3578631ce0d1c4f6/sql/rpl_parallel.cc#L1309 Because worker_error is not zero

            Hi sachin.setiya.007,

            I tried to simulate the same error log as mentioned , But for me slave was stopped

            Why would the slave stop? Aren't the errors in the log snippet "temporary errors", so they should be automatically retried?

            Or is the "duplicate key error" in the log snippet a "real error" and not a "temporary error"? By the way, that "duplicate key error" was probably caused by MDEV-20605, as these two issues are related.

            GeoffMontee Geoff Montee (Inactive) added a comment - Hi sachin.setiya.007 , I tried to simulate the same error log as mentioned , But for me slave was stopped Why would the slave stop? Aren't the errors in the log snippet "temporary errors", so they should be automatically retried? Or is the "duplicate key error" in the log snippet a "real error" and not a "temporary error"? By the way, that "duplicate key error" was probably caused by MDEV-20605 , as these two issues are related.

            Hi GeoffMontee!

            I am will try to simulate the MDEV-20605, Then I will have a better understanding of this error.
            But if I read it correctly we are getting Duplicate key error in MDEV-20605 , And replication is also stopped

                         Slave_IO_Running: Yes
                        Slave_SQL_Running: No
            

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - Hi GeoffMontee ! I am will try to simulate the MDEV-20605 , Then I will have a better understanding of this error. But if I read it correctly we are getting Duplicate key error in MDEV-20605 , And replication is also stopped Slave_IO_Running: Yes Slave_SQL_Running: No

            Hi sachin.setiya.007,

            Yes, that is correct. In the case of MDEV-20605, the SQL thread is stopped when the duplicate key error occurs.

            For this specific issue, I don't remember how I verified that the slave threads did not stop. I probably encountered this problem while reproducing MDEV-20605, but I don't remember. Unfortunately, I already deleted the VM with the MDEV-20605 test data set a long time ago to cut costs, so I can't easily confirm it.

            Maybe in some edge cases, the problems that lead to MDEV-20605 can also lead to this issue. Or maybe this issue does really happen as it is explained in the description, so it is not a real bug. I wish I could remember more details for you. Sorry!

            GeoffMontee Geoff Montee (Inactive) added a comment - Hi sachin.setiya.007 , Yes, that is correct. In the case of MDEV-20605 , the SQL thread is stopped when the duplicate key error occurs. For this specific issue, I don't remember how I verified that the slave threads did not stop. I probably encountered this problem while reproducing MDEV-20605 , but I don't remember. Unfortunately, I already deleted the VM with the MDEV-20605 test data set a long time ago to cut costs, so I can't easily confirm it. Maybe in some edge cases, the problems that lead to MDEV-20605 can also lead to this issue. Or maybe this issue does really happen as it is explained in the description, so it is not a real bug. I wish I could remember more details for you. Sorry!

            Duplicate of MDEV-20605

            julien.fritsch Julien Fritsch added a comment - Duplicate of MDEV-20605
            Elkin Andrei Elkin added a comment -

            rob.schwyzer@mariadb.com, the reason I blamed MDEV-20605 is the dup key error which can not be retried.
            I could not reach to details through your link, may be you can do that to make sure there are no DUP key error and the whole problem is that the slave sets the error status wrongly, while continuing to progress with automatic transaction retrying.

            Elkin Andrei Elkin added a comment - rob.schwyzer@mariadb.com , the reason I blamed MDEV-20605 is the dup key error which can not be retried. I could not reach to details through your link, may be you can do that to make sure there are no DUP key error and the whole problem is that the slave sets the error status wrongly, while continuing to progress with automatic transaction retrying.

            People

              sachin.setiya.007 Sachin Setiya (Inactive)
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.