[MDEV-20628] If temporary error occurs with optimistic mode of parallel replication, error message has false information - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Duplicate
Affects Version/s: 10.2.27
Fix Version/s: N/A
Component/s: Replication
Labels:
None

Description

With optimistic mode of parallel replication, if a query fails, then you can end up with log messages like this:

2019-09-17 19:27:07 140190015801088 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2019-09-17 19:27:07 140190015801088 [Warning] Slave: Duplicate entry '...' for key 'name' Error_code: 1062

2019-09-17 19:27:07 140190015801088 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2019-09-17 19:27:07 140190015801088 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2019-09-17 19:27:07 140190015801088 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

2019-09-17 19:27:07 140190315325184 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2019-09-17 19:27:07 140190315325184 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2019-09-17 19:27:07 140190315325184 [Warning] Slave: Connection was killed Error_code: 1927

2019-09-17 19:27:07 140190315325184 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2019-09-17 19:27:07 140190315325184 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2019-09-17 19:27:07 140190315325184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

These messages are printed in slave_output_error_info. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/slave.cc#L4756

With optimistic mode of parallel replication, the slave_output_error_info function can be called when a transaction has failed because another transaction that it depends on has failed. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221

Some errors are considered "temporary errors". The only temporary errors are currently just deadlocks and timeouts. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/slave.cc#L3381

In the case of a "temporary error", the event group can actually be retried. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1305

These retries happen if slave_transaction_retries is greater than 0.

https://mariadb.com/kb/en/library/replication-and-binary-log-system-variables/#slave_transaction_retries

In the example shown above, the errors were deadlocks, so the transactions were indeed retried. Notice that the following message was still included in the above log snippet:

2019-09-17 19:27:07 140190315325184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

This message is not quite correct in this case.

The SQL thread was not aborted.
The SQL thread does not need to be manually restarted.

It seems like this error message in slave_output_error_info should be slightly different if the SQL thread encounters a temporary error. In that case, maybe it should be a Warning instead of an Error, and maybe it should say something like this instead:

2019-09-17 19:27:07 140190315325184 [Warning] Temporary error running query. The slave SQL thread will retry the query. We encountered the temporary error at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

Attachments

Issue Links

is blocked by

MDEV-33303 slave_parallel_mode=optimistic should not report the mode's specific temporary errors

Closed

relates to

MDEV-13577 slave_parallel_mode=optimistic should not report the mode's specific temporary errors

Closed

MDEV-30010 Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

Closed

MDEV-20604 Duplicate key value is silently truncated to 64 characters in print_keydup_error

Closed

MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration

Closed

Activity

Ascending order - Click to sort in descending order

Andrei Elkin added a comment - 2019-09-19 12:46

GeoffMontee: Thanks for looking into this! When a temporary error is encountered the message should be of the [Warning] level, incl the last unsuccessful retry I think 'cos we have a run-out-of-retries conclusive error.

Andrei Elkin added a comment - 2019-09-19 12:46 GeoffMontee : Thanks for looking into this! When a temporary error is encountered the message should be of the [Warning] level, incl the last unsuccessful retry I think 'cos we have a run-out-of-retries conclusive error.

Sachin Setiya (Inactive) added a comment - 2020-08-18 11:58

Hi GeoffMontee

I tried to simulate the same error log as mentioned , But for me slave was stopped

And one more thing if this condition is true
https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221
We will never retry event
https://github.com/MariaDB/server/blob/f9f968557f9c632a083acabb3578631ce0d1c4f6/sql/rpl_parallel.cc#L1309
Because worker_error is not zero

Sachin Setiya (Inactive) added a comment - 2020-08-18 11:58 Hi GeoffMontee I tried to simulate the same error log as mentioned , But for me slave was stopped And one more thing if this condition is true https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221 We will never retry event https://github.com/MariaDB/server/blob/f9f968557f9c632a083acabb3578631ce0d1c4f6/sql/rpl_parallel.cc#L1309 Because worker_error is not zero

Geoff Montee (Inactive) added a comment - 2020-08-18 17:47

Hi sachin.setiya.007,

I tried to simulate the same error log as mentioned , But for me slave was stopped

Why would the slave stop? Aren't the errors in the log snippet "temporary errors", so they should be automatically retried?

Or is the "duplicate key error" in the log snippet a "real error" and not a "temporary error"? By the way, that "duplicate key error" was probably caused by ~~MDEV-20605~~, as these two issues are related.

Geoff Montee (Inactive) added a comment - 2020-08-18 17:47 Hi sachin.setiya.007 , I tried to simulate the same error log as mentioned , But for me slave was stopped Why would the slave stop? Aren't the errors in the log snippet "temporary errors", so they should be automatically retried? Or is the "duplicate key error" in the log snippet a "real error" and not a "temporary error"? By the way, that "duplicate key error" was probably caused by MDEV-20605 , as these two issues are related.

Sachin Setiya (Inactive) added a comment - 2020-08-20 11:15

Hi GeoffMontee!

I am will try to simulate the ~~MDEV-20605~~, Then I will have a better understanding of this error.
But if I read it correctly we are getting Duplicate key error in ~~MDEV-20605~~ , And replication is also stopped

             Slave_IO_Running: Yes

            Slave_SQL_Running: No

Sachin Setiya (Inactive) added a comment - 2020-08-20 11:15 Hi GeoffMontee ! I am will try to simulate the MDEV-20605 , Then I will have a better understanding of this error. But if I read it correctly we are getting Duplicate key error in MDEV-20605 , And replication is also stopped Slave_IO_Running: Yes Slave_SQL_Running: No

Geoff Montee (Inactive) added a comment - 2020-08-20 17:21

Hi sachin.setiya.007,

Yes, that is correct. In the case of ~~MDEV-20605~~, the SQL thread is stopped when the duplicate key error occurs.

For this specific issue, I don't remember how I verified that the slave threads did not stop. I probably encountered this problem while reproducing ~~MDEV-20605~~, but I don't remember. Unfortunately, I already deleted the VM with the ~~MDEV-20605~~ test data set a long time ago to cut costs, so I can't easily confirm it.

Maybe in some edge cases, the problems that lead to ~~MDEV-20605~~ can also lead to this issue. Or maybe this issue does really happen as it is explained in the description, so it is not a real bug. I wish I could remember more details for you. Sorry!

Geoff Montee (Inactive) added a comment - 2020-08-20 17:21 Hi sachin.setiya.007 , Yes, that is correct. In the case of MDEV-20605 , the SQL thread is stopped when the duplicate key error occurs. For this specific issue, I don't remember how I verified that the slave threads did not stop. I probably encountered this problem while reproducing MDEV-20605 , but I don't remember. Unfortunately, I already deleted the VM with the MDEV-20605 test data set a long time ago to cut costs, so I can't easily confirm it. Maybe in some edge cases, the problems that lead to MDEV-20605 can also lead to this issue. Or maybe this issue does really happen as it is explained in the description, so it is not a real bug. I wish I could remember more details for you. Sorry!

Julien Fritsch added a comment - 2020-10-09 10:05

Duplicate of ~~MDEV-20605~~

Julien Fritsch added a comment - 2020-10-09 10:05 Duplicate of MDEV-20605

Andrei Elkin added a comment - 2023-01-27 12:28

rob.schwyzer@mariadb.com, the reason I blamed ~~MDEV-20605~~ is the dup key error which can not be retried.
I could not reach to details through your link, may be you can do that to make sure there are no DUP key error and the whole problem is that the slave sets the error status wrongly, while continuing to progress with automatic transaction retrying.

Andrei Elkin added a comment - 2023-01-27 12:28 rob.schwyzer@mariadb.com , the reason I blamed MDEV-20605 is the dup key error which can not be retried. I could not reach to details through your link, may be you can do that to make sure there are no DUP key error and the whole problem is that the slave sets the error status wrongly, while continuing to progress with automatic transaction retrying.

People

Assignee:: Sachin Setiya (Inactive)

Reporter:: Geoff Montee (Inactive)

Votes:: 1 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2019-09-19 00:12

Updated:: 2024-09-06 10:00

Resolved:: 2020-10-09 10:05

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server