[MDEV-20628] If temporary error occurs with optimistic mode of parallel replication, error message has false information Created: 2019-09-19  Updated: 2024-01-24  Resolved: 2020-10-09

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2.27
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Geoff Montee (Inactive) Assignee: Sachin Setiya (Inactive)
Resolution: Duplicate Votes: 1
Labels: None

Issue Links:
Blocks
is blocked by MDEV-33303 slave_parallel_mode=optimistic should... Confirmed
Relates
relates to MDEV-13577 slave_parallel_mode=optimistic should... Closed
relates to MDEV-30010 Slave (additional info): Commit faile... Closed
relates to MDEV-20604 Duplicate key value is silently trunc... Closed
relates to MDEV-20605 Awaken transaction can miss inserted ... Closed

 Description   

With optimistic mode of parallel replication, if a query fails, then you can end up with log messages like this:

2019-09-17 19:27:07 140190015801088 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2019-09-17 19:27:07 140190015801088 [Warning] Slave: Duplicate entry '...' for key 'name' Error_code: 1062
2019-09-17 19:27:07 140190015801088 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2019-09-17 19:27:07 140190015801088 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2019-09-17 19:27:07 140190015801088 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'
2019-09-17 19:27:07 140190315325184 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2019-09-17 19:27:07 140190315325184 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2019-09-17 19:27:07 140190315325184 [Warning] Slave: Connection was killed Error_code: 1927
2019-09-17 19:27:07 140190315325184 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2019-09-17 19:27:07 140190315325184 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2019-09-17 19:27:07 140190315325184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

These messages are printed in slave_output_error_info. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/slave.cc#L4756

With optimistic mode of parallel replication, the slave_output_error_info function can be called when a transaction has failed because another transaction that it depends on has failed. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221

Some errors are considered "temporary errors". The only temporary errors are currently just deadlocks and timeouts. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/slave.cc#L3381

In the case of a "temporary error", the event group can actually be retried. See here:

https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1305

These retries happen if slave_transaction_retries is greater than 0.

https://mariadb.com/kb/en/library/replication-and-binary-log-system-variables/#slave_transaction_retries

In the example shown above, the errors were deadlocks, so the transactions were indeed retried. Notice that the following message was still included in the above log snippet:

2019-09-17 19:27:07 140190315325184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'

This message is not quite correct in this case.

  • The SQL thread was not aborted.
  • The SQL thread does not need to be manually restarted.

It seems like this error message in slave_output_error_info should be slightly different if the SQL thread encounters a temporary error. In that case, maybe it should be a Warning instead of an Error, and maybe it should say something like this instead:

2019-09-17 19:27:07 140190315325184 [Warning] Temporary error running query. The slave SQL thread will retry the query. We encountered the temporary error at log 'mariadb-bin.000001' position 50284765; GTID position '0-1-119489'



 Comments   
Comment by Andrei Elkin [ 2019-09-19 ]

GeoffMontee: Thanks for looking into this! When a temporary error is encountered the message should be of the [Warning] level, incl the last unsuccessful retry I think 'cos we have a run-out-of-retries conclusive error.

Comment by Sachin Setiya (Inactive) [ 2020-08-18 ]

Hi GeoffMontee

I tried to simulate the same error log as mentioned , But for me slave was stopped

And one more thing if this condition is true
https://github.com/MariaDB/server/blob/mariadb-10.2.27/sql/rpl_parallel.cc#L1221
We will never retry event
https://github.com/MariaDB/server/blob/f9f968557f9c632a083acabb3578631ce0d1c4f6/sql/rpl_parallel.cc#L1309
Because worker_error is not zero

Comment by Geoff Montee (Inactive) [ 2020-08-18 ]

Hi sachin.setiya.007,

I tried to simulate the same error log as mentioned , But for me slave was stopped

Why would the slave stop? Aren't the errors in the log snippet "temporary errors", so they should be automatically retried?

Or is the "duplicate key error" in the log snippet a "real error" and not a "temporary error"? By the way, that "duplicate key error" was probably caused by MDEV-20605, as these two issues are related.

Comment by Sachin Setiya (Inactive) [ 2020-08-20 ]

Hi GeoffMontee!

I am will try to simulate the MDEV-20605, Then I will have a better understanding of this error.
But if I read it correctly we are getting Duplicate key error in MDEV-20605 , And replication is also stopped

             Slave_IO_Running: Yes
            Slave_SQL_Running: No

Comment by Geoff Montee (Inactive) [ 2020-08-20 ]

Hi sachin.setiya.007,

Yes, that is correct. In the case of MDEV-20605, the SQL thread is stopped when the duplicate key error occurs.

For this specific issue, I don't remember how I verified that the slave threads did not stop. I probably encountered this problem while reproducing MDEV-20605, but I don't remember. Unfortunately, I already deleted the VM with the MDEV-20605 test data set a long time ago to cut costs, so I can't easily confirm it.

Maybe in some edge cases, the problems that lead to MDEV-20605 can also lead to this issue. Or maybe this issue does really happen as it is explained in the description, so it is not a real bug. I wish I could remember more details for you. Sorry!

Comment by Julien Fritsch [ 2020-10-09 ]

Duplicate of MDEV-20605

Comment by Andrei Elkin [ 2023-01-27 ]

rob.schwyzer@mariadb.com, the reason I blamed MDEV-20605 is the dup key error which can not be retried.
I could not reach to details through your link, may be you can do that to make sure there are no DUP key error and the whole problem is that the slave sets the error status wrongly, while continuing to progress with automatic transaction retrying.

Generated at Thu Feb 08 09:00:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.