[MXS-4540] transaction replay retries repeatedly after failing checksum Created: 2023-03-03  Updated: 2023-03-10  Resolved: 2023-03-10

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 6.3.1
Fix Version/s: 6.4.6, 22.08.5, 23.02.1

Type: Bug Priority: Major
Reporter: Henry Hwang (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: triage


 Description   

After a checksum mismatch, transaction replay keeps retrying the transaction instead of stopping once the transaction_replay_attempts threshold is reached.

2023-01-31 10:00:00   info   : (472441) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 147, type: QUERY_TYPE_READ, stmt: SELECT * FROM transaction_history WHERE publisher_id = 2 AND message_type = 1 AND message_id = 'TUAIf45ciVIf' AND alternate_message_usage = 20
2023-01-31 10:00:00   info   : (472441) [readwritesplit] Route query to master: @@xpand_monitor:node-22 <
2023-01-31 10:00:00   info   : (472441) [readwritesplit] (XpandService); Adding COM_QUERY to trx: SELECT * FROM transaction_history WHERE publisher_id = 2 AND message_type = 1 AND message_id = 'TUAIf45ciVIf' AND alternate_message_usage = 20
2023-01-31 10:00:00   info   : (472441) [readwritesplit] (XpandService); Reply complete from '@@xpand_monitor:node-22' (Resultset: 1 rows in 701B)
2023-01-31 10:00:00   info   : (472441) [readwritesplit] (XpandService); Starting transaction replay 1. Replay has been ongoing for 69521 seconds.
2023-01-31 10:00:00   info   : (472441) [readwritesplit] (XpandService); Replaying COM_PING:
2023-01-31 10:00:00   info   : (472441) [readwritesplit] (XpandService); Checksum mismatch, starting transaction replay again.
...
2023-01-31 10:00:01   info   : (472441) > Autocommit: [disabled], trx is [open], cmd: (0x0e) COM_PING, plen: 5, type: QUERY_TYPE_SESSION_WRITE, stmt:
2023-01-31 10:00:01   info   : (472441) [readwritesplit] Session write, routing to all servers.
2023-01-31 10:00:01   info   : (472441) [readwritesplit] Route query to master: @@xpand_monitor:node-22
2023-01-31 10:00:01   info   : (472441) [readwritesplit] Route query to slave: @@xpand_monitor:node-23
2023-01-31 10:00:01   info   : (472441) [readwritesplit] Will return response from '@@xpand_monitor:node-22' to the client
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Reply complete from '@@xpand_monitor:node-23', discarding it.
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Adding COM_PING to trx:
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Reply complete from '@@xpand_monitor:node-22' (OK: 0 warnings)
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Replaying COM_QUERY: SELECT * FROM transaction_history WHERE publisher_id = 2 AND message_type = 1 AND message_id = 'TUAIf45ciVIf' AND alternate_message_usage = 20
2023-01-31 10:00:01   info   : (472441) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 147, type: QUERY_TYPE_READ, stmt: SELECT * FROM transaction_history WHERE publisher_id = 2 AND message_type = 1 AND message_id = 'TUAIf45ciVIf' AND alternate_message_usage = 20
2023-01-31 10:00:01   info   : (472441) [readwritesplit] Route query to master: @@xpand_monitor:node-22 <
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Adding COM_QUERY to trx: SELECT * FROM transaction_history WHERE publisher_id = 2 AND message_type = 1 AND message_id = 'TUAIf45ciVIf' AND alternate_message_usage = 20
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Reply complete from '@@xpand_monitor:node-22' (Resultset: 1 rows in 701B)
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Starting transaction replay 1. Replay has been ongoing for 69522 seconds.
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Replaying COM_PING:
2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); Checksum mismatch, starting transaction replay again.

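It looks like the replay attempt counter isn't incremented with each retry? Every replay in the log above is reported as "Starting transaction replay 1", even though the replay has been ongoing for more than 69,000 seconds.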
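One way to check this observation against a full log is to extract the attempt number from every "Starting transaction replay" message. The following is a minimal sketch, not a MaxScale tool: the function name replay_attempts and the regular expression are assumptions based only on the message format visible in the excerpt above, and the two sample lines are copied from it.

```python
import re

# Matches the readwritesplit message format seen in the log excerpt above.
REPLAY_RE = re.compile(
    r"Starting transaction replay (\d+)\. Replay has been ongoing for (\d+) seconds"
)


def replay_attempts(lines):
    """Return (attempt_number, ongoing_seconds) for every replay start found."""
    found = []
    for line in lines:
        match = REPLAY_RE.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


# Two lines copied from the excerpt in this issue.
sample = [
    "2023-01-31 10:00:00   info   : (472441) [readwritesplit] (XpandService); "
    "Starting transaction replay 1. Replay has been ongoing for 69521 seconds.",
    "2023-01-31 10:00:01   info   : (472441) [readwritesplit] (XpandService); "
    "Starting transaction replay 1. Replay has been ongoing for 69522 seconds.",
]

print(replay_attempts(sample))  # [(1, 69521), (1, 69522)]
```

If the counter were working, the first element of each pair would increase until it reached transaction_replay_attempts. Here it stays at 1 while the elapsed time keeps growing.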

The customer's MaxScale configuration contained the following settings:

transaction_replay=true
transaction_replay_max_size=10Mi
transaction_replay_attempts=3
transaction_replay_retry_on_deadlock=true
transaction_replay_retry_on_mismatch=true

Customer is using MaxScale 6.3.1 with Xpand.



 Comments   
Comment by markus makela [ 2023-03-03 ]

There's a bug where the retry count is reset before the checksums are compared. As a result, when transaction_replay_retry_on_mismatch was enabled, the replay triggered by a checksum mismatch always started again, effectively ignoring the configured replay limits.
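The ordering problem described in this comment can be illustrated with a toy model. This is a sketch only, not the MaxScale source or the actual fix: the class ReplaySession, its methods, and the hard_stop guard are hypothetical names introduced for illustration, and max_attempts merely stands in for transaction_replay_attempts.

```python
from dataclasses import dataclass


@dataclass
class ReplaySession:
    """Toy model of a session replaying a transaction (hypothetical, not MaxScale code)."""
    max_attempts: int            # stands in for transaction_replay_attempts
    reset_before_compare: bool   # True models the buggy ordering from the comment
    attempts: int = 0

    def start_replay(self) -> bool:
        """Start a replay unless the attempt limit has already been reached."""
        if self.attempts >= self.max_attempts:
            return False
        self.attempts += 1
        return True

    def finish_replay(self, checksums_match: bool) -> bool:
        """Return True if a mismatch calls for another replay."""
        if self.reset_before_compare:
            self.attempts = 0    # bug: counter cleared before the comparison
        if checksums_match:
            self.attempts = 0    # clearing after a successful replay is harmless
            return False
        return True


def count_replays(session, hard_stop=20):
    """Replay with a permanent mismatch; hard_stop only keeps the demo finite."""
    started = 0
    while started < hard_stop and session.start_replay():
        started += 1
        if not session.finish_replay(checksums_match=False):
            break
    return started


print(count_replays(ReplaySession(3, reset_before_compare=True)))   # 20 (only hard_stop ends it)
print(count_replays(ReplaySession(3, reset_before_compare=False)))  # 3
```

With the reset placed before the comparison, every replay starts from a counter of zero, which matches the repeated "Starting transaction replay 1" lines in the description. With the reset applied only after a matching checksum, the model stops after three attempts.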
