There are three different ways in which the transaction replay can behave:
- If a transaction replay happens when a server is down and does not come back up, it takes at most delayed_retry_timeout seconds to fail.
- If the transaction replay happens when a connection dies and reconnection is allowed but fails immediately, it takes at most transaction_replay_attempts seconds to fail (each replay happens at least one second apart). This can happen when a hard limit (e.g. max_connections) is hit that is only checked after the TCP connection has been opened.
- The worst-case time is delayed_retry_timeout multiplied by transaction_replay_attempts. It occurs when a node is down for less than delayed_retry_timeout, comes back up and allows new connections to be created, but then immediately closes them for some reason.
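To make the spread between the three cases concrete, the bounds above can be put into numbers. This is only a rough sketch: the function name and the dictionary keys are hypothetical, the values used are illustrative, and the one-second spacing between replays is taken from the description above.

```python
def worst_case_replay_seconds(delayed_retry_timeout: int,
                              transaction_replay_attempts: int) -> dict:
    """Upper bounds, in seconds, for the three cases described above.

    Hypothetical helper for illustration; not part of any real API.
    """
    return {
        # Server is down and never comes back: one delayed retry window.
        "server_stays_down": delayed_retry_timeout,
        # Reconnect succeeds but fails at once: about one second per attempt.
        "reconnect_fails_immediately": transaction_replay_attempts,
        # Server comes back just before each window ends, then drops the
        # connection again: every attempt can use up a full window.
        "server_flaps": delayed_retry_timeout * transaction_replay_attempts,
    }


# Illustrative values only.
bounds = worst_case_replay_seconds(delayed_retry_timeout=10,
                                   transaction_replay_attempts=5)
print(bounds)
```

With these example values the bounds are 10, 5 and 50 seconds, so the same two settings produce failure times that differ by an order of magnitude depending on how the server fails.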
All of this makes the behavior very unpredictable, and tuning the timeouts for one case makes them too long (or too short) for the others.
This can be solved by introducing a new parameter that sets an absolute time limit on the transaction replay. This new parameter, provisionally named transaction_replay_timeout, would raise delayed_retry_timeout to be at least as large as transaction_replay_timeout. While the time spent replaying the transaction is less than transaction_replay_timeout, exceeding transaction_replay_attempts would be ignored. Conversely, the replay would stop once the time limit is exceeded, even if the attempt limit has not been reached.
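The proposed rules can be sketched as a small decision function. This is a minimal sketch of the behavior described above, not an actual implementation: both function names are hypothetical, and treating a transaction_replay_timeout of 0 as "disabled" is an assumption made for the example.

```python
def effective_delayed_retry_timeout(delayed_retry_timeout: float,
                                    transaction_replay_timeout: float) -> float:
    """delayed_retry_timeout is raised to at least transaction_replay_timeout."""
    return max(delayed_retry_timeout, transaction_replay_timeout)


def should_continue_replay(elapsed_seconds: float,
                           attempts_made: int,
                           transaction_replay_attempts: int,
                           transaction_replay_timeout: float) -> bool:
    """Decide whether another replay attempt is allowed.

    Assumption: transaction_replay_timeout == 0 means the time limit is
    disabled and only the attempt limit applies.
    """
    if transaction_replay_timeout > 0:
        # With a time limit set, only elapsed time matters: the attempt
        # limit is ignored while time remains, and the replay stops once
        # the limit is exceeded even if attempts are left.
        return elapsed_seconds < transaction_replay_timeout
    # No time limit: fall back to counting attempts.
    return attempts_made < transaction_replay_attempts


# Attempt limit exceeded, but still inside the time limit: keep replaying.
print(should_continue_replay(5.0, 10, 5, 30.0))
# Attempts left, but the time limit has passed: stop.
print(should_continue_replay(31.0, 1, 5, 30.0))
```

Under this scheme the total replay time is bounded by a single value regardless of which of the three failure modes occurs.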