[MXS-4149] Cooperative Transaction Replay Created: 2022-05-31 Updated: 2023-12-15 |
|
| Status: | Open |
| Project: | MariaDB MaxScale |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Icebox |
| Type: | New Feature | Priority: | Major |
| Reporter: | Juan | Assignee: | Joe Cotellese |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
Many customers run MaxScale in container environments where memory leaks and other stability issues can affect individual instances and require them to be restarted. Although we currently have good HA architecture patterns, client-side failover capability in the Java connector, and cooperative monitoring that lets multiple MaxScale instances work together effectively in managing back-end topologies, destroying an instance because it is leaking, for example, is still an incident with adverse consequences: whatever transactions are in-flight on connections handled by that MaxScale instance are lost. transaction_replay manages this problem beautifully in the event of a lost back-end server. A similar mechanism that caches the client connection state, lets a second MaxScale in an HA configuration recognize it, and delivers back to that client whatever results did not successfully make it from the destroyed MaxScale would close this gap. Replicating the transaction replay queue on every MaxScale is one way this could be accomplished, but since some MaxScale state information is already stored server-side (specifically, which MaxScale is the controlling instance in a cooperative group), replay and session information could also be stored on the database servers themselves. |
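For context, the two existing mechanisms the description leans on can be sketched in MaxScale's own configuration format. transaction_replay (readwritesplit) and cooperative_monitoring_locks (mariadbmon) are real parameters; the server, service, and monitor names are placeholders:

```ini
# Sketch of the existing building blocks mentioned above.
# Server/service/monitor names are placeholders.

[Read-Write-Service]
type=service
router=readwritesplit
servers=server1,server2
# Replays a transaction interrupted by the loss of a back-end server
transaction_replay=true

[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2
# Cooperative monitoring: multiple MaxScale instances coordinate via
# server-side locks so only one of them acts as the controlling instance
cooperative_monitoring_locks=majority_of_all
```

The feature request is effectively about extending the first mechanism across MaxScale instances, possibly by reusing the server-side state already introduced for the second.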
| Comments |
| Comment by markus makela [ 2022-06-01 ] |
|
Unless the connectors support some form of transaction replay, MaxScale alone cannot do this: if the MaxScale where the transaction is active goes down and the client connects to the second MaxScale, the client needs to somehow indicate which session it had on the old MaxScale instance for the new MaxScale instance to know which transaction to replay. Without connector support, this cannot be done. An alternative way to deal with these sorts of situations would be graceful shutdown of MaxScale nodes. This would allow open connections to be migrated to a replacement node once they're done with their active transactions. It wouldn't save transactions that are lost due to unexpected outages, but the "needing to restart" use-case would be served quite well by it. A mechanism similar to what is described in |
| Comment by markus makela [ 2022-06-01 ] |
|
A small step towards implementing something for this would be for MaxScale to generate a fake "session state change" by injecting a custom variable (e.g. the redirect_url from Something as simple as drain=<URL> as a listener parameter would allow this to be done at the listener level, and it could be wrapped into a maxctrl drain maxscale <URL> command that does it for all listeners. The only change in MaxScale would be that it starts signaling new (and possibly existing) connections with extra information stating that they should avoid this MaxScale and redirect to the given URL. This would start out as a manual process but, combined with something like configuration synchronization (i.e. config_sync_cluster), it could be automated so that all but the current MaxScale are included in the URL. |
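The proposed drain parameter could look something like the following. This is purely a hypothetical sketch of the idea in the comment above; no drain parameter exists in MaxScale today, and the URL is a placeholder:

```ini
# Hypothetical: listener-level drain signal (not an existing parameter).

[RW-Listener]
type=listener
service=Read-Write-Service
port=3306
# Proposed: tell clients to avoid this MaxScale and reconnect elsewhere
drain=mariadb://maxscale2.example.com:3306
```

The suggested maxctrl wrapper would then set this on every listener at once, e.g. maxctrl drain maxscale mariadb://maxscale2.example.com:3306.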
| Comment by Johan Wikman [ 2022-09-12 ] |
|
A prerequisite for this is that connectors support MDEV-15935. Currently there apparently is no activity on that front. Moreover, cooperative transaction replay would be very complex to arrange, as it would mean that the transaction recording made by one MaxScale would have to be stored in a manner that allows another MaxScale to later replay it. Furthermore, when a connection is, invisibly to the application, moved (when Recently, most MaxScale leaks have been caused by MaxScale being misconfigured. In such cases, implementing very complex functionality for moving a transaction from one MaxScale to another would not bring any actual benefit. MXS-4161 will attempt to address that, i.e. detect whether the resource consumption implied by the configuration is in conflict with the available resources. Moving the fix-version to 23.08 for now. |
| Comment by markus makela [ 2022-09-16 ] |
|
The MariaDB JDBC driver already has a built-in transaction replay feature. This is one way of solving the problem and could end up being a simpler solution than anything that is built into MaxScale. |
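The connector-side approach mentioned above can be sketched as follows. transactionReplay=true is a real MariaDB Connector/J parameter (available since 2.6), and the "sequential" mode tries the listed hosts in order, failing over to the next one; the MaxScale host names are placeholders:

```java
// Sketch: client-side transaction replay via MariaDB Connector/J,
// pointing the driver at two MaxScale instances instead of at the
// database servers directly. Host names are placeholders.
public class ReplayUrl {
    static String buildUrl(String... hosts) {
        // "sequential" fails over to the next host in the list;
        // transactionReplay makes the driver replay a transaction that
        // was interrupted by a lost connection.
        return "jdbc:mariadb:sequential://" + String.join(",", hosts)
                + "/test?transactionReplay=true";
    }

    public static void main(String[] args) {
        String url = buildUrl("maxscale1:3306", "maxscale2:3306");
        System.out.println(url);
        // In a real application:
        // try (Connection c = DriverManager.getConnection(url, user, pass)) {
        //     ... // transactions interrupted by a MaxScale restart are replayed
        // }
    }
}
```

With this setup, losing a MaxScale instance mid-transaction is handled entirely in the driver, which sidesteps the need for MaxScale instances to share replay state at all.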