[MXS-4153] Graceful Restart Created: 2022-06-02  Updated: 2023-12-15

Status: Open
Project: MariaDB MaxScale
Component/s: None
Affects Version/s: None
Fix Version/s: Icebox

Type: New Feature Priority: Major
Reporter: Rob Schwyzer Assignee: Joe Cotellese
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocks
blocks MXS-4149 Cooperative Transaction Replay Open
is blocked by CONC-599 Add support for connection redirection Open
is blocked by CONCPP-101 Add support for connection redirection Open
is blocked by CONJ-981 Add support for connection redirection Open
is blocked by CONJS-207 Add support for connection redirection Closed
is blocked by CONPY-207 Add support for connection redirection Open
is blocked by MDEV-15935 Connection Redirection Mechanism in M... Closed
is blocked by ODBC-364 Add support for connection redirection Open
is blocked by R2DBC-66 Add support for connection redirection Closed
Relates
relates to MDEV-32053 New features requested by customer on... Open
relates to MXS-4635 Provide load balancing metadata to co... Closed

 Description   

Markus put this well in MXS-4149:

An alternative way to deal with these sort of situations would be to have graceful shutdowns of MaxScale nodes. This would allow open connections to be migrated to a replacement node once they're done with their active transactions. This wouldn't save transactions that are lost due to unexpected outages but the use-case for "needing to restart" would be served quite well with this.

In short, many customers have reported getting into states where it becomes necessary to restart MaxScale to avoid a crash or other issue. One example is rising or runaway memory usage. In many cases, the causes leading up to this are detectable via monitoring, e.g. by tracking a server's remaining free memory or storage space. This means customers can proactively trigger the restart rather than waiting for a crash or true emergency.

These customers are already leveraging techniques like cooperative monitoring to obtain HA from multiple MaxScale nodes. So why is a regular restart not good enough? Because a regular restart terminates the connections currently open on the MaxScale node being restarted. To applications and clients, this makes MaxScale's HA setup appear and behave unreliably.

It should instead be possible for MaxScale to be aware that it has "sister" nodes to which it could migrate connections or transactions in these cases. A "graceful restart" mechanism, in which MaxScale drains its active connections to a "sister" node and directs future connections there before restarting, would resolve this concern. It would also give customers and their operations teams a valuable tool for helping themselves.
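
The drain behaviour requested above can be illustrated with a small model. This is only a sketch of the idea, not MaxScale code: the `Node` and `Connection` classes and all their methods are hypothetical, and the actual hand-over of a live client connection would depend on the connection redirection support tracked in MDEV-15935.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical stand-in for a client connection tracked by a proxy node."""
    conn_id: int
    in_transaction: bool = False


@dataclass
class Node:
    """Hypothetical proxy node; not a MaxScale class."""
    name: str
    draining: bool = False
    connections: dict = field(default_factory=dict)

    def accept(self, conn, sister):
        """Route a new connection: a draining node hands it to its sister."""
        target = sister if self.draining else self
        target.connections[conn.conn_id] = conn
        return target

    def drain_step(self, sister):
        """Move every connection that is not mid-transaction to the sister.

        Returns the number of connections still waiting on a transaction.
        """
        for conn_id in list(self.connections):
            if not self.connections[conn_id].in_transaction:
                sister.connections[conn_id] = self.connections.pop(conn_id)
        return len(self.connections)

    def safe_to_restart(self):
        """A node may restart once it is draining and holds no connections."""
        return self.draining and not self.connections
```

In this model an operator would set `draining`, call `drain_step` repeatedly until it returns 0, and restart the node only once `safe_to_restart()` is true. Connections with an open transaction stay where they are until the transaction ends, which matches the behaviour described in the quote from MXS-4149.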

This is beyond the initial scope, but once MXS-3822 is implemented, there will be a lot of runway in future MaxScale versions to enhance this feature, for example by enabling automatic graceful restarts.

MXS-4149 is related to this issue because it describes the preferable future state. However, the graceful restart functionality requested here is expected to be easier and quicker to implement, and it should provide a manual solution that customers can benefit from ASAP and build around as necessary. MXS-4149 and other, further improvements would be ways for MaxScale to add value.



 Comments   
Comment by markus makela [ 2022-06-10 ]

In order for this to work, MDEV-15935 needs to be implemented by the connectors.

Comment by Johan Wikman [ 2022-09-12 ]

As commented above, this can't be implemented unless MDEV-15935 is implemented by the connectors. As there currently appears to be no activity on that front, the fix version is tentatively moved to 23.08.

The need for this comes from dealing with "rising/runaway memory usage", which is often caused by MaxScale having been incorrectly configured. For example, threads=auto may be used when MaxScale runs in a container that has fewer CPUs and less memory than the host computer the container is running on. MXS-4161 may therefore help in avoiding the problem in the first place.
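
As an illustration of the container mismatch, the sketch below derives an explicit thread count from a cgroup v2 CPU quota instead of the host's CPU count. It is not how MaxScale or MXS-4161 works; the function name `threads_from_cpu_max` is hypothetical, and it assumes the container exposes its limit in the cgroup v2 `cpu.max` format ("<quota> <period>" in microseconds, or "max <period>" when unlimited).

```python
import math


def threads_from_cpu_max(cpu_max: str, host_cpus: int) -> int:
    """Return a thread count that respects a cgroup v2 CPU quota.

    cpu_max is the content of the cgroup v2 `cpu.max` file.
    host_cpus is the CPU count the host reports.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        # No quota set: the host CPU count is the effective limit.
        return host_cpus
    # Round partial CPUs up, but never exceed the host or drop below one.
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(limit, host_cpus))
```

For a container limited to two CPUs on a 16-CPU host, `threads_from_cpu_max("200000 100000", 16)` gives 2, which could then be set as an explicit `threads` value rather than relying on `auto`.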

Comment by markus makela [ 2023-06-28 ]

The connection_metadata parameter that was added for MXS-4635 can be used to implement this in a manual manner. This still requires that either the connectors implement it or that the client application itself reads the system variable changes and reacts to them.
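
A rough sketch of the "client application reads the system variable changes and reacts to them" option follows. Everything here is an assumption made for illustration: the variable name `redirect_url` and its `mariadb://host:port` form are taken to follow MDEV-15935, the `tracked_vars` dictionary stands in for whatever session-tracked variables the connector exposes, and both function names are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 3306


def parse_redirect_target(value: str):
    """Return (host, port) from a redirect value, or None if it is unusable."""
    value = value.strip()
    if not value:
        return None
    parts = urlsplit(value)
    # Assumed URL schemes; anything else is ignored rather than followed.
    if parts.scheme not in ("mariadb", "mysql") or not parts.hostname:
        return None
    return parts.hostname, parts.port or DEFAULT_PORT


def next_target(current, tracked_vars):
    """Pick the address for the next connection attempt.

    current is the (host, port) in use now; tracked_vars maps variable
    names to the values the server sent as metadata (assumed shape).
    """
    target = parse_redirect_target(tracked_vars.get("redirect_url", ""))
    return target or current
```

An application could call `next_target` after each transaction completes and reconnect to the returned address when it differs from the current one. That would approximate the manual graceful restart without connector support, at the cost of putting the logic in every client.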
