[MXS-2678] RSU on Galera via MaxScale Created: 2019-09-16  Updated: 2022-09-08  Resolved: 2022-09-08

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Assen Totin (Inactive) Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do Votes: 0
Labels: Galera

Epic Link: Galera Compatibility

 Description   

Long-running DDLs are known to cause problems on Galera clusters, which is why we promote the concept of RSU (Rolling Schema Upgrade) in MariaDB Server as a replacement. However, traditional RSU requires the process to be managed externally, and while this is doable via the MaxScale API, a native ability in MaxScale would be much appreciated.

It may go along these lines (or anything similar as long as it is easy to use and can be triggered by the client within its session without the need to have access to the MaxScale API):

  • RSU mode is enabled on the MaxScale session via an SQL statement (e.g. SET ...). The mode is per-session.
  • A DDL is sent to MaxScale in the same session.
  • MaxScale puts one Galera node in maintenance mode, sets the RSU variable on the node, runs the DDL, then unsets the RSU variable and enables the node.
  • MaxScale repeats the previous step on each remaining Galera node, one at a time.
  • The client is kept "on hold" during the whole execution flow, so the connection stays active and blocked.
  • Once the last node completes the DDL and gets back online, MaxScale releases the client.
  • The client unsets the RSU mode variable on MaxScale.
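The flow above can be sketched in Python as follows. This is only an illustration under stated assumptions, not MaxScale code: `FakeNode`, `set_maintenance()`, `execute()` and `rolling_schema_upgrade()` are hypothetical names. A real implementation would set maintenance mode through MaxScale itself (for example `maxctrl set server <name> maintenance`) and issue the SQL on one backend connection per node. `wsrep_OSU_method` is the Galera variable that switches a session between TOI and RSU; the sketch assumes TOI is the value to restore afterwards.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeNode:
    """Hypothetical stand-in for one Galera backend behind MaxScale.

    It only records what would happen; it does not talk to a server.
    """
    name: str
    in_maintenance: bool = False
    log: List[str] = field(default_factory=list)

    def set_maintenance(self, enabled: bool) -> None:
        # Stands in for e.g. `maxctrl set/clear server <name> maintenance`.
        self.in_maintenance = enabled
        self.log.append("maintenance on" if enabled else "maintenance off")

    def execute(self, sql: str) -> None:
        # Stands in for running `sql` on one session connected to this node.
        self.log.append(sql)


def rolling_schema_upgrade(nodes: List[FakeNode], ddl: str) -> List[str]:
    """Apply `ddl` to one node at a time using RSU.

    Returns the names of the nodes that completed, in order. If the DDL fails
    on a node, that node is restored and put back into service, the exception
    propagates, and the remaining nodes are left untouched.
    """
    completed = []
    for node in nodes:
        node.set_maintenance(True)  # stop routing client traffic to this node
        try:
            # Under RSU the DDL is applied locally and not replicated.
            node.execute("SET SESSION wsrep_OSU_method='RSU'")
            node.execute(ddl)
        finally:
            # Assumption: TOI was the session's previous value.
            node.execute("SET SESSION wsrep_OSU_method='TOI'")
            node.set_maintenance(False)  # route traffic to the node again
        completed.append(node.name)
    return completed


if __name__ == "__main__":
    cluster = [FakeNode("A"), FakeNode("B"), FakeNode("C")]
    print(rolling_schema_upgrade(cluster, "ALTER TABLE t1 ADD COLUMN c2 INT"))  # prints ['A', 'B', 'C']
```

Using `try`/`finally` means a failed DDL never leaves a node stuck in maintenance mode or in RSU, while the propagated exception keeps the remaining nodes from being touched.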

The actual way the DDL is executed on the Galera node is left to the node itself, so we don't need to implement anything more complex, such as the shadow-table DDL used by pt-osc.

Also, blocking the client is OK, since a non-blocking DDL would probably be more complex and confusing for the client. Keeping the connection intact is the client's responsibility. We likely only need to specify what happens if the connection is dropped before the process completes; I guess it would be OK to continue the RSU until it finishes (as our DDL is non-transactional).



 Comments   
Comment by markus makela [ 2019-09-26 ]

Would executing SQL statements serially on all nodes solve this? I'm thinking that existing code could be adapted to do this with a few modifications.

For example, when running ALTER TABLE ... with three configured servers (A, B and C), the statement would first be sent to A. After A has finished executing it, the statement would be sent to B, and after B has finished, to C. If this is adequate, the Cat router could be modified to do this quite easily.
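A minimal Python sketch of the serial execution described above, assuming each node applies the statement locally (for example under RSU) rather than replicating it to the others. `execute_serially()` and the `run(server, sql)` callback are hypothetical: the callback stands for a blocking call that sends the statement to one backend and raises on error. This does not reflect how the Cat router is actually implemented.

```python
from typing import Callable, Dict, List, Tuple


def execute_serially(
    servers: List[str],
    sql: str,
    run: Callable[[str, str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Send `sql` to each server in order, one at a time.

    `run(server, sql)` is a hypothetical blocking call that returns only when
    the server has finished executing `sql`, and raises on error.
    Returns (servers that completed, {failed server: error message});
    execution stops at the first failure, so later servers never see `sql`.
    """
    completed: List[str] = []
    for server in servers:
        try:
            run(server, sql)  # wait for this server before moving on
        except Exception as exc:
            return completed, {server: str(exc)}
        completed.append(server)
    return completed, {}


if __name__ == "__main__":
    sent: List[str] = []
    done, errors = execute_serially(
        ["A", "B", "C"],
        "ALTER TABLE t1 ADD COLUMN c2 INT",
        lambda server, sql: sent.append(server),
    )
    print(done, errors)  # prints ['A', 'B', 'C'] {}
```

Stopping at the first failure keeps the cluster in a state where at most one node differs from the rest, which is easier to reason about than continuing past an error.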

Comment by Assen Totin (Inactive) [ 2019-09-27 ]

I think so, yes. My idea was to serialise DDL execution at the expense of keeping the client blocked, but with the key benefit of not blocking the entire cluster during the DDL run, so that all other queries can still run on all remaining nodes.

This is why I thought of first removing a node (drain, maintenance mode, whatever), then applying the DDL, then rejoining it (hopefully within the IST window). I'm afraid that if we don't disconnect the node while the DDL is running, the queries it receives may become slow or be blocked completely.
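Before a node that was taken out for the DDL is put back into service, one could wait for Galera to report it as synced again. The Python sketch below polls the `wsrep_local_state_comment` status variable (a Galera status variable whose value is `Synced` on a fully synced node) until it reads `Synced` or a timeout expires. `wait_until_synced()` and the `read_state` callback are hypothetical; in practice `read_state` might run `SHOW STATUS LIKE 'wsrep_local_state_comment'` on the node. The clock and sleep functions are injectable only so the logic can be tested without waiting, and the default timeout is an arbitrary assumption.

```python
import time
from typing import Callable


def wait_until_synced(
    read_state: Callable[[], str],
    timeout: float = 300.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a node's Galera state until it reports 'Synced'.

    Returns True once the node is synced, or False if `timeout` seconds pass
    first (for instance, if the node needs a longer state transfer than expected).
    """
    deadline = clock() + timeout
    while True:
        if read_state() == "Synced":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)  # back off before checking the state again


if __name__ == "__main__":
    states = iter(["Joining", "Joined", "Synced"])
    print(wait_until_synced(lambda: next(states), interval=0.01))  # prints True
```

Returning `False` instead of raising leaves the decision to the caller, e.g. keeping the node in maintenance mode and alerting an operator rather than routing traffic to a node that has not caught up.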

Generated at Thu Feb 08 04:15:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.