Related to https://jira.mariadb.org/browse/CONJ-551
Following a recent manual failover in production, I was tasked with investigating the long recovery time of our instances on a two-node Aurora cluster. Since we have strict consistency requirements, we must always use the master instance to avoid inconsistencies caused by replica lag. Therefore, we use the "failover" (or "loadbalance") mode of the driver.
My investigation showed that the maximum connection lifetime was the only mechanism by which connections got reestablished to the newly elected master. Because the failover is manual, the former master (now the reader) stays up, but it throws the "read-only" exception on every modifying query. Nothing handles that exception, so a connection is only reestablished to the new master once its max lifetime has elapsed (by then, DNS propagation has usually happened). This can lead to very long downtime!
We ran tests in our develop environment with the aurora mode of the driver, and the results were really impressive. With a connection pool of 12 connections and 144 inserts/sec spread over 12 threads, the downtime following a manual failover was next to none: each thread logged one error and that was it, which means less than a second of downtime. Like I said, impressive, and good job on that!
This brings me to the title of this Jira: would it be possible to have an "aurora" mode combined with the "MasterProtocol", so that we could leverage the awesome failover capabilities while retaining master-only connections and data consistency?
In case someone else finds this, here are some possible workarounds:
- Have a background thread that resolves the IP address of the cluster endpoint and evicts the connections of your pool when it changes (Hikari has a neat softEvictConnections method). We will be using this for now
- Have a wrapper around your datasource that catches the read-only exception, evicts the connection and rethrows the exception
- After a manual failover, reboot the new reader (the former master) so that the existing connections die
- Have a very small max lifetime on your connections (something like 1-2 minutes)
- Have a validation query that checks the read-only status of the instance
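
For anyone who wants to try the first workaround, here is a minimal sketch of the background check, not a definitive implementation. The class name EndpointWatcher and its methods are hypothetical. The eviction itself is passed in as a plain Runnable so that the sketch does not depend on any particular pool, and the polling period is an arbitrary choice. Note that the JVM caches DNS lookups (see the networkaddress.cache.ttl security property), so the watcher can only notice a change once that cache has expired.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: runs a callback when the cluster endpoint resolves to a new address. */
final class EndpointWatcher {
    private final Supplier<Optional<String>> resolver;
    private final Runnable onChange;
    private String lastAddress; // null until the first successful resolution

    EndpointWatcher(Supplier<Optional<String>> resolver, Runnable onChange) {
        this.resolver = resolver;
        this.onChange = onChange;
    }

    /** Resolver backed by the JVM's DNS lookup; yields empty when resolution fails. */
    static Supplier<Optional<String>> dnsResolver(String host) {
        return () -> {
            try {
                return Optional.of(InetAddress.getByName(host).getHostAddress());
            } catch (UnknownHostException e) {
                return Optional.empty();
            }
        };
    }

    /** Performs one check; returns true if the address changed and the callback ran. */
    synchronized boolean check() {
        Optional<String> current = resolver.get();
        if (!current.isPresent()) {
            return false; // resolution failed: keep the last known address
        }
        String address = current.get();
        boolean changed = lastAddress != null && !lastAddress.equals(address);
        lastAddress = address;
        if (changed) {
            onChange.run(); // e.g. soft-evict every connection in the pool
        }
        return changed;
    }

    /** Polls on a daemon thread; call shutdown() on the returned scheduler to stop. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "endpoint-watcher");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                check();
            } catch (RuntimeException e) {
                // Swallowed on purpose: an uncaught exception would cancel future runs.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

With HikariCP, the callback could be something like `() -> dataSource.getHikariPoolMXBean().softEvictConnections()`, and the resolver `EndpointWatcher.dnsResolver(clusterEndpointHost)`, where dataSource and clusterEndpointHost are placeholders for your own pool and cluster endpoint.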