I was testing our new connection queuing functionality-
I used MaxScale v6.4.0 connected to a MariaDB Enterprise Server 10.6.8-4 backend. I have attached the maxscale.cnf I used for this. My 10.6.8-4 ES configuration is vanilla except for the below-
This syncs with my maxscale.cnf config for the backend of-
I leave extra max_connections on the backend due to persistent overhead from MariaDB Monitor and readwritesplit. I also like leaving a slot open for me to poll mariadb-admin stat proc on the backend as needed. In any case, I have validated MaxScale is configured and correctly adheres to only establishing 10 "extra" connections to the backend when given a query workload of >= 10 concurrent queries from a client.
I generate the workload via-
markus makela has successfully reproduced this issue via the above, so I am not including the 260MB log file MaxScale generates with log_info=true for this. Note that to reproduce this, you will need to create appropriate maxscale and slap users on the MariaDB backend.
When running the above, between 60 and 90 seconds into the workload mariadb-slap errors on a write query with the below-
Looking into MaxScale's error log, we find-
And then a whole lot of sessions and connections get closed from there.
Before approaching MaxScale Engineering, I reviewed all timeout configuration variables present in MaxScale and attempted setting them to maximum or infinite-equivalent values. This did not change MaxScale's behavior. Note the submitted configuration file lacks these extreme measures and focuses more directly on what the original use-case required.
From discussion with markus makela on Slack, the likely causes of this behavior are:
- A scheduling issue where a subset of connections is preferred over the others, causing some client sessions to spend a long time waiting to execute
- An internal, 60 second time limit
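To illustrate how these two causes could combine, here is a minimal toy simulation. It is not MaxScale code and does not claim to reflect how MaxScale actually schedules sessions. It assumes a hypothetical workload (20 clients, 10 backend slots, 1 second per query, 100 queries per client) and compares a fair (oldest waiter first) policy against an unfair (most recent waiter first) one. The function name and all parameters are made up for this sketch.

```python
import heapq
from collections import deque


def longest_wait(num_clients, slots, service_time, queries_per_client, fair):
    """Return the longest time any query waited for a free backend slot.

    Toy model only: every client issues its next query as soon as its
    previous one finishes. fair=True serves the oldest waiter first;
    fair=False serves the most recent waiter first.
    """
    remaining = [queries_per_client] * num_clients
    # Each entry is (client id, time the client started waiting).
    waiting = deque((client, 0.0) for client in range(num_clients))
    running = []  # min-heap of (finish time, sequence number, client id)
    seq = 0
    free = slots
    now = 0.0
    worst = 0.0

    while waiting or running:
        # Hand every free slot to a waiting client.
        while free and waiting:
            client, since = waiting.popleft() if fair else waiting.pop()
            worst = max(worst, now - since)
            remaining[client] -= 1
            heapq.heappush(running, (now + service_time, seq, client))
            seq += 1
            free -= 1
        # Advance time to the next query completion.
        now, _, client = heapq.heappop(running)
        free += 1
        if remaining[client] > 0:
            waiting.append((client, now))
    return worst


TIME_LIMIT = 60.0  # the suspected internal limit, in seconds

for fair in (True, False):
    wait = longest_wait(20, 10, 1.0, 100, fair)
    status = "exceeds" if wait > TIME_LIMIT else "stays within"
    print(f"fair={fair}: longest wait {wait:.0f}s {status} the {TIME_LIMIT:.0f}s limit")
```

In this model the fair policy keeps every wait at 1 second, while the unfair policy lets the same 10 clients monopolize the slots, so the other 10 wait the full 100 seconds. That is long enough to trip a 60 second limit even though total throughput is identical in both cases.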
While my testing only cares about how long it takes MaxScale to complete all queries being fed to it, I expect real-world use cases to care about per-connection latency, so correcting the scheduling issue is a top priority.
However, the internal 60 second timeout is also a problem we should address. I performed a similar test on competing database proxy software and verified that it lets users configure all such timeouts, so MaxScale should provide the same functionality.