[MXS-2761] no clear error message when user loading exceeds auth_read_timeout Created: 2019-11-08 Updated: 2020-12-08 Resolved: 2020-12-08 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Monitor |
| Affects Version/s: | 2.3.13, 2.4.2 |
| Fix Version/s: | 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Hartmut Holzgraefe | Assignee: | Unassigned |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
If the WITH RECURSIVE query reading user privileges and passwords takes longer than a second, the connection running the query is terminated and none of the mysqld users get known to maxscale. On the maxscale size things fail with
On the MariaDB server side there are just two reports about aborted connections
|
| Comments |
| Comment by Hartmut Holzgraefe [ 2019-11-08 ] | |
|
Ok, the one second timeout is due to exceeding auth_read_timeout, which defaults to one second. Raising the auth_read_timeout solves the problem, so the real problem is that maxscale does not report that it ran into a local timeout, but instead just reports
making it look as if the actual problem was completely on the database server side. This makes it really hard to understand what is actually really going on ... | |
| Comment by markus makela [ 2019-11-09 ] | |
|
If I remember correctly, this is because Connector-C (which we use to query users) doesn't report when a query fails due to a timeout. | |
| Comment by Hartmut Holzgraefe [ 2019-11-09 ] | |
|
I can understand the need for a timeout. My main problems were:
but most important:
which lead to a lot of wasted time trying to figure out what was actually going on (including analyzing connection TCP dumps) A clear "Failed to load users from server 'server1': auth_read_timeout expired" would have allowed to solve this part of the issue within minutes instead of days ... | |
| Comment by markus makela [ 2019-11-09 ] | |
|
The default could be changed to a less aggressive one as one second is a bit too low in most cases. The accompanying documentation could also use the default value as the example. The misleading error is simply what is returned to MaxScale from the C connector. A broken network connection and hitting a timeout appear identical from MaxScale's point of view. The only reliable way to detect whether the read timed out or was broken due to a network problem would be to have the connector expose this information, possibly by return a different error for timeouts. I don't know if this will happen any time soon. Another way to handle this would be to treat all network errors that happen close enough to the configured value as timeouts. For example, if the connector returns a network error after 0.95 seconds and the timeout was configured to 1 second, we could change the error message to suggest that the disconnection might've happened due to a timeout being exceeded. We wouldn't be certain but it would at least be a reasonably accurate guess that would help get the users on the right track. | |
| Comment by markus makela [ 2020-12-08 ] | |
|
Nevermind, that's a completely separate thing. Regardless, this is much less likely to happen in 2.5 with the rewrite of the authentication queries. I'll still close this issue as I expect this problem to not happen again. |