[MXS-2761] no clear error message when user loading exceeds auth_read_timeout Created: 2019-11-08  Updated: 2020-12-08  Resolved: 2020-12-08

Status: Closed
Project: MariaDB MaxScale
Component/s: Monitor
Affects Version/s: 2.3.13, 2.4.2
Fix Version/s: 2.5.0

Type: Bug Priority: Major
Reporter: Hartmut Holzgraefe Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-2759 Nontrivial number of privilege entrie... Closed

 Description   

If the WITH RECURSIVE query reading user privileges and passwords takes longer than a second, the connection running the query is terminated and none of the mysqld users get known to maxscale.

On the maxscale size things fail with

Nov 08 23:28:39 maxscale maxscale[3249]: Failed to load users from server 'server1': Lost connection to MySQL server during query
Nov 08 23:28:39 maxscale maxscale[3249]: Added user: INSERT OR REPLACE INTO mysqlauth_users VALUES ('maxscale', '%', NULL, 1, '14E65567ABDB5135D0CFD9A70B3032C179A49EE7')
Nov 08 23:28:39 maxscale maxscale[3249]: Added user: INSERT OR REPLACE INTO mysqlauth_users VALUES ('maxscale', 'localhost', NULL, 1, '14E65567ABDB5135D0CFD9A70B3032C179A49EE7')
Nov 08 23:28:39 maxscale maxscale[3249]: [Read-Write-Service] No users were loaded but 'inject_service_user' is enabled. Enabling service credentials for authentication until database users have been successfully loaded.
Nov 08 23:28:39 maxscale maxscale[3249]: [MySQLAuth] Read-Write-Service: login attempt for user 'foo2'@[127.0.0.1]:35828, authentication failed. User not found.

On the MariaDB server side there are just two reports about aborted connections

Nov 08 23:28:38 node-1 mysqld[4888]: 2019-11-08 23:28:38 120 [Warning] Aborted connection 120 to db: 'mysql' user: 'maxscale' host: 'maxscale' (Got an error writing communication packets)
Nov 08 23:28:39 node-1 mysqld[4888]: 2019-11-08 23:28:39 121 [Warning] Aborted connection 121 to db: 'mysql' user: 'maxscale' host: 'maxscale' (Got an error writing communication packets)



 Comments   
Comment by Hartmut Holzgraefe [ 2019-11-08 ]

Ok, the one second timeout is due to exceeding auth_read_timeout, which defaults to one second.

Raising the auth_read_timeout solves the problem, so the real problem is that maxscale does not report that it ran into a local timeout, but instead just reports

Failed to load users from server 'server1': Lost connection to MySQL server during query

making it look as if the actual problem was completely on the database server side.

This makes it really hard to understand what is actually really going on ...

Comment by markus makela [ 2019-11-09 ]

If I remember correctly, this is because Connector-C (which we use to query users) doesn't report when a query fails due to a timeout.

Comment by Hartmut Holzgraefe [ 2019-11-09 ]

I can understand the need for a timeout.

My main problems were:

  • auth_read_timeout apparently only being mentioned in the "Global settings" reference section of the docs
  • the one second default being mentioned in the text, but the example assignment, which stands out much more optically, showing 10 seconds as example value

but most important:

  • the misleading "lost connection to server" error message

which lead to a lot of wasted time trying to figure out what was actually going on (including analyzing connection TCP dumps)

A clear "Failed to load users from server 'server1': auth_read_timeout expired" would have allowed to solve this part of the issue within minutes instead of days ...

Comment by markus makela [ 2019-11-09 ]

The default could be changed to a less aggressive one as one second is a bit too low in most cases. The accompanying documentation could also use the default value as the example.

The misleading error is simply what is returned to MaxScale from the C connector. A broken network connection and hitting a timeout appear identical from MaxScale's point of view. The only reliable way to detect whether the read timed out or was broken due to a network problem would be to have the connector expose this information, possibly by return a different error for timeouts. I don't know if this will happen any time soon.

Another way to handle this would be to treat all network errors that happen close enough to the configured value as timeouts. For example, if the connector returns a network error after 0.95 seconds and the timeout was configured to 1 second, we could change the error message to suggest that the disconnection might've happened due to a timeout being exceeded. We wouldn't be certain but it would at least be a reasonably accurate guess that would help get the users on the right track.

Comment by markus makela [ 2020-12-08 ]

Tested in 2.5 and it works as expected:

Monitor timed out when connecting to server server3[127.0.0.1:3002]

Nevermind, that's a completely separate thing.

Regardless, this is much less likely to happen in 2.5 with the rewrite of the authentication queries. I'll still close this issue as I expect this problem to not happen again.

Generated at Thu Feb 08 04:16:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.