[CONJ-762] Connection hanging when using a CNAME aurora endpoint with authentication Created: 2020-02-20  Updated: 2020-03-05  Resolved: 2020-03-04

Status: Closed
Project: MariaDB Connector/J
Component/s: aurora
Affects Version/s: 2.5.3, 2.5.4
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Jacques-Etienne Beaudet Assignee: Diego Dupin
Resolution: Not a Bug Votes: 1
Labels: None

Issue Links:
Relates
relates to CONJ-766 Adding a socket timeout until complet... Closed

 Description   

There has been a regression in 2.5.3 following this commit for our use case.

Test setup:

  • One Aurora instance (version doesn't matter). For the example, let's say the cluster endpoint is "aurora-cluster.aws.com"
  • One CNAME record (can be in /etc/hosts or a real record): "cname.test.com" -> "aurora-cluster.aws.com". This could also be an A record pointing to the cluster IP.

Code setup (this will work):

java.util.Properties info = new java.util.Properties();
info.put("user", "anythingValid");
info.put("password", "anythingValid");
return new Driver().connect("jdbc:mariadb://aurora-cluster.aws.com/", info);

Code setup (this will not work):

java.util.Properties info = new java.util.Properties();
info.put("user", "anythingValid");
info.put("password", "anythingValid");
return new Driver().connect("jdbc:mariadb://cname.test.com/", info);

I've dug into it a bit, and the difference lies in UrlParser::auroraPipelineQuirks, where pipeline auth is disabled for the cluster endpoint because it matches the Aurora regex, while the CNAME record obviously does not. Sending that extra byte seems to make a difference at some level, though I'm not that familiar with the driver's inner workings.
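To illustrate why the CNAME slips past the detection, here is a minimal sketch of hostname-based Aurora detection. The pattern below is hypothetical and loosely modeled on real Aurora cluster endpoint naming; the driver's actual regex in UrlParser::auroraPipelineQuirks may differ.

```java
import java.util.regex.Pattern;

public class AuroraDetectionSketch {
    // Hypothetical pattern, not the driver's actual regex: an Aurora cluster
    // endpoint typically looks like "<name>.cluster-<id>.<region>.rds.amazonaws.com".
    private static final Pattern AURORA_ENDPOINT =
        Pattern.compile(".*\\.cluster-[0-9a-z]+\\.[a-z0-9-]+\\.rds\\.amazonaws\\.com",
                        Pattern.CASE_INSENSITIVE);

    static boolean looksLikeAurora(String host) {
        return AURORA_ENDPOINT.matcher(host).matches();
    }

    public static void main(String[] args) {
        // A real cluster endpoint matches, so pipelining quirks are applied...
        System.out.println(looksLikeAurora("mydb.cluster-abc123.us-east-1.rds.amazonaws.com"));
        // ...but a CNAME pointing at that endpoint does not, so pipelining stays enabled.
        System.out.println(looksLikeAurora("cname.test.com"));
    }
}
```

Because detection is purely lexical on the hostname the user supplied, any aliasing layer (CNAME, A record, /etc/hosts) defeats it.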

There are definitely timing issues here: while debugging I was sometimes able to connect successfully, but running the code with no breakpoints I reproduced the issue every time.

Here is the stack trace of the stuck thread:

owns: ReadAheadBufferedStream  (id=23)	
SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]	
SocketInputStream.socketRead(FileDescriptor, byte[], int, int, int) line: 116	
SocketInputStream.read(byte[], int, int, int) line: 171	
SocketInputStream.read(byte[], int, int) line: 141	
ReadAheadBufferedStream(FilterInputStream).read(byte[], int, int) line: 133	
ReadAheadBufferedStream.fillBuffer(int) line: 129	
ReadAheadBufferedStream.read(byte[], int, int) line: 102	
StandardPacketInputStream.getPacketArray(boolean) line: 241	
StandardPacketInputStream.getPacket(boolean) line: 212	
MasterProtocol(AbstractQueryProtocol).readPacket(Results) line: 1443	
MasterProtocol(AbstractQueryProtocol).getResult(Results) line: 1424	
MasterProtocol(AbstractConnectProtocol).readRequestSessionVariables(Map<String,String>) line: 884	
MasterProtocol(AbstractConnectProtocol).readPipelineAdditionalData(Map<String,String>) line: 929	
MasterProtocol(AbstractConnectProtocol).postConnectionQueries() line: 793	
MasterProtocol(AbstractConnectProtocol).createConnection(HostAddress, String) line: 549	
MasterProtocol(AbstractConnectProtocol).connectWithoutProxy() line: 1236	
Utils.retrieveProxy(UrlParser, GlobalStateInfo) line: 610	
MariaDbConnection.newConnection(UrlParser, GlobalStateInfo) line: 142	
Driver.connect(String, Properties) line: 86	

Happy to provide more info if necessary, thx!



 Comments   
Comment by Stephen Mills [ 2020-02-28 ]

While this isn't an issue for smaller customers, large customers frequently use CNAME records to connect to databases. Because of how the driver detects Aurora, there is effectively no benefit to using the MariaDB driver in that setup. CNAME records have a real advantage: moving a database to a new cluster/server can otherwise have a large code-roll impact, and a CNAME lets you move the database without developers having to modify their code and redeploy. There are other solutions to the same problem, of course, but DNS is the lowest common denominator, available to all systems; it's why AWS uses CNAME records for Aurora.

Please consider addressing this either by changing the lookup process or by working with AWS to include the entire instance endpoint in the information_schema.replica_host_status table. Either way would make the driver a lot more flexible.

Comment by Diego Dupin [ 2020-03-04 ]

Aurora uses a proxy that has a bug when using pipelining (i.e., send command A + send command B + read result A + read result B).
Pipelining can be disabled with the options useBatchMultiSend and usePipelineAuth.

So adding "&useBatchMultiSend=false&usePipelineAuth=false" to the connection string will solve this kind of problem.
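The suggested workaround can be sketched as below; the host and database names are placeholders, and the helper is only illustrative:

```java
import java.util.Properties;

public class CnameWorkaround {
    // Build a JDBC URL with pipelining disabled, so the connection works even
    // when the driver cannot recognize the host as an Aurora endpoint.
    static String buildUrl(String host, String db) {
        return "jdbc:mariadb://" + host + "/" + db
                + "?useBatchMultiSend=false&usePipelineAuth=false";
    }

    public static void main(String[] args) {
        Properties info = new Properties();
        info.put("user", "anythingValid");
        info.put("password", "anythingValid");
        // Pass this URL to new Driver().connect(url, info) as in the report.
        System.out.println(buildUrl("cname.test.com", "mydb"));
    }
}
```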

Feel free to report that to AWS if you have the possibility (there have been several reports to AWS without a fix).

Comment by Diego Dupin [ 2020-03-04 ]

I'm closing this issue as not a bug, since the driver cannot distinguish Aurora from a normal server, but feel free to create a new issue linked to this one if that doesn't solve your problem.

Comment by Jacques-Etienne Beaudet [ 2020-03-04 ]

Couldn't the driver provide a better way around this? I was thinking of something like a "forceAuroraMode" option. Having to disable two seemingly unrelated options to make it work looks cryptic, in my opinion. Documentation could help too.

As for the original issue, would it be possible to time out at some point instead of hanging the thread indefinitely?

Thanks

Comment by Diego Dupin [ 2020-03-05 ]

The problem with this kind of error is that the exchanges are left in an improper state: the proxy discards some bytes meant for the driver, and the driver cannot know the state.

Technically, there is a bad solution: add a specific timeout (say, 1 second) when reading the additional information after authentication, then force a reconnection without the pipelining options. The problem is that this would add the timeout to every connection creation.

A better solution might be to add a specific timeout for retrieving the additional information and throw an exception telling the user to disable those options. I'll create a task for that.
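The proposed mitigation amounts to bounding the blocking read shown in the stack trace. A minimal sketch, assuming a plain java.net.Socket; the method and message are illustrative, not actual driver code:

```java
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PipelineReadTimeoutSketch {
    // Read with a bounded timeout while fetching the post-authentication
    // pipeline results, and surface a descriptive error instead of blocking forever.
    static int readWithTimeout(Socket socket, byte[] buf, int timeoutMillis) throws Exception {
        int previous = socket.getSoTimeout();
        socket.setSoTimeout(timeoutMillis);   // bound the blocking SocketInputStream.read
        try {
            InputStream in = socket.getInputStream();
            return in.read(buf);
        } catch (SocketTimeoutException e) {
            throw new Exception(
                "No response to pipelined commands; if connecting to Aurora through "
                + "a CNAME, try usePipelineAuth=false&useBatchMultiSend=false", e);
        } finally {
            socket.setSoTimeout(previous);    // restore the connection's normal timeout
        }
    }
}
```

Applying the timeout only to this read keeps ordinary connection creation unaffected, which addresses the concern about penalizing every connection.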

Comment by Jacques-Etienne Beaudet [ 2020-03-05 ]

Yeah, that would be great; an everlastingly stuck connection can be sneaky. In our case we only noticed it in production, via another issue that led me to take a thread dump of my application and see a bunch of threads stuck here. If they had thrown an exception I would certainly have caught it in development, and it will be even better with a proper exception message like you're suggesting.

We have AWS technical account managers at my job; I've asked about the proxy bug you mentioned and will get back to you if I learn anything interesting.

Thanks a lot

Generated at Thu Feb 08 03:18:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.