[CONJ-762] Connection hanging when using a CNAME aurora endpoint with authentication Created: 2020-02-20 Updated: 2020-03-05 Resolved: 2020-03-04 |
|
| Status: | Closed |
| Project: | MariaDB Connector/J |
| Component/s: | aurora |
| Affects Version/s: | 2.5.3, 2.5.4 |
| Fix Version/s: | N/A |
| Type: | New Feature | Priority: | Major |
| Reporter: | Jacques-Etienne Beaudet | Assignee: | Diego Dupin |
| Resolution: | Not a Bug | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
There has been a regression in 2.5.3 following this commit for our use case. Test setup :
Code setup (this will work) :
Code setup (this will not work) :
I've dug a bit into it, and the difference lies in UrlParser::auroraPipelineQuirks, where pipeline auth is disabled for the cluster endpoint because it matches the Aurora regex, while the CNAME record obviously does not. Sending that extra byte seems to make a difference at some level, though I'm not that familiar with the inner workings of the driver in this area. There are definitely timing issues here: while debugging I was sometimes able to connect successfully, but when running the code with no breakpoints I always reproduced the issue. Here is the stack trace when the thread is stuck :
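For illustration, the host check the reporter describes could look roughly like the sketch below. The regex here is an assumption based on Aurora's documented cluster-endpoint naming (`<name>.cluster-<id>.<region>.rds.amazonaws.com`), not the driver's actual pattern; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class AuroraEndpointCheck {
    // Hypothetical pattern: Aurora cluster endpoints contain ".cluster-"
    // followed by the RDS domain. A customer-defined CNAME alias for the
    // same cluster will not match it.
    private static final Pattern AURORA_CLUSTER = Pattern.compile(
            ".*\\.cluster-[a-z0-9]+\\.[a-z0-9-]+\\.rds\\.amazonaws\\.com",
            Pattern.CASE_INSENSITIVE);

    static boolean looksLikeAuroraCluster(String host) {
        return AURORA_CLUSTER.matcher(host).matches();
    }

    public static void main(String[] args) {
        // Matches: pipelining quirks would be disabled, connection works.
        System.out.println(looksLikeAuroraCluster(
                "mydb.cluster-abc123.us-east-1.rds.amazonaws.com"));
        // CNAME alias does not match, so pipelining stays enabled
        // and the connection can hang behind the Aurora proxy.
        System.out.println(looksLikeAuroraCluster("db.example.com"));
    }
}
```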
Happy to provide more info if necessary, thx! |
| Comments |
| Comment by Stephen Mills [ 2020-02-28 ] |
|
While this isn't an issue for smaller customers, large customers frequently use CNAME records to connect to databases. Because of how the driver handles Aurora, there is effectively no benefit to using the MariaDB driver in that setup. Using CNAME records has a real advantage: moving a database to a new cluster/server can otherwise have a large code-roll impact. A CNAME record lets you easily move a database to a new cluster without developers having to modify their code and redeploy. Of course there are other solutions to the same problem, but DNS is the lowest common denominator, available to all systems. It's why AWS uses CNAME records for Aurora. Please consider addressing this by either changing the lookup process or perhaps working with AWS to have them include the entire instance endpoint in the table information_schema.replica_host_status. Either way would make the driver a lot more flexible. |
| Comment by Diego Dupin [ 2020-03-04 ] |
|
Aurora uses a proxy that has a bug when using pipelining (i.e. send command A + send command B + read result A + read result B). So adding "&useBatchMultiSend=false&usePipelineAuth=false" to the connection string will solve this kind of problem. Feel free to report this to AWS if you have the possibility (there have been several reports to the AWS team, without a fix so far) |
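Concretely, the workaround lands in the JDBC URL like this (a sketch: the host and database names are placeholders, and the option names come straight from the comment above):

```java
public class AuroraUrlWorkaround {
    // Build a MariaDB connection URL with pipelining disabled, so the
    // Aurora proxy bug described above is not triggered even when
    // connecting through a CNAME the driver cannot recognize as Aurora.
    static String buildUrl(String host, String db) {
        return "jdbc:mariadb://" + host + ":3306/" + db
                + "?useBatchMultiSend=false&usePipelineAuth=false";
    }

    public static void main(String[] args) {
        // Placeholder CNAME and database name; substitute your own.
        System.out.println(buildUrl("db.example.com", "mydb"));
        // Then connect as usual, e.g.:
        // Connection con = DriverManager.getConnection(buildUrl(host, db), user, pass);
    }
}
```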
| Comment by Diego Dupin [ 2020-03-04 ] |
|
I'm closing this issue as not a bug, since the driver cannot distinguish Aurora from a normal server, but feel free to create a new issue linked to this one if that doesn't solve your problem |
| Comment by Jacques-Etienne Beaudet [ 2020-03-04 ] |
|
Couldn't the driver provide a better way to work around this? I was thinking of something like a "forceAuroraMode" option. Having to disable two seemingly unrelated options to make it work looks cryptic in my opinion; documentation could help too. As for the original issue, would it be possible to time out at some point rather than hang the thread indefinitely? Thanks |
| Comment by Diego Dupin [ 2020-03-05 ] |
|
The problem with this kind of error is that the exchange is left in an improper state: the proxy discards some bytes meant for the driver, and the driver cannot know that. Technically, there is a bad solution that would consist of adding a specific timeout (say 1 second) when reading the additional information after authentication, then forcing a reconnection without the pipelining options. The problem is that this would add the timeout to every connection creation. A better solution might be to add a specific timeout when retrieving the additional information, and throw an exception indicating that those options should be disabled. I'll create a task for that. |
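The "fail instead of hang" idea can be demonstrated with a plain socket read timeout (a standalone sketch, not driver code): a server that accepts the connection but never replies stands in for the proxy that swallowed the response, and the client's SO_TIMEOUT turns an indefinite block into a catchable exception.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Server that accepts the connection but never sends a byte,
        // mimicking a proxy that discarded the expected response.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread silent = new Thread(() -> {
                try (Socket ignored = server.accept()) {
                    Thread.sleep(5_000); // hold the connection open, send nothing
                } catch (IOException | InterruptedException e) {
                    // shutting down
                }
            });
            silent.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(1_000);     // the proposed ~1 s read timeout
                client.getInputStream().read(); // would otherwise block forever
                System.out.println("got data (unexpected)");
            } catch (SocketTimeoutException e) {
                // Here a driver could throw a descriptive exception
                // suggesting usePipelineAuth=false instead of hanging.
                System.out.println("timed out instead of hanging");
            }
            silent.interrupt();
        }
    }
}
```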
| Comment by Jacques-Etienne Beaudet [ 2020-03-05 ] |
|
Yeah, that would be great; an everlastingly stuck connection can be sneaky. In our case we only realized it in production via another issue that made me thread-dump the application, where I saw a bunch of threads stuck on this. If they had thrown an exception I would certainly have caught it in development, and it will be even better with a proper exception message like you're suggesting. We have AWS technical account managers at my job; I've asked about the proxy bug you're mentioning and will get back to you if I learn anything interesting. Thanks a lot |