[CONJ-502] Aurora JDBC connections leaking between connection pools Created: 2017-07-06 Updated: 2017-11-21 Resolved: 2017-07-28 |
|
| Status: | Closed |
| Project: | MariaDB Connector/J |
| Component/s: | Failover |
| Affects Version/s: | 1.5.9 |
| Fix Version/s: | 2.1.0, 1.6.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Joshua Garnett | Assignee: | Diego Dupin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
So this bug report is going to be a little light on concrete examples that reproduce the behavior we are seeing, as it has been incredibly hard to duplicate in staging, so please bear with me as I explain. The TL;DR: we shard across multiple HikariCP pools backed by Aurora. In production we observed SQL exceptions after adding a new delete query, and then observed entities being written to the wrong shard, which after a code review most likely points to the HikariCP pool or the MariaDB driver returning the wrong connection.

Our application is built on top of Akka Persistence, which follows an event-sourcing model. Each event is persisted as a new row in the database. Every 50 events (or journals) we create a snapshot of the full state. On recovery we replay the latest snapshot and all journals after that snapshot. At the application level we shard this data across multiple databases. Example schema:
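The schema snippet itself is not included here; as a stand-in, this is a minimal sketch of the kind of journal/snapshot tables Akka Persistence uses, with illustrative table and column names (not the reporter's actual DDL), one such pair of tables per shard database:

```sql
-- Hypothetical sketch; table and column names are illustrative assumptions.
CREATE TABLE journal (
  persistence_id VARCHAR(255) NOT NULL,  -- entity identifier
  sequence_nr    BIGINT       NOT NULL,  -- monotonically increasing per entity
  message        BLOB         NOT NULL,  -- serialized event
  PRIMARY KEY (persistence_id, sequence_nr)
);

CREATE TABLE snapshot (
  persistence_id VARCHAR(255) NOT NULL,
  sequence_nr    BIGINT       NOT NULL,  -- taken every ~50 events
  snapshot       BLOB         NOT NULL,  -- serialized full state
  PRIMARY KEY (persistence_id, sequence_nr)
);
```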
On 6/26 we released some new code to production, which enabled cleanup of the journal and snapshots. Example query:
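The query itself is not included here; a hedged sketch of a cleanup query of this shape, reusing the illustrative names above (not the reporter's actual SQL):

```sql
-- Hypothetical sketch: delete all journal rows for one entity up to and
-- including the sequence number of the most recent snapshot.
DELETE FROM journal
WHERE persistence_id = ?
  AND sequence_nr <= ?;
```

On large tables a statement like this can run long enough to starve or break pooled connections, which matches the symptoms described below.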
We had manual processes in place to clean things up, but the tables were quite large, so some of the delete queries took a long time and we started observing connection link exceptions.
At the same time these exceptions were occurring, we started seeing entities being written to the wrong shards. We did an in-depth analysis of our code and have not been able to find any bugs that would cause entities to be written to the wrong shard. All logs indicate that our shards maintained their correct order and that all entities were being written to the correct shard. At this point it appears to be caused either by our usage of HikariCP or by the MariaDB driver with Aurora failover enabled. Our best guess is that when an exception occurred, either HikariCP returned a connection to the wrong pool, or the MariaDB driver handed the connection pool a new connection pointing to a different shard than specified by the config. Any insights or help tracking down this issue would be greatly appreciated. Below are some more technical details. Java 8u102. Hashing code:
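The hashing code itself is not included here; this is a minimal sketch of shard-selection hashing of the kind described, assuming an illustrative hash function and class/method names (`ShardHash`, `shardFor`) that are not the reporter's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: map a persistence id to a stable shard index so that
// the same entity always lands on the same database/pool.
public class ShardHash {

    // Returns a shard index in [0, numShards) that is deterministic for a
    // given persistence id.
    public static int shardFor(String persistenceId, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(persistenceId.getBytes(StandardCharsets.UTF_8));
        // CRC32 value is an unsigned 32-bit number held in a long, so the
        // modulo result is always non-negative.
        return (int) (crc.getValue() % numShards);
    }
}
```

The point of a scheme like this is that shard selection is a pure function of the entity id, so a correctly chosen pool can still write to the wrong database only if the pool hands back a connection pointing elsewhere.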
Connection settings:
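The settings themselves are not included here; a hedged sketch of per-shard pool settings of the kind described, with illustrative hostnames and values (not the reporter's actual configuration). The relevant detail is the `jdbc:mariadb:aurora:` failover prefix:

```properties
# Hypothetical per-shard HikariCP configuration; URL, credentials, and sizes
# are illustrative assumptions.
jdbcUrl=jdbc:mariadb:aurora://shard1.cluster-example.us-east-1.rds.amazonaws.com:3306/journal
username=app
maximumPoolSize=10
```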
|
| Comments |
| Comment by Joshua Garnett [ 2017-07-06 ] | |||||||
|
Digging into this further, I've noticed that in a few places the code extracts and uses only a portion of the Aurora URL. This can be observed in the AuroraListener.findClusterHostAddress method. | |||||||
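To illustrate why partial-URL extraction is risky, here is a sketch (not the driver's actual regex, and `suffixOf` is a hypothetical helper): every Aurora instance endpoint in the same account and region shares the same DNS suffix, so a suffix-based pattern cannot tell two clusters apart.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch in the spirit of AuroraListener.findClusterHostAddress:
// capture the instance name and the account/region DNS suffix separately.
public class AuroraHostMatch {

    // Hypothetical pattern: <instance>.<account-id>.<region>.rds.amazonaws.com
    static final Pattern AURORA_HOST =
        Pattern.compile("(.+)\\.([a-z0-9]+\\.[a-z0-9\\-]+\\.rds\\.amazonaws\\.com)");

    // Returns the shared DNS suffix of an Aurora host, or null if it does
    // not look like an Aurora endpoint.
    public static String suffixOf(String host) {
        Matcher m = AURORA_HOST.matcher(host);
        return m.matches() ? m.group(2) : null;
    }
}
```

Because instances of two different clusters in the same account and region produce the same suffix, a host blacklisted by one pool can look like a valid candidate for another pool's cluster.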
| Comment by Joshua Garnett [ 2017-07-06 ] | |||||||
|
I've been able to reproduce our issue by rebooting the cluster's writer instances while a stress test is running. After changing the JDBC URL prefix from jdbc:mariadb:aurora: to jdbc:mariadb: the problem went away. | |||||||
| Comment by Joshua Garnett [ 2017-07-07 ] | |||||||
|
After some further investigation, it looks like the blacklist in AbstractMastersListener is a static variable. If two clusters generate a failure at once, hosts from both are added to that shared blacklist. In AuroraListener.reconnectFailedConnection, the blacklisted hosts are appended to the list of addresses to loop over (lines 157-159). Looking at AuroraListener.findClusterHostAddress and the associated regex on line 73, you can see how a blacklisted host from the wrong cluster could potentially get selected. | |||||||
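A minimal sketch of the hazard described above, with hypothetical names (`ListenerSketch`, `hostsToTry`, and a deliberately simplified reconnect loop, not the driver's actual code); the point is only that a static blacklist is JVM-global, so one pool's fallback can pick up another pool's hosts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a listener whose blacklist is a static field,
// i.e. shared by every pool/listener in the JVM rather than per-pool.
public class ListenerSketch {

    // Static, hence JVM-global: entries added by any pool are seen by all.
    public static final Map<String, Long> BLACKLIST = new ConcurrentHashMap<>();

    private final List<String> configuredHosts;

    public ListenerSketch(List<String> configuredHosts) {
        this.configuredHosts = configuredHosts;
    }

    public void blacklist(String host) {
        BLACKLIST.put(host, System.currentTimeMillis());
    }

    // Mirrors the reconnect behavior described above: try the configured
    // hosts first, and once they are all blacklisted, fall back to the
    // ENTIRE global blacklist -- which includes other pools' hosts.
    public List<String> hostsToTry() {
        List<String> candidates = new ArrayList<>(configuredHosts);
        candidates.removeAll(BLACKLIST.keySet());
        if (candidates.isEmpty()) {
            candidates.addAll(BLACKLIST.keySet()); // cross-pool leak
        }
        return candidates;
    }
}
```

With two listeners for two different clusters, blacklisting the only host of each makes the first listener's candidate list include the second cluster's host.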
| Comment by Diego Dupin [ 2017-07-28 ] | |||||||
|
The issue occurs when multiple pools run in the same JVM in a failover configuration. When a connection in a pool fails, the driver tries to recover it by attempting to connect to one of the remaining non-blacklisted hosts. The problem is that when all of a pool's hosts are blacklisted, the driver then retries with ALL the blacklisted hosts, not only the ones configured for that pool. So if another pool also had blacklisted hosts, those can be connected to if the authentication information matches. |
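A minimal sketch of the corresponding remedy, under the assumption (with hypothetical names, not the actual patch) that the blacklist fallback is intersected with the pool's own configured hosts so the retry never leaves the pool:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: scope blacklist-fallback retries to this pool's
// configured hosts, never to hosts blacklisted by other pools in the JVM.
public class ScopedRetry {

    public static List<String> hostsToTry(List<String> configured,
                                          Set<String> globalBlacklist) {
        List<String> fresh = new ArrayList<>(configured);
        fresh.removeAll(globalBlacklist);
        if (!fresh.isEmpty()) {
            return fresh; // some configured hosts are still usable
        }
        // All configured hosts are blacklisted: retry only with the
        // blacklisted hosts that actually belong to this pool.
        List<String> ownBlacklisted = new ArrayList<>(configured);
        ownBlacklisted.retainAll(globalBlacklist);
        return ownBlacklisted;
    }
}
```

With this scoping, a pool whose single host is blacklisted retries that host rather than a host blacklisted by a different pool that happens to accept the same credentials.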