Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 1.5.9
- Fix Version/s: None
Description
This report is going to be a little light on concrete steps to duplicate the behavior, as it has been incredibly hard to reproduce in staging, so please bear with me as I explain. The short version: we shard across multiple Hikari pools backed by Aurora. In production, after adding a new delete query, we observed SQL exceptions and then entities being written to the wrong shard. A code review suggests the Hikari pool or the MariaDB driver returned a connection for the wrong shard.
Our application is built on top of Akka Persistence, which follows an event sourcing model. Each event is persisted as a new row in the database. Every 50 events (or journals) we create a snapshot of the full state. On recovery we replay the latest snapshot and all journals after that snapshot. At the application level we are sharding this data across multiple databases.
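As a rough sketch of that cadence (illustrative names only, not our production code), the snapshot and replay logic amounts to:

```scala
// Illustrative sketch of the snapshot cadence described above; the object and
// method names are hypothetical, not our production code.
object SnapshotPolicy {
  val SnapshotEvery = 50

  // A snapshot is taken after every 50th persisted event (journal row).
  def shouldSnapshot(sequenceNumber: Long): Boolean =
    sequenceNumber % SnapshotEvery == 0

  // On recovery, only journals written after the latest snapshot are replayed.
  def journalsToReplay(snapshotSeq: Long, journalSeqs: Seq[Long]): Seq[Long] =
    journalSeqs.filter(_ > snapshotSeq)
}
```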
Example schema:
CREATE TABLE `journal` (
  persistence_id VARCHAR(255) NOT NULL,
  sequence_number BIGINT NOT NULL,
  deleted BOOLEAN DEFAULT FALSE,
  created_at BIGINT NOT NULL,
  message MEDIUMBLOB NOT NULL,
  PRIMARY KEY(persistence_id, sequence_number)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
On 6/26 we released some new code to production, which enabled cleanup of the journal and snapshots.
Example query:
DELETE FROM journal WHERE persistence_id = ? AND sequence_number <= ?
We previously had manual processes in place to clean things up, but the tables were quite large, so some of the delete queries took a long time and we started observing communications link exceptions.
com.company.foo.persistence.PersistenceException: onPersistFailure
    at com.company.foo.persistence.PersistentActorExceptional$class.onPersistFailure(PersistentActorExceptional.scala:18)
    at com.company.foo.delta.PersistentDeltaActor.onPersistFailure(PersistentDeltaActor.scala:81)
    at akka.persistence.Eventsourced$ProcessingState$$anonfun$2.applyOrElse(Eventsourced.scala:627)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at akka.persistence.Eventsourced$$anon$2.stateReceive(Eventsourced.scala:690)
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
    at com.company.foo.delta.PersistentDeltaActor.aroundReceive(PersistentDeltaActor.scala:81)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run$$original(Mailbox.scala:224)
    at akka.dispatch.Mailbox.run(Mailbox.scala:29)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.sql.SQLException: (conn:23619227) Communications link failure with primary host foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com:3306. Could not send query: Broken pipe
  on HostAddress{host='foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com', port=3306},master=true. Driver has reconnect connection
    at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:150)
    at org.mariadb.jdbc.internal.util.ExceptionMapper.getException(ExceptionMapper.java:101)
    at org.mariadb.jdbc.internal.util.ExceptionMapper.throwAndLogException(ExceptionMapper.java:77)
    at org.mariadb.jdbc.MariaDbStatement.executeQueryEpilog(MariaDbStatement.java:226)
    at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:256)
    at org.mariadb.jdbc.MariaDbStatement.executeUpdate(MariaDbStatement.java:418)
    at org.mariadb.jdbc.MariaDbConnection.setAutoCommit(MariaDbConnection.java:590)
    at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:382)
    at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java)
    at com.company.bar.db.mysql.MySQLConnection.setAutoCommit(MySQLConnection.scala:19)
    at com.company.bar.db.mysql.ShardDataSource.connection(ShardDataSource.scala:24)
    at com.company.bar.akka.persistence.mysql.MySQLJournalPlugin$$anonfun$asyncWriteMessageToShard$2.apply(MySQLJournalPlugin.scala:118)
    at com.company.bar.akka.persistence.mysql.MySQLJournalPlugin$$anonfun$asyncWriteMessageToShard$2.apply(MySQLJournalPlugin.scala:114)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run$$original(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:35)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Communications link failure with primary host foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com:3306. Could not send query: Broken pipe
  on HostAddress{host='foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com', port=3306},master=true. Driver has reconnect connection
    at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:159)
    at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.mariadb.jdbc.internal.failover.AbstractMastersListener.invoke(AbstractMastersListener.java:355)
    at org.mariadb.jdbc.internal.failover.FailoverProxy.executeInvocation(FailoverProxy.java:200)
    at org.mariadb.jdbc.internal.failover.FailoverProxy.invoke(FailoverProxy.java:192)
    at com.sun.proxy.$Proxy27.executeQuery(Unknown Source)
    at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:246)
    ... 14 common frames omitted
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at org.mariadb.jdbc.internal.stream.PacketOutputStream.send(PacketOutputStream.java:935)
    at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:151)
    ... 22 common frames omitted
At the same time these exceptions were occurring, we started seeing entities being written to the wrong shards. We did an in-depth analysis of our code and have not been able to find any bugs that would cause this. All application logs indicate that our shards maintained their correct order and that every entity was routed to the correct shard.
At this point it appears to be caused by either our usage of HikariCP or the MariaDB driver with Aurora failover enabled. Our best guess is that, when an exception occurred, either Hikari returned a connection to the wrong pool, or the MariaDB driver handed the pool a reconnected connection pointing at a different shard than the one specified in the config.
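One way to confirm or rule that out at runtime (a sketch only; `verifyShard` and `expectedDbName` are hypothetical, assuming each shard config knows its database name) would be to check which database a checked-out connection is actually attached to before using it:

```scala
import java.sql.Connection

// Hypothetical runtime check: fail fast if the pool hands back a connection
// attached to a different database than this shard expects.
// Connection.getCatalog returns the current database on MySQL-family drivers.
def verifyShard(connection: Connection, expectedDbName: String): Connection = {
  val actual = connection.getCatalog
  if (actual != expectedDbName)
    throw new IllegalStateException(
      s"Pool returned a connection for '$actual' but this shard expects '$expectedDbName'")
  connection
}
```

If the wrong-shard hypothesis is right, a check like this at `ShardDataSource.connection` would have turned silent cross-shard writes into loud failures.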
Any insights or help tracking down this issue would be greatly appreciated. Below are some more technical details.
Java 8u102
Scala 2.11.8
Akka 2.4.17
HikariCP 2.4.7
MariaDB Connector/J 1.5.9
Hashing code:
private val pools: Vector[ShardDataSource] =
  databases.map(c => new ShardDataSource(tableName, c)).toVector

private val shardCount = pools.size

/**
 * This returns a connection pool associated with the persistenceId
 */
override def getEntityShard(hashKey: String): ShardDataSource = {
  pools(math.abs(hashKey.hashCode) % shardCount)
}
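For reference, that selection boils down to the pure function below (a shard count of 4 is illustrative). The index depends only on the persistenceId's `hashCode`, so absent a mixup at the pool or driver level, the same id must always land on the same shard:

```scala
// Standalone version of the shard-selection arithmetic above.
// The result is deterministic: same hashKey, same shard index.
def shardIndex(hashKey: String, shardCount: Int): Int =
  math.abs(hashKey.hashCode) % shardCount

// Example: repeated lookups for one id always agree.
val idx = shardIndex("persistence-id-42", 4)
```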
Connection settings:
def toHikari: HikariConfig = {
  val hikari = new HikariConfig()
  val jdbcUrl = s"jdbc:mariadb:aurora://$host/$dbName"
  log.info(marker("HikariConfig"), s"Connecting to $jdbcUrl")
  hikari.setJdbcUrl(jdbcUrl)
  hikari.setUsername(user)
  hikari.setPassword(password)
  MySQLPersistenceConfig.configure(hikari)
  hikari.setIdleTimeout(150000)
  hikari.setMaxLifetime(300000)
  hikari.setMaximumPoolSize(maxConnections.getOrElse(ConnectionSettings.defaultMaxConnections))

  hikari
}
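Not part of our current code, but while chasing this we realized it would help to tag each pool with its shard via `HikariConfig.setPoolName` (standard Hikari API), so Hikari's own log lines identify which shard a connection belongs to. A hypothetical one-line addition to `toHikari` (assuming a `shardIndex` value is in scope alongside `dbName`):

```scala
// Hypothetical addition inside toHikari above: name the pool after its shard
// so Hikari log lines and JMX beans are attributable to a specific shard.
hikari.setPoolName(s"shard-$shardIndex-$dbName")
```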
Issue Links
- relates to CONJ-419: Insert Query failed after failover and not able to get master connection (Closed)