MariaDB Connector/J / CONJ-502

Aurora JDBC connections leaking between connection pools

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.9
    • Fix Version/s: 2.1.0, 1.6.3
    • Component/s: Failover
    • Labels:
      None

      Description

      This bug report is going to be a little light on concrete steps to reproduce, since the behavior has been incredibly hard to duplicate in staging, so please bear with me as I explain. The tl;dr: we shard across multiple HikariCP pools backed by Aurora. In production we observed SQL exceptions after adding a new delete query, and then observed entities being written to the wrong shard. After a code review, this most likely points to the HikariCP pool or the MariaDB driver returning the wrong connection.

      Our application is built on top of Akka Persistence, which follows an event sourcing model. Each event is persisted as a new row in the database. Every 50 events (or journals) we create a snapshot of the full state. On recovery we replay the latest snapshot and all journals after that snapshot. At the application level we are sharding this data across multiple databases.
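To make the recovery model concrete, here is a minimal, hypothetical sketch (not our actual code): on recovery, state is rebuilt by applying every journal event persisted after the latest snapshot, in sequence order.

```java
import java.util.List;
import java.util.function.BiFunction;

public class Recovery {
    // Rebuilds entity state from the latest snapshot plus all journal
    // events persisted after that snapshot, applied in sequence order.
    static <S, E> S recover(S snapshot, List<E> eventsAfterSnapshot,
                            BiFunction<S, E, S> applyEvent) {
        S state = snapshot;
        for (E event : eventsAfterSnapshot) {
            state = applyEvent.apply(state, event);
        }
        return state;
    }
}
```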

      Example schema:

      CREATE TABLE `journal` (
        persistence_id VARCHAR(255) NOT NULL,
        sequence_number BIGINT NOT NULL,
        deleted BOOLEAN DEFAULT FALSE,
        created_at BIGINT NOT NULL,
        message MEDIUMBLOB NOT NULL,
        PRIMARY KEY(persistence_id, sequence_number)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
      

      On 6/26 we released some new code to production, which enabled cleanup of the journal and snapshots.

      Example query:

      DELETE FROM journal WHERE persistence_id = ? AND sequence_number <= ?
      

      We had manual processes in place to clean things up, but the tables were quite large, so some of the delete queries took a long time and we started observing communications link failure exceptions.
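One mitigation we are considering (a sketch, assuming Aurora MySQL's support for `DELETE ... LIMIT` on single-table deletes): run the cleanup in small batches so no single statement holds a long-running transaction. The batch executor below is hypothetical; in real code `deleteBatch` would call `executeUpdate` on a statement like `DELETE FROM journal WHERE persistence_id = ? AND sequence_number <= ? LIMIT ?` and return the affected-row count.

```java
import java.util.function.IntUnaryOperator;

public class ChunkedDelete {
    // Repeatedly deletes up to batchSize rows until a batch comes back
    // short, meaning no matching rows remain. Small batches keep each
    // statement (and the locks it holds) short-lived.
    static long deleteInChunks(IntUnaryOperator deleteBatch, int batchSize) {
        long total = 0;
        int affected;
        do {
            affected = deleteBatch.applyAsInt(batchSize);
            total += affected;
        } while (affected == batchSize);
        return total;
    }
}
```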

      com.company.foo.persistence.PersistenceException: onPersistFailure
        at com.company.foo.persistence.PersistentActorExceptional$class.onPersistFailure(PersistentActorExceptional.scala:18)
        at com.company.foo.delta.PersistentDeltaActor.onPersistFailure(PersistentDeltaActor.scala:81)
        at akka.persistence.Eventsourced$ProcessingState$$anonfun$2.applyOrElse(Eventsourced.scala:627)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at akka.persistence.Eventsourced$$anon$2.stateReceive(Eventsourced.scala:690)
        at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
        at com.company.foo.delta.PersistentDeltaActor.aroundReceive(PersistentDeltaActor.scala:81)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run$$original(Mailbox.scala:224)
        at akka.dispatch.Mailbox.run(Mailbox.scala:29)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      Caused by: java.sql.SQLException: (conn:23619227) Communications link failure with primary host foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com:3306. Could not send query: Broken pipe
      on HostAddress{host='foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com', port=3306},master=true.  Driver has reconnect connection
        at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:150)
        at org.mariadb.jdbc.internal.util.ExceptionMapper.getException(ExceptionMapper.java:101)
        at org.mariadb.jdbc.internal.util.ExceptionMapper.throwAndLogException(ExceptionMapper.java:77)
        at org.mariadb.jdbc.MariaDbStatement.executeQueryEpilog(MariaDbStatement.java:226)
        at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:256)
        at org.mariadb.jdbc.MariaDbStatement.executeUpdate(MariaDbStatement.java:418)
        at org.mariadb.jdbc.MariaDbConnection.setAutoCommit(MariaDbConnection.java:590)
        at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:382)
        at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java)
        at com.company.bar.db.mysql.MySQLConnection.setAutoCommit(MySQLConnection.scala:19)
        at com.company.bar.db.mysql.ShardDataSource.connection(ShardDataSource.scala:24)
        at com.company.bar.akka.persistence.mysql.MySQLJournalPlugin$$anonfun$asyncWriteMessageToShard$2.apply(MySQLJournalPlugin.scala:118)
        at com.company.bar.akka.persistence.mysql.MySQLJournalPlugin$$anonfun$asyncWriteMessageToShard$2.apply(MySQLJournalPlugin.scala:114)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run$$original(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:35)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
      Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Communications link failure with primary host foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com:3306. Could not send query: Broken pipe
      on HostAddress{host='foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com', port=3306},master=true.  Driver has reconnect connection
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:159)
        at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.mariadb.jdbc.internal.failover.AbstractMastersListener.invoke(AbstractMastersListener.java:355)
        at org.mariadb.jdbc.internal.failover.FailoverProxy.executeInvocation(FailoverProxy.java:200)
        at org.mariadb.jdbc.internal.failover.FailoverProxy.invoke(FailoverProxy.java:192)
        at com.sun.proxy.$Proxy27.executeQuery(Unknown Source)
        at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:246)
        ... 14 common frames omitted
      Caused by: java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.mariadb.jdbc.internal.stream.PacketOutputStream.send(PacketOutputStream.java:935)
        at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:151)
        ... 22 common frames omitted
      

      At the same time these exceptions were occurring, we started seeing entities being written to the wrong shards. We did an in-depth analysis of our code and have not been able to find any bugs that would cause entities to be written to the wrong shard. All logs indicate that our shards maintained their correct order and that all entities were being routed to the correct shard.

      At this point it appears to be caused either by our usage of HikariCP or by the MariaDB driver with Aurora failover enabled. Our best guess is that when an exception occurred, either HikariCP returned a connection to the wrong pool, or the MariaDB driver handed the pool a new connection pointing at a different shard than the one specified by the config.

      Any insights or help tracking down this issue would be greatly appreciated. Below are some more technical details.

      Java 8u102
      Scala 2.11.8
      Akka 2.4.17
      HikariCP 2.4.7
      MariaDB Connector/J 1.5.9

      Hashing code:

        private val pools: Vector[ShardDataSource] = databases.map(c => new ShardDataSource(tableName, c)).toVector
        private val shardCount = pools.size
       
        /**
          * This returns a connection pool associated with the persistenceId
          */
        override def getEntityShard(hashKey: String): ShardDataSource = {
          pools(math.abs(hashKey.hashCode) % shardCount)
        }
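As a side note on the routing arithmetic itself (unrelated to this bug as far as we can tell): `math.abs(hashKey.hashCode) % shardCount` has a known edge case, since `math.abs(Int.MinValue)` is still negative. A Java sketch of an equivalent mapping that avoids it, using `Math.floorMod`:

```java
public class ShardRouting {
    // Maps a persistenceId to a shard index in [0, shardCount).
    // Math.floorMod always yields a non-negative result, so even a
    // hashCode of Integer.MIN_VALUE (where Math.abs stays negative)
    // lands on a valid shard.
    static int shardIndex(String persistenceId, int shardCount) {
        return Math.floorMod(persistenceId.hashCode(), shardCount);
    }
}
```

Note that switching the mapping on a live system would re-route existing persistenceIds, so this is only a sketch of the arithmetic, not a drop-in change.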
      

      Connection settings:

          def toHikari: HikariConfig = {
            val hikari = new HikariConfig()
            val jdbcUrl = s"jdbc:mariadb:aurora://$host/$dbName"
            log.info(marker("HikariConfig"), s"Connecting to $jdbcUrl")
            hikari.setJdbcUrl(jdbcUrl)
            hikari.setUsername(user)
            hikari.setPassword(password)
            MySQLPersistenceConfig.configure(hikari)
            hikari.setIdleTimeout(150000)
            hikari.setMaxLifetime(300000)
            hikari.setMaximumPoolSize(maxConnections.getOrElse(ConnectionSettings.defaultMaxConnections))
       
            hikari
          }
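To help confirm or rule out the wrong-connection theory, one diagnostic we are considering (a hypothetical sketch, not production code) is a checkout-time assertion: each shard's data source compares the connection's current catalog against the `dbName` that shard was configured with, and rejects mismatches instead of handing out a cross-shard connection.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical guard around one shard's Hikari pool: every checkout
// verifies the physical connection still points at this shard's schema.
public class VerifyingShard {
    private final DataSource pool;   // the HikariDataSource for this shard
    private final String expectedDb; // dbName from this shard's config

    public VerifyingShard(DataSource pool, String expectedDb) {
        this.pool = pool;
        this.expectedDb = expectedDb;
    }

    public Connection connection() throws SQLException {
        Connection c = pool.getConnection();
        String actual = c.getCatalog(); // schema this connection is using
        if (!expectedDb.equalsIgnoreCase(actual)) {
            c.close(); // return it rather than use a cross-shard connection
            throw new SQLException(
                "Shard mismatch: expected " + expectedDb + " but got " + actual);
        }
        return c;
    }
}
```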
      


      People

      • Assignee: Diego Dupin
      • Reporter: Joshua Garnett
      • Votes: 1
      • Watchers: 3
