Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 1.5.9
- Fix Version/s: None
Description
This report is going to be a little light on concrete steps to duplicate the behavior, as it has been incredibly hard to reproduce in staging, so please bear with me as I explain. The short version: we shard across multiple Hikari pools backed by Aurora. In production, after adding a new delete query, we observed SQL exceptions and then entities being written to the wrong shard. A code review suggests the Hikari pool or the MariaDB driver returned a connection for the wrong shard.
Our application is built on top of Akka Persistence, which follows an event sourcing model. Each event is persisted as a new row in the database. Every 50 events (or journals) we create a snapshot of the full state. On recovery we replay the latest snapshot and all journals after that snapshot. At the application level we are sharding this data across multiple databases.
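As a rough sketch of that cadence (illustrative names only, not our production code), the snapshot and replay logic amounts to:

```scala
// Illustrative sketch of the snapshot cadence described above; the object and
// method names are hypothetical, not our production code.
object SnapshotPolicy {
  val SnapshotEvery = 50

  // A snapshot is taken after every 50th persisted event (journal row).
  def shouldSnapshot(sequenceNumber: Long): Boolean =
    sequenceNumber % SnapshotEvery == 0

  // On recovery, only journals written after the latest snapshot are replayed.
  def journalsToReplay(snapshotSeq: Long, journalSeqs: Seq[Long]): Seq[Long] =
    journalSeqs.filter(_ > snapshotSeq)
}
```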
Example schema:
CREATE TABLE `journal` (
  persistence_id VARCHAR(255) NOT NULL,
  sequence_number BIGINT NOT NULL,
  deleted BOOLEAN DEFAULT FALSE,
  created_at BIGINT NOT NULL,
  message MEDIUMBLOB NOT NULL,
  PRIMARY KEY(persistence_id, sequence_number)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
On 6/26 we released some new code to production, which enabled cleanup of the journal and snapshots.
Example query:
DELETE FROM journal WHERE persistence_id = ? AND sequence_number <= ?
We previously had manual processes in place to clean things up, but the tables were quite large, so some of the delete queries took a long time and we started observing communications link exceptions.
com.company.foo.persistence.PersistenceException: onPersistFailure
    at com.company.foo.persistence.PersistentActorExceptional$class.onPersistFailure(PersistentActorExceptional.scala:18)
    at com.company.foo.delta.PersistentDeltaActor.onPersistFailure(PersistentDeltaActor.scala:81)
    at akka.persistence.Eventsourced$ProcessingState$$anonfun$2.applyOrElse(Eventsourced.scala:627)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at akka.persistence.Eventsourced$$anon$2.stateReceive(Eventsourced.scala:690)
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
    at com.company.foo.delta.PersistentDeltaActor.aroundReceive(PersistentDeltaActor.scala:81)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run$$original(Mailbox.scala:224)
    at akka.dispatch.Mailbox.run(Mailbox.scala:29)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.sql.SQLException: (conn:23619227) Communications link failure with primary host foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com:3306. Could not send query: Broken pipe
  on HostAddress{host='foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com', port=3306},master=true. Driver has reconnect connection
    at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:150)
    at org.mariadb.jdbc.internal.util.ExceptionMapper.getException(ExceptionMapper.java:101)
    at org.mariadb.jdbc.internal.util.ExceptionMapper.throwAndLogException(ExceptionMapper.java:77)
    at org.mariadb.jdbc.MariaDbStatement.executeQueryEpilog(MariaDbStatement.java:226)
    at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:256)
    at org.mariadb.jdbc.MariaDbStatement.executeUpdate(MariaDbStatement.java:418)
    at org.mariadb.jdbc.MariaDbConnection.setAutoCommit(MariaDbConnection.java:590)
    at com.zaxxer.hikari.pool.ProxyConnection.setAutoCommit(ProxyConnection.java:382)
    at com.zaxxer.hikari.pool.HikariProxyConnection.setAutoCommit(HikariProxyConnection.java)
    at com.company.bar.db.mysql.MySQLConnection.setAutoCommit(MySQLConnection.scala:19)
    at com.company.bar.db.mysql.ShardDataSource.connection(ShardDataSource.scala:24)
    at com.company.bar.akka.persistence.mysql.MySQLJournalPlugin$$anonfun$asyncWriteMessageToShard$2.apply(MySQLJournalPlugin.scala:118)
    at com.company.bar.akka.persistence.mysql.MySQLJournalPlugin$$anonfun$asyncWriteMessageToShard$2.apply(MySQLJournalPlugin.scala:114)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run$$original(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:35)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Communications link failure with primary host foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com:3306. Could not send query: Broken pipe
  on HostAddress{host='foo-aurora-production-cluster-1-0.abcdefg.us-west-2.rds.amazonaws.com', port=3306},master=true. Driver has reconnect connection
    at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:159)
    at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.mariadb.jdbc.internal.failover.AbstractMastersListener.invoke(AbstractMastersListener.java:355)
    at org.mariadb.jdbc.internal.failover.FailoverProxy.executeInvocation(FailoverProxy.java:200)
    at org.mariadb.jdbc.internal.failover.FailoverProxy.invoke(FailoverProxy.java:192)
    at com.sun.proxy.$Proxy27.executeQuery(Unknown Source)
    at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:246)
    ... 14 common frames omitted
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at org.mariadb.jdbc.internal.stream.PacketOutputStream.send(PacketOutputStream.java:935)
    at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:151)
    ... 22 common frames omitted
At the same time these exceptions were occurring, we started seeing entities being written to the wrong shards. We did an in-depth analysis of our code and have not been able to find any bugs that would cause this. All application logs indicate that our shards maintained their correct order and that every entity was routed to the correct shard.
At this point it appears to be caused by either our usage of HikariCP or the MariaDB driver with Aurora failover enabled. Our best guess is that, when an exception occurred, either Hikari returned a connection to the wrong pool, or the MariaDB driver handed the pool a reconnected connection pointing at a different shard than the one specified in the config.
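One way to confirm or rule that out at runtime (a sketch only; `verifyShard` and `expectedDbName` are hypothetical, assuming each shard config knows its database name) would be to check which database a checked-out connection is actually attached to before using it:

```scala
import java.sql.Connection

// Hypothetical runtime check: fail fast if the pool hands back a connection
// attached to a different database than this shard expects.
// Connection.getCatalog returns the current database on MySQL-family drivers.
def verifyShard(connection: Connection, expectedDbName: String): Connection = {
  val actual = connection.getCatalog
  if (actual != expectedDbName)
    throw new IllegalStateException(
      s"Pool returned a connection for '$actual' but this shard expects '$expectedDbName'")
  connection
}
```

If the wrong-shard hypothesis is right, a check like this at `ShardDataSource.connection` would have turned silent cross-shard writes into loud failures.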
Any insights or help tracking down this issue would be greatly appreciated. Below are some more technical details.
Java 8u102
Scala 2.11.8
Akka 2.4.17
HikariCP 2.4.7
MariaDB Connector/J 1.5.9
Hashing code:
private val pools: Vector[ShardDataSource] =
  databases.map(c => new ShardDataSource(tableName, c)).toVector

private val shardCount = pools.size

/**
 * This returns a connection pool associated with the persistenceId
 */
override def getEntityShard(hashKey: String): ShardDataSource = {
  pools(math.abs(hashKey.hashCode) % shardCount)
}
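For reference, that selection boils down to the pure function below (a shard count of 4 is illustrative). The index depends only on the persistenceId's `hashCode`, so absent a mixup at the pool or driver level, the same id must always land on the same shard:

```scala
// Standalone version of the shard-selection arithmetic above.
// The result is deterministic: same hashKey, same shard index.
def shardIndex(hashKey: String, shardCount: Int): Int =
  math.abs(hashKey.hashCode) % shardCount

// Example: repeated lookups for one id always agree.
val idx = shardIndex("persistence-id-42", 4)
```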
Connection settings:
def toHikari: HikariConfig = {
  val hikari = new HikariConfig()
  val jdbcUrl = s"jdbc:mariadb:aurora://$host/$dbName"
  log.info(marker("HikariConfig"), s"Connecting to $jdbcUrl")
  hikari.setJdbcUrl(jdbcUrl)
  hikari.setUsername(user)
  hikari.setPassword(password)
  MySQLPersistenceConfig.configure(hikari)
  hikari.setIdleTimeout(150000)
  hikari.setMaxLifetime(300000)
  hikari.setMaximumPoolSize(maxConnections.getOrElse(ConnectionSettings.defaultMaxConnections))

  hikari
}
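Not part of our current code, but while chasing this we realized it would help to tag each pool with its shard via `HikariConfig.setPoolName` (standard Hikari API), so Hikari's own log lines identify which shard a connection belongs to. A hypothetical one-line addition to `toHikari` (assuming a `shardIndex` value is in scope alongside `dbName`):

```scala
// Hypothetical addition inside toHikari above: name the pool after its shard
// so Hikari log lines and JMX beans are attributable to a specific shard.
hikari.setPoolName(s"shard-$shardIndex-$dbName")
```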
Issue Links
- relates to CONJ-419: Insert Query failed after failover and not able to get master connection (Closed)