[CONJ-420] High CPU usage against Aurora Created: 2017-01-30  Updated: 2017-02-10  Resolved: 2017-02-10

Status: Closed
Project: MariaDB Connector/J
Component/s: Failover
Affects Version/s: 1.5.7
Fix Version/s: 1.5.8

Type: Bug Priority: Critical
Reporter: Kris Iyer Assignee: Diego Dupin
Resolution: Fixed Votes: 0
Labels: None

Attachments: PNG File ec2_1.png     PNG File ec2_2.png    

 Description   

Observed a mariadb daemon thread leak when using the Aurora protocol with the cluster endpoint. It appears to happen exactly after 2 hours, at least on every instance where I have been able to reproduce the issue. After 2 hours of runtime, the mariadb Java client spawns numerous mariadb threads, causing severe problems including excessive garbage collection on the JVM. I was able to capture at least 10-15 new threads a second, resulting in 10-15 minor collections a second on the JVM.

The issue is not seen when running against the mysql protocol (even against an Aurora database). It only happens when using the aurora protocol against the Aurora database (jdbc:mysql:aurora://cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/db).

Sample Thread dumps captured from a JVM that had been in this state for more than 12 hours:
"mariaDb-2-117978934" daemon prio=10 tid=0x00007fa14c01b800 nid=0x29b3 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117980090" daemon prio=10 tid=0x00007fa14c01b800 nid=0x2e43 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117980089" daemon prio=10 tid=0x00007fa14c143800 nid=0x2e42 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117981286" daemon prio=10 tid=0x00007fa14c01b800 nid=0x32fd runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117982489" daemon prio=10 tid=0x00007fa14c01b800 nid=0x37bc runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117982488" daemon prio=10 tid=0x00007fa14c176800 nid=0x37bb runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117983690" daemon prio=10 tid=0x00007fa14c01b800 nid=0x3c80 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

"mariaDb-2-117983689" daemon prio=10 tid=0x00007fa14c143800 nid=0x3c7f runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:

  • None

Notice that the daemon threads keep incrementing indefinitely.
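One quick way to track this growth from inside the application (rather than by diffing full thread dumps) is to count live threads by the name prefix seen above. A minimal sketch using only the JDK; the "mariaDb-" prefix is taken from the thread names in this report, and the named sleeper thread below only simulates a leaked connector thread:

```java
public class MariaDbThreadCount {
    // Count live JVM threads whose names start with the connector's prefix.
    public static long count() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith("mariaDb-"))
                .count();
    }

    public static void main(String[] args) {
        // Simulate one leaked daemon thread so the counter has something to find.
        Thread t = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, "mariaDb-2-117978934");
        t.setDaemon(true);
        t.start();
        System.out.println("mariaDb threads: " + count());
    }
}
```

Polling this counter once a minute would make the "several threads per second" growth rate directly visible in application metrics.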

Sample Garbage Collection Logs:

64677.059: [GC [PSYoungGen: 86560K->32K(87040K)] 155549K->69021K(227328K), 0.0032420 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
64677.159: [GC [PSYoungGen: 86560K->64K(87040K)] 155549K->69053K(227328K), 0.0033700 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64677.233: [GC [PSYoungGen: 86592K->32K(87040K)] 155581K->69021K(227328K), 0.0037880 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64677.298: [GC [PSYoungGen: 86560K->64K(87040K)] 155549K->69053K(227328K), 0.0273960 secs] [Times: user=0.04 sys=0.01, real=0.03 secs]
64677.385: [GC [PSYoungGen: 86592K->64K(87040K)] 155581K->69061K(227328K), 0.0155870 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
64677.458: [GC [PSYoungGen: 86592K->64K(87040K)] 155589K->69061K(227328K), 0.0032730 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
64677.517: [GC [PSYoungGen: 86592K->32K(87040K)] 155589K->69029K(227328K), 0.0153590 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
64677.613: [GC [PSYoungGen: 86560K->32K(87040K)] 155557K->69029K(227328K), 0.0033720 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
64677.684: [GC [PSYoungGen: 86560K->64K(87040K)] 155557K->69061K(227328K), 0.0113370 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
64677.747: [GC [PSYoungGen: 86592K->32K(87040K)] 155589K->69029K(227328K), 0.0155070 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
64677.835: [GC [PSYoungGen: 86560K->32K(87040K)] 155557K->69029K(227328K), 0.0032410 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
64677.903: [GC [PSYoungGen: 86560K->32K(87040K)] 155557K->69029K(227328K), 0.0263050 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
64677.989: [GC [PSYoungGen: 86560K->64K(87040K)] 155557K->69061K(227328K), 0.0032160 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64678.052: [GC [PSYoungGen: 86592K->32K(87040K)] 155589K->69029K(227328K), 0.0036990 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
64678.112: [GC [PSYoungGen: 86560K->32K(87040K)] 155557K->69029K(227328K), 0.0193080 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
64678.183: [GC [PSYoungGen: 86560K->64K(87040K)] 155557K->69061K(227328K), 0.0029340 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
64678.258: [GC [PSYoungGen: 86592K->64K(87040K)] 155589K->69061K(227328K), 0.0275680 secs] [Times: user=0.02 sys=0.01, real=0.03 secs]
64678.340: [GC [PSYoungGen: 86592K->64K(87040K)] 155589K->69061K(227328K), 0.0265170 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]
64678.451: [GC [PSYoungGen: 86592K->64K(87040K)] 155589K->69069K(227328K), 0.0033880 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
64678.506: [GC [PSYoungGen: 86592K->64K(87040K)] 155597K->69069K(227328K), 0.0305550 secs] [Times: user=0.03 sys=0.01, real=0.03 secs]
64678.585: [GC [PSYoungGen: 86592K->32K(87040K)] 155597K->69037K(227328K), 0.0033300 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
64678.669: [GC [PSYoungGen: 86560K->64K(87040K)] 155565K->69069K(227328K), 0.0032000 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
64678.732: [GC [PSYoungGen: 86592K->64K(87040K)] 155597K->69069K(227328K), 0.0191150 secs] [Times: user=0.01 sys=0.00, real=0.02 secs]
64678.807: [GC [PSYoungGen: 86592K->32K(87040K)] 155597K->69037K(227328K), 0.0033350 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64678.895: [GC [PSYoungGen: 86560K->64K(87040K)] 155565K->69069K(227328K), 0.0033410 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64678.991: [GC [PSYoungGen: 86592K->32K(87040K)] 155597K->69037K(227328K), 0.0032990 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
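The collection frequency and cumulative pause time in a log like the one above can be quantified with a small parser. A sketch under the assumption that the lines follow the `[Times: user=... sys=..., real=... secs]` format shown here:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogSummary {
    // Matches the wall-clock pause reported at the end of each GC log line.
    private static final Pattern REAL = Pattern.compile("real=([0-9.]+) secs");

    // Sum the "real" pause times over all lines that report one.
    public static double totalRealSeconds(List<String> lines) {
        double total = 0;
        for (String line : lines) {
            Matcher m = REAL.matcher(line);
            if (m.find()) total += Double.parseDouble(m.group(1));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "64677.059: [GC [PSYoungGen: 86560K->32K(87040K)] 155549K->69021K(227328K), 0.0032420 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]",
            "64677.298: [GC [PSYoungGen: 86560K->64K(87040K)] 155549K->69053K(227328K), 0.0273960 secs] [Times: user=0.04 sys=0.01, real=0.03 secs]");
        System.out.printf("%d collections, %.2f s real pause%n",
                sample.size(), totalRealSeconds(sample));
    }
}
```

Individual pauses here are small; the problem is their frequency (roughly one collection every 60-100 ms by the timestamps), which adds up to the sustained CPU spikes shown in the attached snapshots.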

Please find attached CPU snapshots from 2 different application servers, where you can see garbage collection causing CPU spikes exactly after 2 hours of uptime (the servers started at 04:00 UTC and the spikes start at 06:00 UTC).



 Comments   
Comment by Diego Dupin [ 2017-01-31 ]

Hi Kris.
Thanks for reporting this.

I'll investigate this, but at first glance it seems this could only occur if the same connection is shared by multiple threads, with one thread closing the connection while another is using it for a long query. Can you confirm that you have this kind of usage?
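The usage pattern asked about here, one thread closing a shared connection while another is still mid-query, can be sketched as follows. FakeConnection is a hypothetical stand-in for illustration, not the driver's API; a real JDBC connection would throw SQLException in the same situation:

```java
import java.util.concurrent.CountDownLatch;

public class SharedConnectionRace {
    // Minimal stand-in for a JDBC connection shared between threads.
    static class FakeConnection {
        private volatile boolean closed = false;
        void close() { closed = true; }
        void query() {
            if (closed) throw new IllegalStateException("connection already closed");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeConnection conn = new FakeConnection();
        CountDownLatch closedLatch = new CountDownLatch(1);

        // Pool/validator thread closes the connection out from under the worker.
        Thread closer = new Thread(() -> {
            conn.close();
            closedLatch.countDown();
        });
        closer.start();

        closedLatch.await();   // make the interleaving deterministic for the sketch
        try {
            conn.query();      // worker thread still tries to use the connection
        } catch (IllegalStateException e) {
            System.out.println("race outcome: " + e.getMessage());
        }
    }
}
```

In a real pool the interleaving is nondeterministic, which is what makes this class of bug intermittent and hard to reproduce.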

Comment by Kris Iyer [ 2017-01-31 ]

Hi Diego,

The web applications I have been testing with do use a connection pool. I am running the JBoss AS7 and WildFly application servers, which provide a standard datasource with pooling, connection validation, etc.

Here is what a sample configuration looks like:

<datasource jndi-name="java:/testDS" pool-name="testDS" enabled="true" use-ccm="true">
    <connection-url>jdbc:mysql:aurora://cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/db</connection-url>
    <driver-class>org.mariadb.jdbc.Driver</driver-class>
    <driver>mariadb</driver>
    <transaction-isolation>TRANSACTION_READ_COMMITTED</transaction-isolation>
    <pool>
        <min-pool-size>10</min-pool-size>
        <max-pool-size>150</max-pool-size>
        <use-strict-min>false</use-strict-min>
        <flush-strategy>FailingConnectionOnly</flush-strategy>
    </pool>
    <security>
        <user-name>foo</user-name>
        <password>bar</password>
    </security>
    <validation>
        <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
        <validate-on-match>false</validate-on-match>
        <background-validation>true</background-validation>
        <background-validation-millis>60000</background-validation-millis>
        <use-fast-fail>true</use-fast-fail>
    </validation>
</datasource>

Please let me know if you need any more information.

Comment by Diego Dupin [ 2017-02-10 ]

Hi Kris,

The "mariaDb-2-*" threads are there on purpose: they form a pool of threads that takes care of graceful failover, and their number grows with the number of active connections.
The issue is identified: the design should not allow this many of them.
(By the way, I'll change the thread name to be more understandable.)

Comment by Diego Dupin [ 2017-02-10 ]

>Issue is identified
I might have been optimistic. Can you confirm my assumptions:

There is an option, enabled only for Aurora, where a race condition can lead to this issue.
To hit this particular case, the pool must have had some connections, and then spend 2 complete hours without any connection in the pool (which seems possible in your case since use-strict-min is set to false). At that point a race condition becomes possible, leading to the CPU burst.

Can you confirm your application didn't have any activity at that particular moment?

I'll correct this issue. Until the correction is released, keeping at least one connection in the pool would work around the problem:

<pool>
    <min-pool-size>1</min-pool-size>
    <max-pool-size>150</max-pool-size>
    <use-strict-min>true</use-strict-min>
    <flush-strategy>FailingConnectionOnly</flush-strategy>
</pool>

Comment by Kris Iyer [ 2017-02-10 ]

I am pretty sure this only happens when the system is idling. However, once in that state, even generating some traffic and creating more connections in the pool does not recover it instantly; it can take an additional hour or two, depending on what's already been created in the JVM, to clear out. I'll give this workaround a test and post my findings today. Thanks for the update.

Comment by Kris Iyer [ 2017-02-10 ]

Confirmed the following:

1. The workaround of setting use-strict-min to true on the connection pool does help get around this issue.
2. The fixes posted under 1.5.8 work as intended.

Thank you for the quick turnaround!
