[CONJ-420] High CPU usage against Aurora Created: 2017-01-30 Updated: 2017-02-10 Resolved: 2017-02-10 |
|
| Status: | Closed |
| Project: | MariaDB Connector/J |
| Component/s: | Failover |
| Affects Version/s: | 1.5.7 |
| Fix Version/s: | 1.5.8 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Kris Iyer | Assignee: | Diego Dupin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Description |
|
Observed a MariaDB daemon thread leak when using the Aurora protocol with the cluster endpoint. It appears to happen exactly after 2 hours, at least on every instance on which I have been able to reproduce the issue. After 2 hours of runtime, the MariaDB Java client spawns numerous mariaDb threads, causing severe problems including excessive garbage collection on the JVM. I was able to capture at least 10-15 new threads a second, resulting in 10-15 minor collections on the JVM. The issue is not seen when running against the mysql protocol (even against an Aurora database); it only happens when using the aurora protocol against the Aurora database (jdbc:mysql:aurora://cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/db).

Sample thread dumps captured from a JVM that had been in this state for more than 12 hours:

"mariaDb-2-117980090" daemon prio=10 tid=0x00007fa14c01b800 nid=0x2e43 runnable [0x0000000000000000]
   Locked ownable synchronizers:
"mariaDb-2-117980089" daemon prio=10 tid=0x00007fa14c143800 nid=0x2e42 runnable [0x0000000000000000]
   Locked ownable synchronizers:
"mariaDb-2-117981286" daemon prio=10 tid=0x00007fa14c01b800 nid=0x32fd runnable [0x0000000000000000]
   Locked ownable synchronizers:
"mariaDb-2-117982489" daemon prio=10 tid=0x00007fa14c01b800 nid=0x37bc runnable [0x0000000000000000]
   Locked ownable synchronizers:
"mariaDb-2-117982488" daemon prio=10 tid=0x00007fa14c176800 nid=0x37bb runnable [0x0000000000000000]
   Locked ownable synchronizers:
"mariaDb-2-117983690" daemon prio=10 tid=0x00007fa14c01b800 nid=0x3c80 runnable [0x0000000000000000]
   Locked ownable synchronizers:
"mariaDb-2-117983689" daemon prio=10 tid=0x00007fa14c143800 nid=0x3c7f runnable [0x0000000000000000]
   Locked ownable synchronizers:

Notice that the daemon thread numbers keep incrementing indefinitely.

Sample garbage collection log:

64677.059: [GC [PSYoungGen: 86560K->32K(87040K)] 155549K->69021K(227328K), 0.0032420 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

Please find attached CPU snapshots from 2 different application servers, where you can see garbage collection causing CPU spikes exactly after 2 hours of uptime (the servers started at 04:00 UTC and the spikes start at 06:00 UTC).
| Comments |
| Comment by Diego Dupin [ 2017-01-31 ] |
|
Hi Kris. I'll investigate this, but at first glance it seems this could only occur if the same connection is used by multiple threads, with one thread closing the connection while another uses it for a long query. Can you confirm that you have this kind of use?
| Comment by Kris Iyer [ 2017-01-31 ] |
|
Hi Diego, The web applications I have been testing with do use a connection pool. I am running JBoss AS7 and WildFly application servers, which provide a standard datasource that includes pooling, connection validation, etc. Here is what a sample configuration looks like:
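(The configuration itself did not survive this export. The sketch below is a representative WildFly-style datasource definition assembled for illustration; the JNDI name, pool sizes, and credentials are placeholders, not the reporter's actual settings.)

```xml
<datasource jndi-name="java:jboss/datasources/AuroraDS" pool-name="AuroraDS" enabled="true">
    <!-- Aurora failover mode is selected by the "aurora" keyword in the URL -->
    <connection-url>jdbc:mysql:aurora://cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/db</connection-url>
    <driver>mariadb</driver>
    <pool>
        <!-- Illustrative sizes only -->
        <min-pool-size>0</min-pool-size>
        <max-pool-size>20</max-pool-size>
    </pool>
    <validation>
        <!-- Standard connection validation, as mentioned above -->
        <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
        <background-validation>true</background-validation>
    </validation>
    <security>
        <user-name>user</user-name>
        <password>password</password>
    </security>
</datasource>
```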
Please let me know if you need any more information.
| Comment by Diego Dupin [ 2017-02-10 ] |
|
Hi Kris, The "mariaDb-2%" threads are there on purpose: they are a pool of threads that takes care of graceful failover, and their number increases according to the number of active connections.
| Comment by Diego Dupin [ 2017-02-10 ] |
|
> Issue identified. There is an option, enabled only for Aurora, in which a race condition can lead to this issue. Can you confirm your application didn't have any activity at that particular moment? I'll correct this issue; until the correction is released, having the pool keep at least one connection open would solve the problem:
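A minimal sketch of that workaround, using the WildFly datasource pool elements (sizes are illustrative):

```xml
<pool>
    <!-- Keep at least one connection open at all times, so the Aurora
         failover handling always has an active connection to work with. -->
    <min-pool-size>1</min-pool-size>
    <max-pool-size>20</max-pool-size>
    <!-- Enforce the minimum strictly, even when the pool is idle. -->
    <use-strict-min>true</use-strict-min>
</pool>
```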
| Comment by Kris Iyer [ 2017-02-10 ] |
|
I am pretty sure this only happens when the system is idling. However, once in that state, even generating some traffic and creating more connections in the pool does not recover it instantly; it can take an additional hour or two, depending on what has already been created in the JVM, for things to clear out. I'll give this workaround a test and post my findings today. Thanks for the update.
| Comment by Kris Iyer [ 2017-02-10 ] |
|
Confirmed: the workaround of setting use-strict-min to true on the connection pool does help get around this issue. Thank you for the quick turnaround!