We are consistently experiencing a blocking issue on our main database server.
- The enduser keeps waiting forever after a query to get appointments (this is a huge query with a lot of subqueries that normally takes only seconds).
- On the server we see the connections that have the issue in the 'checking permissions' state for a long time (more then 2 minutes).
- There are a many connections in that state simultaneously and their active time just keeps on going up.
- When we kill the hanging connections sometimes the problem gets solved, but not always (The client automatically reconnects and retries the query.)
- We have around a 1000 schemas and when the problem happens it only happens for a few of them. We have no clue yet what is different for them.
- The issue appears usually in the early afternoon when the load on the server is higher.
- We had the issue the first time a few months ago. Now it is back every day since a week.
- The application is a swing application that connects to the database via jdbc.
- The customers connect from all over canada to our database server directly. When the issue happens we can reproduce it also from our offices where we have a fat internet line.
- We see nothing appear in the mysql error logs.
- mysqld runs on a 31G 8cpu machine and it takes 17G. When the issue is manifesting mysqld takes up to 6 of the 8 cpus at 100%.
They have lasted from 100 seconds upwards of 2000 seconds, and then either clear out or they restart mysqld.
It is only a certain query that causes this, and while the query is run against multiple schemas, it generally only hangs on a couple of the schemas. Note that most of the time this query runs fast (couple seconds), but only every now and again, this hanging in "checking permissions" starts. Generally the load is higher, but not always.
They are using MariaDB 5.5.59 (latest 5.5), but it also occurred on an earlier 5.5.45. I had them upgrade to rule that out. Also, early indications seemed like their table_cache and table_definition_cache (and max_open_tables) were quite low. So we increased those significantly, but the problem still persists.
Nothing is logged to the error log about it.
The oldest active transaction is the one hung, so it is confusing what could be blocking this.
For instance, here are the last 3 transactions from the latest SHOW ENGINE INNODB STATUS when this occurred: