Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15222

Connections hanging in 'checking permissions' state

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Fix
    • 5.5.45, 5.5.59
    • N/A
    • OTHER
    • None
    • CentOS release 6.7 (Final)

    Description

      We are consistently experiencing a blocking issue on our main database server.

      Symptoms:

      • The enduser keeps waiting forever after a query to get appointments (this is a huge query with a lot of subqueries that normally takes only seconds).
      • On the server we see the connections that have the issue in the 'checking permissions' state for a long time (more then 2 minutes).
      • There are a many connections in that state simultaneously and their active time just keeps on going up.
      • When we kill the hanging connections sometimes the problem gets solved, but not always (The client automatically reconnects and retries the query.)
      • We have around a 1000 schemas and when the problem happens it only happens for a few of them. We have no clue yet what is different for them.
      • The issue appears usually in the early afternoon when the load on the server is higher.
      • We had the issue the first time a few months ago. Now it is back every day since a week.
      • The application is a swing application that connects to the database via jdbc.
      • The customers connect from all over canada to our database server directly. When the issue happens we can reproduce it also from our offices where we have a fat internet line.
      • We see nothing appear in the mysql error logs.
      • mysqld runs on a 31G 8cpu machine and it takes 17G. When the issue is manifesting mysqld takes up to 6 of the 8 cpus at 100%.

      They have lasted from 100 seconds upwards of 2000 seconds, and then either clear out or they restart mysqld.

      It is only a certain query that causes this, and while the query is run against multiple schemas, it generally only hangs on a couple of the schemas. Note that most of the time this query runs fast (couple seconds), but only every now and again, this hanging in "checking permissions" starts. Generally the load is higher, but not always.

      They are using MariaDB 5.5.59 (latest 5.5), but it also occurred on an earlier 5.5.45. I had them upgrade to rule that out. Also, early indications seemed like their table_cache and table_definition_cache (and max_open_tables) were quite low. So we increased those significantly, but the problem still persists.

      Nothing is logged to the error log about it.

      The oldest active transaction is the one hung, so it is confusing what could be blocking this.

      For instance, here are the last 3 transactions from the latest SHOW ENGINE INNODB STATUS when this occurred:

      ---TRANSACTION 2A765CFDB, ACTIVE 2044 sec
      mysql tables in use 12, locked 0
      MySQL thread id 95, OS thread handle 0x7f59a8d2a700, query id 74203968 X.X.X.X db1 checking permissions
      select ...
      Trx read view will not see trx with id >= 2A765CFDC, sees < 2A765458C
      ---TRANSACTION 2A7657FA5, ACTIVE 2061 sec
      mysql tables in use 12, locked 0
      MySQL thread id 224, OS thread handle 0x7f59a5a00700, query id 352097814 X.X.X.X db1 checking permissions
      select ...
      Trx read view will not see trx with id >= 2A7657FA6, sees < 2A765458C
      ---TRANSACTION 2A7654ADE, ACTIVE 2075 sec
      mysql tables in use 12, locked 0
      MySQL thread id 76, OS thread handle 0x7f59a9295700, query id 70280283 X.X.X.X db1 checking permissions
      select ...
      Trx read view will not see trx with id >= 2A7654ADF, sees < 2A765458C
      ----------------------------
      END OF INNODB MONITOR OUTPUT
      ============================
      

      Attachments

        Issue Links

          Activity

            Do you have a lot of rows in the mysql.user or mysql.db table? What does

            select count(*) from mysql.user;
            select count(*) from mysql.db;
            

            return?

            serg Sergei Golubchik added a comment - Do you have a lot of rows in the mysql.user or mysql.db table? What does select count (*) from mysql. user ; select count (*) from mysql.db; return?
            ccalender Chris Calender (Inactive) added a comment - - edited

            Hello serg

            Here are the results:

            select count(*) from mysql.user = 567
            select count(*) from mysql.db = 566
            

            ccalender Chris Calender (Inactive) added a comment - - edited Hello serg Here are the results: select count(*) from mysql.user = 567 select count(*) from mysql.db = 566
            serg Sergei Golubchik added a comment - - edited

            According to the stack traces (trace apply all bt in gdb) there are 5 threads stuck in acl_getroot() (called from set_routine_security_ctx() from sp_head::execute_function()), trying to lock the acl_cache->lock mutex. And one thread in in strcmp() invoked from acl_getroot().

            Looks like a clear case showing that linear scanning or user/db lists is slow. Still, it cannot possibly take minutes, so might be coupled with some scheduler glitch?

            Anyway, there are few ideas how to speed up acl_getroot():

            • ignore the definer completely if the function does not access any tables (seems to be the case here)
            • cache the security context in the sp_head, to avoid calling acl_getroot() on every invocation
            • generally, get rid of linear scanning, and only scan for entries with wildcards, while using a hash for entries without.
            serg Sergei Golubchik added a comment - - edited According to the stack traces ( trace apply all bt in gdb) there are 5 threads stuck in acl_getroot() (called from set_routine_security_ctx() from sp_head::execute_function() ), trying to lock the acl_cache->lock mutex. And one thread in in strcmp() invoked from acl_getroot() . Looks like a clear case showing that linear scanning or user/db lists is slow. Still, it cannot possibly take minutes, so might be coupled with some scheduler glitch? Anyway, there are few ideas how to speed up acl_getroot() : ignore the definer completely if the function does not access any tables (seems to be the case here) cache the security context in the sp_head , to avoid calling acl_getroot() on every invocation generally, get rid of linear scanning, and only scan for entries with wildcards, while using a hash for entries without.

            People

              serg Sergei Golubchik
              ccalender Chris Calender (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.