Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30481

Hard lock up with queries in "Opening tables" state

Details

    Description

      We have a lot of servers that are using this same set up and configuration, and on one of them recently we started having a really odd issue that I can't explain.

      The short version is that one of the databases in the server stops responding. When we review the process list, it shows that queries are stuck in the "Opening tables" state. Other databases on this same server are still responding normally at this time.

      Killing the SQL processes from the mysql command line doesn't work. No error, it just doesn't kill it.

      Then, if we try to issue the standard systemctl restart mariadb, it looks like it tried to start shutting down, but never can. At that point, the process list looks like this (username and database name have been replaced):

      | 36649 | USERNAME | localhost | DATABASE | Killed  | 4044 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
      | 36650 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
      | 36653 | USERNAME | localhost | DATABASE | Killed  | 4038 | Opening tables | SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (3210213,3209688,1894564,1578 |    0.000 |
      ...
      ...
      ...
      

      Each time this has happened, we ended up having to issue a "kill -9" on the Mariadb master process to get it back. Once we did that, it started right back up and is running normally again.

      I have found no errors at the system level and Mariadb is not even recording any errors. Logging is still working because Mariadb does log some things during the event, like when a user doesn't use the right password. But there are zero errors logged.

      Just reviewing the documentation about this, it says that it could be caused by table_open_cache settings. On this server, there are about 1000 tables, including the tables in sys and mysql, etc. And we're using the default value of 2000 for table_open_cache right now. So I can't see how it could have anything to do with that setting.

      It's a difficult issue to debug since there is basically no data recorded about the problem and it's basically just locked up when I get to it.

      Any ideas?

      Attachments

        Issue Links

          Activity

            wk_bradp Brad added a comment -

            Thanks, Marko! The RPMs you listed are exactly what I updated to yesterday so we should be all set. It's very infrequent for us though. I'll set a reminder for myself to check reply back in 8 weeks. If we haven't had the problem by then, then I think we can call it fixed.

            wk_bradp Brad added a comment - Thanks, Marko! The RPMs you listed are exactly what I updated to yesterday so we should be all set. It's very infrequent for us though. I'll set a reminder for myself to check reply back in 8 weeks. If we haven't had the problem by then, then I think we can call it fixed.
            R1CH Richard Stanway added a comment - - edited

            I believe I am also running into this (or a very similar issue) on a 10.11.2 server. The primary DB activity is a bursty series of inserts and updates to a single table from 8 connections every minute. Almost all server settings are at default beyond these:

            innodb_buffer_pool_size = 64G
            innodb_doublewrite = 0
            innodb_file_per_table = ON
            innodb_log_write_ahead_size = 16384
            innodb_use_native_aio = 0
            innodb_use_atomic_writes = 0
            innodb_flush_neighbors = 0
            innodb_io_capacity = 1000
            innodb_io_capacity_max = 2500
            innodb_flush_log_at_trx_commit = 2
            

            Thread stacks:
            https://gist.githubusercontent.com/notr1ch/c37d05f3c537c5c3f6a3c1c4d53c43ea/raw/0f271da2deb43c166b9558a3250f2953c1e4007a/gistfile1.txt

            I also have a core file if it's helpful to get more information from.

            R1CH Richard Stanway added a comment - - edited I believe I am also running into this (or a very similar issue) on a 10.11.2 server. The primary DB activity is a bursty series of inserts and updates to a single table from 8 connections every minute. Almost all server settings are at default beyond these: innodb_buffer_pool_size = 64G innodb_doublewrite = 0 innodb_file_per_table = ON innodb_log_write_ahead_size = 16384 innodb_use_native_aio = 0 innodb_use_atomic_writes = 0 innodb_flush_neighbors = 0 innodb_io_capacity = 1000 innodb_io_capacity_max = 2500 innodb_flush_log_at_trx_commit = 2 Thread stacks: https://gist.githubusercontent.com/notr1ch/c37d05f3c537c5c3f6a3c1c4d53c43ea/raw/0f271da2deb43c166b9558a3250f2953c1e4007a/gistfile1.txt I also have a core file if it's helpful to get more information from.

            I've been using the custom build from https://ci.mariadb.org/33215/ since my last comment and the problem did not reproduce (previously it happened again within 4 hours of the server restart). I've just upgraded to 10.11.3 which seems to include this fix, and hopefully this problem is now solved.

            R1CH Richard Stanway added a comment - I've been using the custom build from https://ci.mariadb.org/33215/ since my last comment and the problem did not reproduce (previously it happened again within 4 hours of the server restart). I've just upgraded to 10.11.3 which seems to include this fix, and hopefully this problem is now solved.
            wk_bradp Brad added a comment -

            It hasn't quite been 6 weeks since we started using the patched version but I think we may be close enough to call this as fixed in my eyes. Thanks for all of your hard work! I'll report back if we do end up having issues again

            wk_bradp Brad added a comment - It hasn't quite been 6 weeks since we started using the patched version but I think we may be close enough to call this as fixed in my eyes. Thanks for all of your hard work! I'll report back if we do end up having issues again

            Thank you, wk_bradp. Coincidentally, MariaDB Server 10.6.13 was just released today.

            marko Marko Mäkelä added a comment - Thank you, wk_bradp . Coincidentally, MariaDB Server 10.6.13 was just released today.

            People

              marko Marko Mäkelä
              wk_bradp Brad
              Votes:
              3 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.