Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33315

InnoDB history length and undo tablespace files keep growing

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Incomplete
    • 10.11.6
    • N/A
    • None
    • CloudLinux 7h

    Description

      Hello, after an upgrade from 10.5.19 to 10.11.6 on one of our servers we've started getting complaints on DDL not working on some databases.
      As we checked, we have noticed affected databases had `MDL_SHARED` `Table metadata lock` with thread_id 0 which led us to check innodb status and recently enabled innodb_undo_tablespaces, which are now 16GB and appear to be growing in tandem with history list length.

      Trx id counter 23795545285
      Purge done for trx's n:o < 23673930766 undo n:o < 0 state: running
      History list length 51823666
       
      -rw-rw---- 1 mysql mysql 16928210944 Jan 25 20:35 /var/lib/mysql/undo001
      -rw-rw---- 1 mysql mysql 15804137472 Jan 25 20:33 /var/lib/mysql/undo002
      -rw-rw---- 1 mysql mysql 15988686848 Jan 25 20:34 /var/lib/mysql/undo003
      

      It seems to be stuck at the `23673930766` trx. The only purge related configuration change was innodb_purge_threads to 32, rest are defaults:

      MariaDB [(none)]> show global variables like '%undo%';
      +--------------------------+----------+
      | Variable_name            | Value    |
      +--------------------------+----------+
      | innodb_max_undo_log_size | 10485760 |
      | innodb_undo_directory    | ./       |
      | innodb_undo_log_truncate | OFF      |
      | innodb_undo_tablespaces  | 3        |
      +--------------------------+----------+
      4 rows in set (0.041 sec)
       
      MariaDB [(none)]> show global variables like '%purge%';
      +--------------------------------------+------------+
      | Variable_name                        | Value      |
      +--------------------------------------+------------+
      | aria_log_purge_type                  | immediate  |
      | innodb_max_purge_lag                 | 0          |
      | innodb_max_purge_lag_delay           | 0          |
      | innodb_max_purge_lag_wait            | 4294967295 |
      | innodb_purge_batch_size              | 1000       |
      | innodb_purge_rseg_truncate_frequency | 128        |
      | innodb_purge_threads                 | 32         |
      | relay_log_purge                      | ON         |
      +--------------------------------------+------------+
      8 rows in set (0.001 sec)
       
      $ mysql --version
      mysql  Ver 15.1 Distrib 10.11.6-MariaDB, for Linux (x86_64) using readline 5.1
      

      Does seem very similar to https://jira.mariadb.org/browse/MDEV-31676, but our innodb_max_undo_log_size is default and is undo logs are way over limit, so purges should be going.

      We've already tried experimenting with max_purge_lag% variables, but nothing appeared to help. Also, not all servers from our fleet that got upgraded seem to be affected either, with same configuration.

      If there's any other information needed, please let us know.

      Attachments

        Issue Links

          Activity

            Hi arnklo, can you try if executing

            SET GLOBAL innodb_max_purge_lag_wait=0;
            

            would make the purge of history run? This statement should be blocked until all history has been purged. You can try executing SHOW ENGINE INNODB STATUS from another connection, to see if there is any progress.

            The fix of MDEV-33213 was included in the latest quarterly releases. Would the problem be solved by upgrading to MariaDB Server 10.11.7?

            marko Marko Mäkelä added a comment - Hi arnklo , can you try if executing SET GLOBAL innodb_max_purge_lag_wait=0; would make the purge of history run? This statement should be blocked until all history has been purged. You can try executing SHOW ENGINE INNODB STATUS from another connection, to see if there is any progress. The fix of MDEV-33213 was included in the latest quarterly releases. Would the problem be solved by upgrading to MariaDB Server 10.11.7?
            arnklo Arnas Klova added a comment -

            Hey Marko,

            Sorry for the late update; we've been waiting for the governor-enabled release of 10.11.7 from CloudLinux, which they released to beta today. Interestingly, the database copy using an official MariaDB release from rpm managed to purge history.

            Trx id counter 23965298920
            Purge done for trx's n:o < 23965298918 undo n:o < 0 state: running but idle
            History list length 0
            LIST OF TRANSACTIONS FOR EACH SESSION:
            

            On the other hand, the original, in-use database after the upgrade today still appears to be stuck, innodb_max_purge_lag_wait does not seem to help:

            Trx id counter 24278697848
            Purge done for trx's n:o < 23673930766 undo n:o < 116 state: running
            History list length 261273486
            

            This seems wrong; we're moving the copy before purge to another server with CL 10.11.7 and standard release to check if it's replicable.

            arnklo Arnas Klova added a comment - Hey Marko, Sorry for the late update; we've been waiting for the governor-enabled release of 10.11.7 from CloudLinux, which they released to beta today. Interestingly, the database copy using an official MariaDB release from rpm managed to purge history. Trx id counter 23965298920 Purge done for trx's n:o < 23965298918 undo n:o < 0 state: running but idle History list length 0 LIST OF TRANSACTIONS FOR EACH SESSION: On the other hand, the original, in-use database after the upgrade today still appears to be stuck, innodb_max_purge_lag_wait does not seem to help: Trx id counter 24278697848 Purge done for trx's n:o < 23673930766 undo n:o < 116 state: running History list length 261273486 This seems wrong; we're moving the copy before purge to another server with CL 10.11.7 and standard release to check if it's replicable.

            Hi arnklo, thank you. When it comes to the InnoDB subsystem, the development snapshot that I mentioned to you is identical with the mariadb-10.11.7 release tag: the command

            git diff --name-only 307c7da1dcfae1c0362e491eac6f2fea03199491..mariadb-10.11.7
            

            does not list any files in storage/innobase.

            One thing that would prevent history from being purged is that some transactions exist in XA PREPARE state. But, there are no such transactions, because the LIST OF TRANSACTIONS FOR EACH SESSION: in the SHOW ENGINE INNODB STATUS output is empty.

            It seems that the undo logs are somehow corrupted in the server instance where the purge is stuck. Probably the easiest way out is to rebuild the database using a logical dump. If you are using mysqldump or mariadb-dump, you should specify the --no-autocommit option so that the InnoDB bulk loading mechanism (MDEV-515, MDEV-24621) will be enabled when loading the dump.

            marko Marko Mäkelä added a comment - Hi arnklo , thank you. When it comes to the InnoDB subsystem, the development snapshot that I mentioned to you is identical with the mariadb-10.11.7 release tag: the command git diff --name-only 307c7da1dcfae1c0362e491eac6f2fea03199491..mariadb-10.11.7 does not list any files in storage/innobase . One thing that would prevent history from being purged is that some transactions exist in XA PREPARE state. But, there are no such transactions, because the LIST OF TRANSACTIONS FOR EACH SESSION: in the SHOW ENGINE INNODB STATUS output is empty. It seems that the undo logs are somehow corrupted in the server instance where the purge is stuck. Probably the easiest way out is to rebuild the database using a logical dump. If you are using mysqldump or mariadb-dump , you should specify the --no-autocommit option so that the InnoDB bulk loading mechanism ( MDEV-515 , MDEV-24621 ) will be enabled when loading the dump.

            arnklo, one more thing that you could on that problematic server instance (or a physical copy of the files) is to execute a shutdown after SET GLOBAL innodb_fast_shutdown=0;. Would that get stuck too?

            marko Marko Mäkelä added a comment - arnklo , one more thing that you could on that problematic server instance (or a physical copy of the files) is to execute a shutdown after SET GLOBAL innodb_fast_shutdown=0; . Would that get stuck too?
            arnklo Arnas Klova added a comment -

            Hello, to follow up on this, with 10.11.7, through trial and error while experimenting with the affected copy, we ended up finding a procedure that would get transaction purge unstuck.
            The most important part is to restart the MariaDB server when there are exactly 0 threads running. Otherwise, it won't work.
            For this, we switched port and socket name in my.cnf, restarted service (at this point there are 0 active threads, and transactions are still stuck), waited a few seconds, then restarted again (at this point purge trx id gets unstuck, and HLL starts dropping), restored the port and socket back to original and did a final restart.
            For the record, innodb_fast_shutdown=0 was stuck for over 40 minutes, and we ended up cancelling
            Hopefully, it helps.

            arnklo Arnas Klova added a comment - Hello, to follow up on this, with 10.11.7, through trial and error while experimenting with the affected copy, we ended up finding a procedure that would get transaction purge unstuck. The most important part is to restart the MariaDB server when there are exactly 0 threads running. Otherwise, it won't work. For this, we switched port and socket name in my.cnf, restarted service (at this point there are 0 active threads, and transactions are still stuck), waited a few seconds, then restarted again (at this point purge trx id gets unstuck, and HLL starts dropping), restored the port and socket back to original and did a final restart. For the record, innodb_fast_shutdown=0 was stuck for over 40 minutes, and we ended up cancelling Hopefully, it helps.

            People

              marko Marko Mäkelä
              arnklo Arnas Klova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.