[MDEV-33315] InnoDB history length and undo tablespace files keep growing - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Incomplete
Affects Version/s: 10.11.6
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
None
Environment:
CloudLinux 7h

Description

Hello, after an upgrade from 10.5.19 to 10.11.6 on one of our servers we've started getting complaints on DDL not working on some databases.
As we checked, we have noticed affected databases had `MDL_SHARED` `Table metadata lock` with thread_id 0 which led us to check innodb status and recently enabled innodb_undo_tablespaces, which are now 16GB and appear to be growing in tandem with history list length.

Trx id counter 23795545285

Purge done for trx's n:o < 23673930766 undo n:o < 0 state: running

History list length 51823666

-rw-rw---- 1 mysql mysql 16928210944 Jan 25 20:35 /var/lib/mysql/undo001

-rw-rw---- 1 mysql mysql 15804137472 Jan 25 20:33 /var/lib/mysql/undo002

-rw-rw---- 1 mysql mysql 15988686848 Jan 25 20:34 /var/lib/mysql/undo003

It seems to be stuck at the `23673930766` trx. The only purge related configuration change was innodb_purge_threads to 32, rest are defaults:

MariaDB [(none)]> show global variables like '%undo%';

+--------------------------+----------+

| Variable_name            | Value    |

+--------------------------+----------+

| innodb_max_undo_log_size | 10485760 |

| innodb_undo_directory    | ./       |

| innodb_undo_log_truncate | OFF      |

| innodb_undo_tablespaces  | 3        |

+--------------------------+----------+

4 rows in set (0.041 sec)

MariaDB [(none)]> show global variables like '%purge%';

+--------------------------------------+------------+

| Variable_name                        | Value      |

+--------------------------------------+------------+

| aria_log_purge_type                  | immediate  |

| innodb_max_purge_lag                 | 0          |

| innodb_max_purge_lag_delay           | 0          |

| innodb_max_purge_lag_wait            | 4294967295 |

| innodb_purge_batch_size              | 1000       |

| innodb_purge_rseg_truncate_frequency | 128        |

| innodb_purge_threads                 | 32         |

| relay_log_purge                      | ON         |

+--------------------------------------+------------+

8 rows in set (0.001 sec)

$ mysql --version

mysql  Ver 15.1 Distrib 10.11.6-MariaDB, for Linux (x86_64) using readline 5.1

Does seem very similar to https://jira.mariadb.org/browse/MDEV-31676, but our innodb_max_undo_log_size is default and is undo logs are way over limit, so purges should be going.

We've already tried experimenting with max_purge_lag% variables, but nothing appeared to help. Also, not all servers from our fleet that got upgraded seem to be affected either, with same configuration.

If there's any other information needed, please let us know.

Attachments

Issue Links

relates to

MDEV-33213 History list is not shrunk unless there is a pause in the workload

Closed

MDEV-33325 Crash in flst_read_addr on corrupted data

Closed

Activity

Ascending order - Click to sort in descending order

View 6 older comments

Marko Mäkelä added a comment - 2024-02-12 12:45

Hi arnklo, can you try if executing

SET GLOBAL innodb_max_purge_lag_wait=0;

would make the purge of history run? This statement should be blocked until all history has been purged. You can try executing SHOW ENGINE INNODB STATUS from another connection, to see if there is any progress.

The fix of ~~MDEV-33213~~ was included in the latest quarterly releases. Would the problem be solved by upgrading to MariaDB Server 10.11.7?

Marko Mäkelä added a comment - 2024-02-12 12:45 Hi arnklo , can you try if executing SET GLOBAL innodb_max_purge_lag_wait=0; would make the purge of history run? This statement should be blocked until all history has been purged. You can try executing SHOW ENGINE INNODB STATUS from another connection, to see if there is any progress. The fix of MDEV-33213 was included in the latest quarterly releases. Would the problem be solved by upgrading to MariaDB Server 10.11.7?

Arnas Klova added a comment - 2024-02-21 19:06

Hey Marko,

Sorry for the late update; we've been waiting for the governor-enabled release of 10.11.7 from CloudLinux, which they released to beta today. Interestingly, the database copy using an official MariaDB release from rpm managed to purge history.

Trx id counter 23965298920

Purge done for trx's n:o < 23965298918 undo n:o < 0 state: running but idle

History list length 0

LIST OF TRANSACTIONS FOR EACH SESSION:

On the other hand, the original, in-use database after the upgrade today still appears to be stuck, innodb_max_purge_lag_wait does not seem to help:

Trx id counter 24278697848

Purge done for trx's n:o < 23673930766 undo n:o < 116 state: running

History list length 261273486

This seems wrong; we're moving the copy before purge to another server with CL 10.11.7 and standard release to check if it's replicable.

Arnas Klova added a comment - 2024-02-21 19:06 Hey Marko, Sorry for the late update; we've been waiting for the governor-enabled release of 10.11.7 from CloudLinux, which they released to beta today. Interestingly, the database copy using an official MariaDB release from rpm managed to purge history. Trx id counter 23965298920 Purge done for trx's n:o < 23965298918 undo n:o < 0 state: running but idle History list length 0 LIST OF TRANSACTIONS FOR EACH SESSION: On the other hand, the original, in-use database after the upgrade today still appears to be stuck, innodb_max_purge_lag_wait does not seem to help: Trx id counter 24278697848 Purge done for trx's n:o < 23673930766 undo n:o < 116 state: running History list length 261273486 This seems wrong; we're moving the copy before purge to another server with CL 10.11.7 and standard release to check if it's replicable.

Marko Mäkelä added a comment - 2024-02-22 07:57

Hi arnklo, thank you. When it comes to the InnoDB subsystem, the development snapshot that I mentioned to you is identical with the mariadb-10.11.7 release tag: the command

git diff --name-only 307c7da1dcfae1c0362e491eac6f2fea03199491..mariadb-10.11.7

does not list any files in storage/innobase.

One thing that would prevent history from being purged is that some transactions exist in XA PREPARE state. But, there are no such transactions, because the LIST OF TRANSACTIONS FOR EACH SESSION: in the SHOW ENGINE INNODB STATUS output is empty.

It seems that the undo logs are somehow corrupted in the server instance where the purge is stuck. Probably the easiest way out is to rebuild the database using a logical dump. If you are using mysqldump or mariadb-dump, you should specify the --no-autocommit option so that the InnoDB bulk loading mechanism (~~MDEV-515~~, ~~MDEV-24621~~) will be enabled when loading the dump.

Marko Mäkelä added a comment - 2024-02-22 07:57 Hi arnklo , thank you. When it comes to the InnoDB subsystem, the development snapshot that I mentioned to you is identical with the mariadb-10.11.7 release tag: the command git diff --name-only 307c7da1dcfae1c0362e491eac6f2fea03199491..mariadb-10.11.7 does not list any files in storage/innobase . One thing that would prevent history from being purged is that some transactions exist in XA PREPARE state. But, there are no such transactions, because the LIST OF TRANSACTIONS FOR EACH SESSION: in the SHOW ENGINE INNODB STATUS output is empty. It seems that the undo logs are somehow corrupted in the server instance where the purge is stuck. Probably the easiest way out is to rebuild the database using a logical dump. If you are using mysqldump or mariadb-dump , you should specify the --no-autocommit option so that the InnoDB bulk loading mechanism ( MDEV-515 , MDEV-24621 ) will be enabled when loading the dump.

Marko Mäkelä added a comment - 2024-02-22 09:53

arnklo, one more thing that you could on that problematic server instance (or a physical copy of the files) is to execute a shutdown after SET GLOBAL innodb_fast_shutdown=0;. Would that get stuck too?

Marko Mäkelä added a comment - 2024-02-22 09:53 arnklo , one more thing that you could on that problematic server instance (or a physical copy of the files) is to execute a shutdown after SET GLOBAL innodb_fast_shutdown=0; . Would that get stuck too?

Arnas Klova added a comment - 2024-06-06 07:09

Hello, to follow up on this, with 10.11.7, through trial and error while experimenting with the affected copy, we ended up finding a procedure that would get transaction purge unstuck.
The most important part is to restart the MariaDB server when there are exactly 0 threads running. Otherwise, it won't work.
For this, we switched port and socket name in my.cnf, restarted service (at this point there are 0 active threads, and transactions are still stuck), waited a few seconds, then restarted again (at this point purge trx id gets unstuck, and HLL starts dropping), restored the port and socket back to original and did a final restart.
For the record, innodb_fast_shutdown=0 was stuck for over 40 minutes, and we ended up cancelling
Hopefully, it helps.

Arnas Klova added a comment - 2024-06-06 07:09 Hello, to follow up on this, with 10.11.7, through trial and error while experimenting with the affected copy, we ended up finding a procedure that would get transaction purge unstuck. The most important part is to restart the MariaDB server when there are exactly 0 threads running. Otherwise, it won't work. For this, we switched port and socket name in my.cnf, restarted service (at this point there are 0 active threads, and transactions are still stuck), waited a few seconds, then restarted again (at this point purge trx id gets unstuck, and HLL starts dropping), restored the port and socket back to original and did a final restart. For the record, innodb_fast_shutdown=0 was stuck for over 40 minutes, and we ended up cancelling Hopefully, it helps.

MariaDB Server

InnoDB history length and undo tablespace files keep growing

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration