[MDEV-33315] InnoDB history length and undo tablespace files keep growing Created: 2024-01-25 Updated: 2024-02-05 |
|
| Status: | Needs Feedback |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.11.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Arnas Klova | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CloudLinux 7h |
||
| Issue Links: |
|
||||||||||||
| Description |
|
Hello, after an upgrade from 10.5.19 to 10.11.6 on one of our servers we've started getting complaints on DDL not working on some databases.
It seems to be stuck at the `23673930766` trx. The only purge related configuration change was innodb_purge_threads to 32, rest are defaults:
Does seem very similar to https://jira.mariadb.org/browse/MDEV-31676, but our innodb_max_undo_log_size is default and is undo logs are way over limit, so purges should be going. We've already tried experimenting with max_purge_lag% variables, but nothing appeared to help. Also, not all servers from our fleet that got upgraded seem to be affected either, with same configuration. If there's any other information needed, please let us know. |
| Comments |
| Comment by Marko Mäkelä [ 2024-01-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||
|
We see that the last assigned transaction identifier (or next to-be-assigned identifier, I do not remember which) is 121,614,519 more than the last purged transaction identifier. Transaction identifiers are assigned on both transaction start (DB_TRX_ID written to the records) and commit (the "transaction number" written to the undo log is what purge cares about). If we assume that every started transaction is also committed (and not rolled back, or stuck indefinitely in XA PREPARE state), this would correspond to 60,807,259½ transactions. But, the history list length is only 51,823,666 transactions. In some other cases that I think would be fixed by I would suspect that you have some old transaction lingering in the system, possibly in XA PREPARE state. Even if you never used such statements, the binlog and replication subsystem will internally use two-phase commit in an attempt to keep the binlog consistent with InnoDB. There could be a bug in this area. Can you check the full SHOW ENGINE INNODB STATUS output and look for ACTIVE or ACTIVE (PREPARED) transactions whose identifier is significantly less than the last assigned transaction identifier? | |||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arnas Klova [ 2024-01-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you for looking into this. Here's output from `SHOW ENGINE INNODB STATUS` `TRANSACTIONS` block. Unfortunately, there we cannot find any active transactions with anomalous identifier that you've mentioned. Any new ACTIVE state transactions all appear to be close in value to `non started` transactions in the list.
| |||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you, I see that there currently only exist read-only transactions. No persistent identifier is assigned for them; that is why they are identified as memory addresses in the output. Would you be willing to try a development snapshot where the | |||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arnas Klova [ 2024-01-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello, thanks for the information and suggestions. During the day we've made a really dirty copy of production data (ib* files after shutdown, undo00* while running) to minimize unscheduled downtime and tried it out with the development snapshot you've mentioned. Results do seem to be promising, as history length began decreasing:
A short moment later, MariaDB server crashed, likely because of the aforementioned semi-broken copy:
We could try out with proper copy later to see if crash goes away, if needed, it would just require for us to schedule maintenance, which would take a few days. | |||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you for the update. Yes, I think that such a crash could occur due to using an inconsistent copy of the database. I filed MDEV-33325 for that crash on the corrupted data. | |||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arnas Klova [ 2024-02-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello, thank you for filling out https://jira.mariadb.org/browse/MDEV-33325. We've got clean copy on another server as affected one is getting low on space due to growing undo logs. Unfortunately, it seems that the history length is stuck even with development build, purge is at the same exact transaction.
|