[MDEV-8421] ibdata1 size growing too much but cannot be explained by usage Created: 2015-07-03  Updated: 2021-01-16  Resolved: 2021-01-16

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, Storage Engine - XtraDB
Affects Version/s: 10.0.14-galera
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Petros Moisiadis Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Environment:

Linux db1 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2 x86_64 GNU/Linux



 Description   

We recently had an issue with one node of our 3-node galera cluster eating all hard disk space. It turned out to be that the ibdata1 file had grown up too much in size. However, this increase cannot be explained by the cluster's everyday usage.

The cluster is hosting one database that has reached the size of 25GB over a span of 5 years. When all free space was consumed, the size of ibdata1 had grown up to ~12GB, which is about 48% of the whole database size. Using 'innochecksum' and 'innodb_space' tools, we found out that 97% of pages in the ibdata1 file is undo log pages.

Unfortunately the cluster was loosely monitored at the time of the event, so we do not know if the increase was rapid or increamental over a longer period of time. In any case, the size of 12GB seems to be too big to have been caused by normal usage. After recovering from the situation (we rebootstrapped the cluster on one of the other two nodes and rejoined the affected node after removing its data files), we started monitoring ibdata1 on all 3 nodes. With the same everyday usage on the cluster, ibdata1 now remains the same size (~76MB) on all 3 nodes. So we are not seeing any incremental increase in size.

Some other, possibly relevant, facts:

  • At the time of the event, we got innodb status output with a "History list length" value of 40957337 which seems pretty big.
  • We had recenlty (before the event) introduced some changes so that one of our applications was issuing multiple queries doing "INSERT INTO ... SELECT ... " in the same query, which is considered bad practice as it introduces locks that can be avoided if the SELECT is done separately (we have now fixed this).
  • We had also configured the nodes with "innodb_locks_unsafe_for_binlog=1".

We have kept a copy of the 12GB ibdata1 file for further investigation, but we are not sure about what to investigate. Maybe, there is some repeatability of data that could make us think that an endless loop wrote 12GB of data before eating all space, or maybe there is something else.

We will really appreciate it if we get any advice on this by innodb / xtradb experts out there.

Thank you in advance.



 Comments   
Comment by Elena Stepanova [ 2015-07-08 ]

jplindst,

Could you please take a look at this and maybe advise on how to proceed with the investigation more efficiently.

Comment by Jan Lindström (Inactive) [ 2015-07-21 ]

See http://blog.jcole.us/2014/04/16/the-basics-of-the-innodb-undo-logging-and-history-system/.

While InnoDB purge does freeing undo log pages and unlinking them from the global history list to make them available for re-use. They remain on ibdata file. These free pages are never currently returned to OS and naturally there are several situations where these pages can't be really re-used e.g. on situation where new undo pages are bigger than fragmented datafile pages.

I do not know any means to really investigate after something mysterious has happened, only way is to monitor current disk usage and if some unexplained allocations are done see what operations really did cause that.

Comment by Elena Stepanova [ 2015-08-06 ]

jplindst,

According to the initial description, the mysteriously big ibdata file is available. Is there any way to analyze its contents to see what really takes all the space, whether it the undo pages or anything else?

Comment by Petros Moisiadis [ 2015-08-17 ]

As we have already reported, 97% of pages consists of undo pages indeed, but the size of the data is too big to have been produced by normal usage. So, we suspect that something has stuck and caused an abnormal amount of undo logs to be written. Since an infinite loop could lead to a lot of undo logs written with the same data, we may need to analyze that data for repeating patterns and then see if there is a code path that could lead to such loop...

It's also important to note that this happened only few days after we intorduced the "INSERT INTO ... SELECT ..." issue in our code and has not happened again since fixing this. So, it is very likely that the bug was triggered by our, at that time, unoptimized code.

Comment by Marko Mäkelä [ 2016-11-23 ]

The worst case scenario with regard to undo log growth involves big transactions with DML operations and long-running transactions (which could be the large DML operation, or just some read-only transaction that was started but not committed yet).
It could be good to split big transactions into smaller ones, so that the undo logs can be discarded earlier. For INSERT operations, undo logs will be discarded already at COMMIT. For UPDATE or DELETE operations (or INSERT that overwrites a previously deleted record that had not yet been purged), the undo logs will be preserved until purge removes them.
Long-running transactions would prevent the InnoDB purge subsystem from removing old update_undo logs for transactions that were committed after the MVCC read view for the long-running transaction was started.

An unlikely possibility could be a bug in the InnoDB page allocation, causing the data file to grow even though reusable (freed) pages exist. https://bugs.mysql.com/bug.php?id=59783 was fixed in 2011.

Generated at Thu Feb 08 07:27:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.