[MDEV-25215] Excessive logging "Cannot close file ./test/FTS_xxx.ibd because of n pending operations and pending fsync" Created: 2021-03-22  Updated: 2023-03-10  Resolved: 2021-09-30

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6
Fix Version/s: 10.5.13, 10.6.5

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 4
Labels: regression-10.5

Attachments: JPEG File cannotclosefile.JPG     Text File mycnf.txt     Text File mysql-0-20211028.log     PNG File screenshot-1.png     PNG File screenshot-2.png    
Issue Links:
Blocks
Problem/Incident
is caused by MDEV-23855 InnoDB log checkpointing causes regre... Closed
Relates
relates to MDEV-26381 InnoDB: Failing assertion: fsize != o... Open
relates to MDEV-28110 FLUSH TABLE ... WITH READ LOCK output... Open
relates to MDEV-30829 InnoDB: Cannot close file <tablename>... Closed

 Description   

Within a ~1-minute test on current 10.6 with the default log_warnings level, I'm getting ~8500 records like this in the error log:

2021-03-22  2:58:35 8 [Note] InnoDB: Cannot close file ./test/FTS_00000000000012c8_00000000000015f9_INDEX_1.ibd because of 0 pending operations and pending fsync
2021-03-22  2:58:35 8 [Note] InnoDB: Cannot close file ./test/FTS_00000000000012c8_00000000000015f9_INDEX_2.ibd because of 0 pending operations and pending fsync
2021-03-22  2:58:35 8 [Note] InnoDB: Cannot close file ./test/FTS_00000000000012c8_00000000000015f9_INDEX_4.ibd because of 0 pending operations and pending fsync
2021-03-22  2:58:35 8 [Note] InnoDB: Cannot close file ./test/FTS_00000000000012c8_00000000000015f9_INDEX_5.ibd because of 1 pending operations and pending fsync
2021-03-22  2:58:35 8 [Note] InnoDB: Cannot close file ./test/FTS_00000000000012c8_00000000000015f9_INDEX_6.ibd because of 0 pending operations and pending fsync
...

10.5 seems to produce only the ones with "0 pending operations", so there are fewer of them, but still ~4000 in a few-minute test. And these ones (with zero) are especially confusing.

I think that if these are real problems which a user needs to take care of, they should be warnings or errors (and given their persistence, maybe we are missing at least a debug assertion or something); and if they are just technical notifications, there shouldn't be so many of them.



 Comments   
Comment by Simon Avery [ 2021-08-10 ]

We encountered this bug today on a busy production server and researching brought us here.

MariaDB server version: 10.5.11 (from MariaDB's C7 repo)
OS: CentOS 7

2021-08-09 22:49:11 5443978 [Note] InnoDB: Cannot close file ./DB0/Tab1.ibd because of 20 pending operations and pending fsync
2021-08-03 13:05:10 145279 [Note] InnoDB: Cannot close file ./DB1/Tab1#P#per.ibd because of 0 pending operations and pending fsync
2021-08-03 13:05:10 145911 [Note] InnoDB: Cannot close file ./DB1/Tab1#P#per.ibd because of 0 pending operations and pending fsync
2021-08-03 13:05:10 145515 [Note] InnoDB: Cannot close file ./DB1/Tab1#P#per.ibd because of 0 pending operations and pending fsync

Variations of this, affecting multiple databases and tables.

Total number of similar log entries is 200149 during this period.

Restarting MariaDB appears to have (temporarily) resolved the issue.

Comment by Simon Avery [ 2021-08-18 ]

Still ongoing for us. There is some suggestion that it's linked to some queries that aren't completing, although I can't confirm that definitively.

To give some indication of the sheer number of lines this is creating in the log, these are the counts from one server over 14 days.
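A per-day tally like the one described can be produced with a short script; the following is only a sketch, with sample lines copied from this report (in practice you would read them from the server's actual error log file):

```python
# Sketch: tally "Cannot close file" notes per day in a MariaDB error log.
# The lines below are samples from this report; replace them with the
# contents of the real error log.
from collections import Counter

log_lines = [
    "2021-08-03 13:05:10 145279 [Note] InnoDB: Cannot close file ./DB1/Tab1#P#per.ibd because of 0 pending operations and pending fsync",
    "2021-08-03 13:05:10 145911 [Note] InnoDB: Cannot close file ./DB1/Tab1#P#per.ibd because of 0 pending operations and pending fsync",
    "2021-08-09 22:49:11 5443978 [Note] InnoDB: Cannot close file ./DB0/Tab1.ibd because of 20 pending operations and pending fsync",
]

per_day = Counter(
    line.split()[0]                      # the leading date field
    for line in log_lines
    if "InnoDB: Cannot close file" in line
)
for day, count in sorted(per_day.items()):
    print(day, count)
```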

Comment by Marko Mäkelä [ 2021-08-30 ]

digdilem, do you see this with MariaDB 10.5.12? MDEV-25113, MDEV-25801, MDEV-25954 were fixed in that release.

Comment by Simon Avery [ 2021-08-31 ]

Hi Marko

Maybe, maybe not...

I have two machines reporting this.

Server 1 is on 10.5.9, so let's not focus on that. That's still reporting it occasionally, but not in volume.

Server 2 was upgraded from 10.5.11 to 10.5.12 (from MariaDB.org's CentOS7/EL7 repos) on 3rd August (just before my last report). It's still reporting occasional output like so:

2021-08-21 21:39:37 4117695 [Note] InnoDB: Cannot close file ./Db1/Tb1.ibd because of 0 pending operations and pending fsync

However, nothing more recent than this - so that's 10 days clear.

I'll continue to monitor and should it resurface on anything 10.5.12 onwards, I'll update this.

Thanks all,

Comment by Marko Mäkelä [ 2021-09-14 ]

elenst, can you please check if you can yourself reproduce this with recent 10.5 or 10.6 trees?

I see that we had an internal discussion on this on August 30, and mleich only found a very small number of occurrences of this (I guess, from runs with rr record, which would at times create very ‘unfair’ scheduling of threads). He conducted some tests on HDD in order to get more realistic I/O performance and was not able to reproduce massive amounts of this.

Comment by Elena Stepanova [ 2021-09-14 ]

No, sorry, I can't, not in any near future. I assume that if Matthias was able to reproduce them at all, he can take one of the versions for which the complaints were raised and run the tests on a normal non-rr release-like build.

Comment by Marko Mäkelä [ 2021-09-23 ]

digdilem, how does it look after more time running MariaDB 10.5.12?

Apart from MDEV-25113, MDEV-25801, MDEV-25954, which were fixed in 10.5.12, there is also MDEV-26445 that could be of interest, once the fix is available.

Comment by Simon Avery [ 2021-09-24 ]

@Marko Mäkelä - thanks for following up.

This is still an ongoing issue.

Below is a 14-day plot of this string against the same production server as before. This is CentOS 7 and a busy machine.

Aside from the increased logging and marginal overhead associated with that, I have not noticed any other effects, nor have any issues been reported by the users.

We run another 27 MariaDB 10.5.12 servers on C7, but no others show this problem. The MariaDB configurations are not all identical, and the problem server has more RAM (64G) and activity than most.

I've also attached the my.cnf* variables, in case anything odd pops out that might be linked.

Thank you

MariaDB [(none)]> select version();
+---------------------+
| version()           |
+---------------------+
| 10.5.12-MariaDB-log |
+---------------------+

mycnf.txt

Comment by Matthias Leich [ 2021-09-29 ]

I have replayed the problem on a 10.6 tree from mid-September 2021.
After I removed the "Versioning" feature from the test, the problem
disappears totally or at least becomes extremely rare (no replay within ~2000 RQG tests).

Comment by Marko Mäkelä [ 2021-09-29 ]

digdilem, thank you. In mycnf.txt I see that your innodb_log_file_size is only half a gigabyte, which is a fraction of the innodb_buffer_pool_size. That will force very frequent log checkpoints, which in turn will cause stalls. That could actually be the root cause of your messages, if those messages always say "0 pending operations and pending fsync". I do not think that there is any need to execute fsync() or fdatasync() outside log checkpoints. Starting with 10.5, thanks to MDEV-19176, recovery should work fine even if you make the redo log as big as the buffer pool, or possibly even larger.

Another factor is that if innodb_open_files is not specified, open_files_limit will be used instead. That may force frequent closing of data files. In mycnf.txt I see open_files_limit=65535, which may or may not be reasonable. But the innodb_log_file_size is definitely way too small.
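Putting the advice above together, a configuration along these lines would address both points; this is an illustrative my.cnf fragment only, and the values must be sized for the actual host:

```ini
[mariadb]
# Illustrative values only; size these for the actual host.
innodb_buffer_pool_size = 25G
# A redo log comparable in size to the buffer pool avoids very frequent
# log checkpoints (safe on 10.5+ thanks to MDEV-19176).
innodb_log_file_size    = 25G
# Set innodb_open_files explicitly; if unset, open_files_limit is used
# instead, which may force frequent closing of data files.
innodb_open_files       = 65535
```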

Internally, we have repeated something, but only when using system versioned tables and FULLTEXT INDEX. Theoretically, there could be a bug in some system versioning code that forgets to commit a mini-transaction (which would release a page latch). But I think that such a bug should cause more trouble than just these messages. At the very least, it should prevent log checkpoints and cause log file overruns, for which we have a separate message.

Some of the logs from our internal testing show a nonzero "pending operations" count, but we also have logs that exclusively show "0 pending operations and pending fsync". I see that within a second, we can issue the message for several files. For us, they are the numerous internal tables for FULLTEXT INDEX for a single user table. In one log that I checked, all file names would start with the same prefix that includes the numeric main table identifier, such as test/FTS_000000000000125b_. Those messages were almost certainly issued by the LRU logic that is attempting to enforce the ‘open files’ limit in InnoDB.

We should definitely rate-limit that output somehow, even if this run is using a ‘misconfigured’ system. If all files are located in the same journaled file system, all files should be ‘on the same boat’ waiting for the fsync() or fdatasync() to complete.

Comment by Marko Mäkelä [ 2021-09-29 ]

Clarification: We normally execute frequent fdatasync() or fsync() on the InnoDB write-ahead log (ib_logfile0). On data files, I would believe that we only do that on log checkpoint or when extending files.

When using innodb_flush_method=O_DIRECT (which MDEV-24854 finally made the default on 10.6) in combination with innodb_use_native_aio=ON, the fsync() or fdatasync() for data files seems to be a no-op on my development system (ext4fs on NVMe, SSD, or HDD). You should check that the server is actually using native asynchronous I/O (io_submit() and friends) and not simulated I/O. With libaio (the only option on Linux before MDEV-24883 in 10.6 introduced io_uring) it is possible that you must adjust /proc/sys/fs/aio-max-nr so that io_setup() will not fail on startup.
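On Linux, the current AIO cap can be inspected directly; a minimal sketch, assuming a Linux host (the suggested value in the comment is a common example, not a recommendation from this ticket):

```python
# Read the kernel's cap on concurrent AIO requests (Linux only).
# If io_setup() fails at InnoDB startup, the server falls back to
# simulated I/O; raising the limit, e.g. with
#   sysctl -w fs.aio-max-nr=1048576
# is the usual remedy.
with open("/proc/sys/fs/aio-max-nr") as f:
    aio_max_nr = int(f.read().strip())
print("fs.aio-max-nr =", aio_max_nr)
```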

Comment by Marko Mäkelä [ 2021-09-30 ]

I am able to reproduce the excessive logging with the following test:

--source include/have_innodb.inc
--source include/have_partition.inc
--source include/have_sequence.inc
 
CREATE TABLE t1 (pk INT AUTO_INCREMENT PRIMARY KEY, b INT) ENGINE=InnoDB
PARTITION BY key (pk) PARTITIONS 100;
 
INSERT INTO t1 (pk) SELECT * FROM seq_1_to_100;
BEGIN;
let $n=500;
while ($n)
{
eval UPDATE t1 SET b=pk+$n;
dec $n;
}
COMMIT;
DROP TABLE t1;

Edit: And with the following .opt file:

--innodb-open-files=10
--innodb-log-file-size=4m

The messages are displayed by fil_node_open_file() and fil_space_t::try_to_close(). I think that both need to be revised to limit the messages to at most 1 per fil_node_open_file() call, only in the case when the maximum configured number of open files will be exceeded.

Comment by Marko Mäkelä [ 2021-09-30 ]

Here is a little simpler and smaller version of the test.

--source include/have_innodb.inc
--source include/have_partition.inc
--source include/have_sequence.inc
 
CREATE TABLE t1 (pk INT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB
PARTITION BY KEY (pk) PARTITIONS 100;
 
INSERT INTO t1 SELECT * FROM seq_1_to_100;
--disable_query_log
let $n=300;
while ($n)
{
BEGIN; DELETE FROM t1; ROLLBACK;
dec $n;
}
--enable_query_log
DROP TABLE t1;

Comment by Marko Mäkelä [ 2021-09-30 ]

We will display a message for at most one file every 5 seconds.
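The idea behind the fix can be modeled as a simple monotonic-clock throttle; this is only an illustrative sketch (the class and method names are made up), not the actual C++ change in InnoDB:

```python
import time


class ThrottledNote:
    """Emit a note for at most one file every `interval` seconds.

    Illustrative model of the rate-limiting adopted for this fix; the
    real implementation lives in InnoDB's C++ code.
    """

    def __init__(self, interval=5.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last = None  # time of the last emitted note

    def note(self, message):
        """Print `message` only if the interval has elapsed; return
        whether it was emitted."""
        now = self.clock()
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            print(message)
            return True
        return False
```

With a 5-second interval, a burst of hundreds of "Cannot close file" notes collapses to at most one log line every 5 seconds.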

Comment by Simon Avery [ 2021-09-30 ]

@Marko - Thanks for your comments and suggestions on our configuration, notably `innodb_log_file_size` - appreciated.

I've made the change on that specific server (setting it to 25G to match ibps), although as it's not a dynamic variable, I won't get a chance to restart the server to pick up the change for a few days, so I won't immediately be able to report whether it eliminates or reduces these occurrences.

Thanks again.

Comment by Marko Mäkelä [ 2021-09-30 ]

digdilem, you’re welcome. In the MariaDB Enterprise Server starting with version 10.5, innodb_log_file_size can be changed without restart; the log file will be resized on the next checkpoint. But, changing innodb_flush_method would require a server restart.

Comment by Simon Avery [ 2021-09-30 ]

Interesting. Is that dynamic change an Enterprise-only feature?

Trying it in community 10.5.12 (Rocky8/EL8) doesn't seem to work, and the docs state it's not dynamic.

MariaDB [(none)]> set global innodb_log_file_size = 26843545600;
ERROR 1238 (HY000): Variable 'innodb_log_file_size' is a read only variable

Comment by Marko Mäkelä [ 2021-10-01 ]

digdilem, that is right, the Community Server only allows changing innodb_log_file_size in the start-up configuration, to take effect when the server is restarted.

Comment by Simon Avery [ 2021-10-07 ]

Hi Marko.

Update - no change.

I raised innodb_log_file_size to equal innodb_buffer_pool_size. The same server was restarted at 0600 on the 5th and picked up the change;

MariaDB [(none)]> show variables like 'innodb_log_file_size';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| innodb_log_file_size | 26843545600 |
+----------------------+-------------+
1 row in set (0.001 sec)

On checking for the same log, am still seeing high frequencies of this log output in 10.5.12-MariaDB-log

2021-10-06 23:33:29 1369618 [Note] InnoDB: Cannot close file ./DB1/TB1.ibd because of 0 pending operations and pending fsync

So it doesn't look like the frequency is affected by thrashing of the binary log due to a small innodb_log_file_size - 7-day frequency chart attached.

Hope this helps diagnose and fix.

Comment by Marko Mäkelä [ 2021-10-08 ]

digdilem, the output will be rate-limited in the upcoming 10.5.13 and 10.6.5 releases. I meant that the likelihood of seeing any such message at all should be related to the open files limit and the frequency of InnoDB log checkpoints. The bug simply was that we would output long bursts of those messages when the open files limit is being exceeded. The limit may still be exceeded; that cannot be prevented easily. I suppose that it is more likely when running on slower storage or when not using innodb_flush_method=O_DIRECT. This has nothing to do with the binary log; the checkpoints involve only the ib_logfile0 and the persistent InnoDB data files.

Comment by Ralf Becker [ 2021-10-28 ]

I want to raise attention to the combination of this bug with a Galera cluster: most of the time this bug takes out ALL nodes of a Galera cluster.

We had this again this morning, for the 3rd time in 4 weeks, on our geo-distributed 5-node Galera cluster, which we recently updated to 10.5.12.

What it causes on Galera is that all nodes accumulate more and more queries until the cluster becomes completely unresponsive to updating queries. Trying to kill the queries won't work; they are stuck in state "killed" and the situation does not improve.

The only way we have found out of that situation so far is to completely shut down the cluster / all nodes and restart them one by one with an SST.

I'm attaching the log of one of our nodes in case someone wants to check it (I had to X out the database names, as they are customer names).

Is there a date for 10.5.13, or do we need to downgrade to 10.4, to fix the problem?

Ralf

mysql-0-20211028.log

Comment by Ralf Becker [ 2021-10-28 ]

I forgot t mention that we already used innodb_flush_method=O_DIRECT and after the 2nd incident and finding the bug report also set innodb_log_file_size=innodb_buffer_cache_size=8GB.

The above did NOT prevent having the described issue with a Galera cluster!

Ralf

Comment by Gopinath [ 2022-01-11 ]

Hi,

Encountered this issue even with MariaDB 10.5.13 Enterprise Edition.

2022-01-11 8:40:52 0 [Note] InnoDB: 10.5.13 started; log sequence number 361848522653; transaction id 405336
2022-01-11 8:40:52 0 [Note] InnoDB: Creating #1 encryption thread id 140471212390144 total threads 4.
2022-01-11 8:40:52 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2022-01-11 8:40:52 0 [Note] InnoDB: Creating #2 encryption thread id 140471103624960 total threads 4.
2022-01-11 8:40:52 0 [Note] InnoDB: Creating #3 encryption thread id 140471095232256 total threads 4.
2022-01-11 8:40:52 0 [Note] InnoDB: Creating #4 encryption thread id 140471086839552 total threads 4.
2022-01-11 8:40:52 0 [Note] Plugin 'FEEDBACK' is disabled.
2022-01-11 8:40:52 0 [Note] Using encryption key id 2 for temporary files
2022-01-11 8:40:52 0 [Note] InnoDB: Buffer pool(s) load completed at 220111 8:40:52
2022-01-11 8:40:52 0 [Warning] Failed to create a socket for IPv6 '::': errno: 97.
2022-01-11 8:40:52 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-01-11 8:40:52 0 [Warning] 'proxies_priv' entry '@% root@xxxxxxx' ignored in --skip-name-resolve mode.
2022-01-11 8:40:52 0 [Note] Reading of all Master_info entries succeeded
2022-01-11 8:40:52 0 [Note] Added new Master_info '' to hash table
2022-01-11 8:40:52 0 [Note] mysqld: ready for connections.
Version: '10.5.13-9-MariaDB-enterprise-log' socket: '/var/run/mysqld/mysqld.sock' port: 53805 MariaDB Enterprise Server
2022-01-11 8:57:51 21 [Note] Start binlog_dump to slave_server(52396), pos(mysqld-bin.000001, 369), using_gtid(0), gtid('')
2022-01-11 9:10:35 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_avl_hist.ibd because of 206 pending operations and pending fsync
2022-01-11 9:13:40 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_bp_hist.ibd because of pending fsync
2022-01-11 9:14:42 22 [Note] InnoDB: Cannot close file ./ownbfmprd/fallback_stf_stat_availability_office_xs#P#p736782.ibd because of 1 pending operations
2022-01-11 9:23:08 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_crystd_hist.ibd because of pending fsync
2022-01-11 9:25:07 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_ediavl_hist.ibd because of 222 pending operations and pending fsync
2022-01-11 9:27:11 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_edistd_hist.ibd because of pending fsync
2022-01-11 9:27:24 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_ab_targetoffice_mp#P#p738529.ibd because of 1 pending operations
2022-01-11 9:27:33 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_bench_met.ibd because of 1 pending operations
2022-01-11 9:28:39 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_excluded_hist.ibd because of pending fsync
2022-01-11 9:29:35 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_excluded_lfs_hist.ibd because of 146 pending operations and pending fsync
2022-01-11 9:31:17 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_fop_hist.ibd because of pending fsync
2022-01-11 9:32:38 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_fp_hist.ibd because of pending fsync
2022-01-11 10:57:54 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_mp_hist.ibd because of pending fsync

Comment by Marko Mäkelä [ 2022-01-11 ]

gopinathkarangula, what you encountered is not the same as the originally reported issue, where the server error log was flooded with such messages. The rate-limiting seems to work for you. Please file a new ticket for this problem if you think that the configuration is reasonable.

ralfbecker, by innodb_buffer_cache_size I assume that you mean innodb_buffer_pool_size. Does 10.5.13 or 10.6.5 work for you?

Comment by Ralf Becker [ 2022-01-11 ]

@Marko Mäkelä: yes, innodb_buffer_pool_size, but it seems to have had no effect on the issue we had.
Since the update to 10.5.13 it has never appeared again - so far.

Ralf

Generated at Thu Feb 08 09:36:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.