[MDEV-25215] Excessive logging "Cannot close file ./test/FTS_xxx.ibd because of n pending operations and pending fsync Created: 2021-03-22 Updated: 2023-03-10 Resolved: 2021-09-30 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5, 10.6 |
| Fix Version/s: | 10.5.13, 10.6.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | regression-10.5 | ||
| Attachments: |
|
||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Description |
|
Within ~1 min test on current 10.6 with default log_warnings level I'm getting in the error log ~8500 records like this:
10.5 seems to produce only those with "0 pending operations", so there are less of them, but still, ~4000 in a few-minute test. And these ones (with zero) are especially confusing. I think if these are real problems which a user needs to take care of, there should be warnings or errors (and given the persistence, maybe we are missing at least a debug assertion or something); and if these are just technical notifications, there shouldn't be so many of them. |
| Comments |
| Comment by Simon Avery [ 2021-08-10 ] | |||||||||||||||||||
|
We encountered this bug today on a busy production server and researching brought us here. Mariadb: Server version: 10.5.11 (From Maria's C7 repo) 2021-08-09 22:49:11 5443978 [Note] InnoDB: Cannot close file ./DB0/Tab1.ibd because of 20 pending operations and pending fsync Variations of this, affecting multiple databases and tables. Total number of similar log entries is 200149 during this period. Restarting Maria appears to have (temporarily) resolved the issue. | |||||||||||||||||||
| Comment by Simon Avery [ 2021-08-18 ] | |||||||||||||||||||
|
Still ongoing for us. There's some suggestion that it's linked to some queries that aren't completing, although I can't confirm that definitely. To give some indication of the sheer amount of lines in the log this is creating, this is the counts from one server over 14 days. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-08-30 ] | |||||||||||||||||||
|
digdilem, do you see this with MariaDB 10.5.12? | |||||||||||||||||||
| Comment by Simon Avery [ 2021-08-31 ] | |||||||||||||||||||
|
Hi Marko Maybe, maybe not... I have two machines reporting this. Server 1 is on 10.5.9, so let's not focus on that. That's still reporting it occasionally, but not in volume. Server 2 was upgraded from 10.5.11 to 10.5.12 (from Maria.org's Centos7/EL7 repos) on 3rd August (just before my last report). It's still reporting occasion output like so; 2021-08-21 21:39:37 4117695 [Note] InnoDB: Cannot close file ./Db1/Tb1.ibd because of 0 pending operations and pending fsync However, nothing more recent than this - so that's 10 days clear. I'll continue to monitor and should it resurface on anything 10.5.12 onwards, I'll update this. Thanks all, | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-14 ] | |||||||||||||||||||
|
elenst, can you please check if you can yourself reproduce this with recent 10.5 or 10.6 trees? I see that we had an internal discussion on this on August 30, and mleich only found a very small number of occurrences of this (I guess, from runs with rr record, which would at times create very ‘unfair’ scheduling of threads). He conducted some tests on HDD in order to get more realistic I/O performance and was not able to reproduce massive amounts of this. | |||||||||||||||||||
| Comment by Elena Stepanova [ 2021-09-14 ] | |||||||||||||||||||
|
No, sorry, I can't, not in any near future. I assume if Matthias was able to reproduce them at all, he can take one of versions for which the complaints were raised and run the tests on a normal non-rr release-like build. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-23 ] | |||||||||||||||||||
|
digdilem, how does it look like after more time running MariaDB 10.5.12? Apart from | |||||||||||||||||||
| Comment by Simon Avery [ 2021-09-24 ] | |||||||||||||||||||
|
@Marko Mäkelä - thanks for following up. This is still an ongoing issue. Follows a 14 day plot for this string against the same production server as before. This is Centos 7 and a busy machine. Aside from the increased logging and marginal overhead associated with that, I have not noticed any other effects, nor have any issues been reported by the users. We run another 27 MariaDb 10.5.12 servers on C7, but no others show this problem. The Maria configurations are not all identical, and the problem server has more ram (64G) and activity than most. I've also attached the my.cnf* variables, in case anything odd pops out that might be linked. Thank you MariaDB [(none)]> select version();
---------------------
--------------------- | |||||||||||||||||||
| Comment by Matthias Leich [ 2021-09-29 ] | |||||||||||||||||||
|
I have replayed the problem on some tree 10.6 mid 2021-09. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-29 ] | |||||||||||||||||||
|
digdilem, thank you. In mycnf.txt Another factor is that if innodb_open_files is not specified, open_files_limit will be used instead. That may force frequent closing of data files. In mycnf.txt Internally, we have repeated something, but only when using system versioned tables and FULLTEXT INDEX. Theoretically, there could be a bug in some system versioning code that forgets to commit a mini-transaction (which would release a page latch). But I think that such a bug should cause more trouble than just these messages. At the very least, it should prevent log checkpoints and cause log file overruns, for which we have a separate message. Some of the logs from our internal testing show a nonzero "pending operations" count, but we also have logs that exclusively show "0 pending operations and pending fsync". I see that within a second, we can issue the message for several files. For us, they are the numerous internal tables for FULLTEXT INDEX for a single user table. In one log that I checked, all file names would start with the same prefix that includes the numeric main table identifier, such as test/FTS_000000000000125b_. Those messages were almost certainly issued by the LRU logic that is attempting to enforce the ‘open files’ limit in InnoDB. We should definitely rate-limit that output somehow, even if this run is using a ‘misconfigured’ system. If all files are located in the same journaled file system, all files should be ‘on the same boat’ waiting for the fsync() or fdatasync() to complete. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-29 ] | |||||||||||||||||||
|
Clarification: We normally execute frequent fdatasync() or fsync() on the InnoDB write-ahead log (ib_logfile0). On data files, I would believe that we only do that on log checkpoint or when extending files. When using innodb_flush_method=O_DIRECT (which | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-30 ] | |||||||||||||||||||
|
I am able to reproduce the excessive logging with the following test:
Edit: And with the following .opt file:
The messages are displayed by fil_node_open_file() and fil_space_t::try_to_close(). I think that both need to be revised to limit the messages to at most 1 per fil_node_open_file() call, only in the case when the maximum configured number of open files will be exceeded. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-30 ] | |||||||||||||||||||
|
Here is a little simpler and smaller version of the test.
| |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-30 ] | |||||||||||||||||||
|
We will display a message for at most one file every 5 seconds. | |||||||||||||||||||
| Comment by Simon Avery [ 2021-09-30 ] | |||||||||||||||||||
|
@Marko - Thanks for your comments and suggestions on our configuration, notably `innodb_log_file_size` - appreciated. I've made the change to that specific server (setting it to 25G to match ibps), although as it's not a dynamic variable I won't get a chance to restart the server to pick up the change for a few days so won't immediately be able to report whether it eliminates or reduces these occurances. Thanks again. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-30 ] | |||||||||||||||||||
|
digdilem, you’re welcome. In the MariaDB Enterprise Server starting with version 10.5, innodb_log_file_size can be changed without restart; the log file will be resized on the next checkpoint. But, changing innodb_flush_method would require a server restart. | |||||||||||||||||||
| Comment by Simon Avery [ 2021-09-30 ] | |||||||||||||||||||
|
Interesting. Is that dynamic change an Enterprise-only feature? On trying in community 10.5.12 (Rocky8/EL8) doesn't seem to fly, and the docs state it's not dynamic. {{MariaDB [(none)]> set global innodb_log_file_size = 26843545600; | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-10-01 ] | |||||||||||||||||||
|
digdilem, that is right, the Community Server only allows changing innodb_log_file_size in the start-up configuration, to take effect when the server is restarted. | |||||||||||||||||||
| Comment by Simon Avery [ 2021-10-07 ] | |||||||||||||||||||
|
HI Marko. Update - no change. I raised innodb_log_file_size to equal innodb_buffer_pool_size. The same server was restarted at 0600 on the 5th and picked up the change; {{MariaDB [(none)]> show variables like 'innodb_log_file_size';
---------------------
--------------------- On checking for the same log, am still seeing high frequencies of this log output in 10.5.12-MariaDB-log 2021-10-06 23:33:29 1369618 [Note] InnoDB: Cannot close file ./DB1/TB1.ibd because of 0 pending operations and pending fsync So it doesn't look like frequency is affected by thrashing of the binary log due to a small innodb_log_file_size - attached 7 day frequency chart. Hope this helps diagnose and fix. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-10-08 ] | |||||||||||||||||||
|
digdilem, the output will be rate-limited in the upcoming 10.5.13 and 10.6.5 releases. I meant that the likelihood of seeing any such message at all should be related to the open files limit and the frequency of InnoDB log checkpoints. The bug simply was that we would output long bursts of those messages when the open files limit is being exceeded. The limit may still be exceeded; that cannot be prevented easily. I suppose that it is more likely when running on slower storage or when not using innodb_flush_method=O_DIRECT. This has nothing to do with the binary log; the checkpoints involve only the ib_logfile0 and the persistent InnoDB data files. | |||||||||||||||||||
| Comment by Ralf Becker [ 2021-10-28 ] | |||||||||||||||||||
|
I want to raise attention of the combination of this bug with a Galera cluster: most of the time this bug takes out ALL nodes of a Galera cluster We had this on our geo-distributed 5 node Galera cluster which we recently updated to 10.5.12 this morning for the 3rd time in 4 weeks. What it causes on Galera is, that all nodes accumulate more and more queries until the cluster becomes completely unresponsive for updating queries. Trying to kill the queries wont work, they are stuck in state "killed" and the situation does not improve. Only way we found so far out of that situation is to completely shut down the cluster / all nodes and restart them one by one with an SST. I'm attaching the log of one of our nodes, in case someone want's to check it (I had to X out the database names, as they are customer names). Is there a date for 10.5.13, or do we need to downgrade to 10.4, to fix the problem? Ralf | |||||||||||||||||||
| Comment by Ralf Becker [ 2021-10-28 ] | |||||||||||||||||||
|
I forgot t mention that we already used innodb_flush_method=O_DIRECT and after the 2nd incident and finding the bug report also set innodb_log_file_size=innodb_buffer_cache_size=8GB. The above did NOT prevent having the described issue with a Galera cluster! Ralf | |||||||||||||||||||
| Comment by Gopinath [ 2022-01-11 ] | |||||||||||||||||||
|
Hi, Encountered this issue even with MariaDB 10.5.13 Enterprise Edition. 2022-01-11 8:40:52 0 [Note] InnoDB: 10.5.13 started; log sequence number 361848522653; transaction id 405336 | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-01-11 ] | |||||||||||||||||||
|
gopinathkarangula, what you encountered is not the same as the originally reported issue, where the server error log was flooded with such messages. The rate-limiting seems to work for you. Please file a new ticket for this problem if you think that the configuration is reasonable. ralfbecker, by innodb_buffer_cache_size I assume that you mean innodb_buffer_pool_size. Does 10.5.13 or 10.6.5 work for you? | |||||||||||||||||||
| Comment by Ralf Becker [ 2022-01-11 ] | |||||||||||||||||||
|
@Marko Mäkelä: yes, innodb_buffer_pool_size, but it seem to have no effect on the issue we had. Ralf |