[MDEV-27530] InnoDB - Performance issues after upgrade 10.4.22 to 10.5.13 Created: 2022-01-17  Updated: 2023-01-12  Resolved: 2022-10-30

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.13
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Eugene Assignee: Unassigned
Resolution: Incomplete Votes: 5
Labels: None
Environment:

Linux 5.10.91-gentoo #1 SMP x86_64 AMD EPYC 7451 AuthenticAMD GNU/Linux
Dedicated raid1 nvme0n1p1[0] nvme1n1p1[2] for /var/lib/mysql directory
256G RAM
galera (26.4.10) cluster configured, but only one node was started after upgrade.


Attachments: PNG File galera_upgrade_fail_combined_chart.png     PNG File screenshot-1.png    
Issue Links:
PartOf
includes MDEV-23136 InnoDB init fail after upgrade from 1... Closed
Relates
relates to MDEV-25020 SELECT if there is IN clause with bin... Closed
relates to MDEV-26445 innodb_undo_log_truncate is unnecessa... Closed
relates to MDEV-28518 After update to 10.5 a lot of time is... Closed
relates to MDEV-30390 MariaDB 10.5 gets stuck on "Too many ... Closed

 Description   

An attempt to upgrade from 10.4.22 to 10.5.13 led to severe performance issues and ended with the upgrade being rolled back and the whole dataset restored from backup.

MariaDB configuration parameters:

sync_binlog                             = 0
binlog_cache_size                       = 32M
default_storage_engine                  = innodb
sort_buffer_size                        = 4M
read_rnd_buffer_size                    = 4M
table_open_cache                        = 120000
table_open_cache_instances              = 16
table_definition_cache                  = 1800000
key_buffer_size                         = 256M
query_cache_size                        = 0
query_cache_limit                       = 0
query_cache_type                        = 0
innodb_buffer_pool_size                 = 64G
innodb_log_buffer_size                  = 512M
innodb_log_file_size                    = 8G
innodb_autoinc_lock_mode                = 2
innodb_lock_wait_timeout                = 20
innodb_numa_interleave                  = 1
innodb_flush_log_at_trx_commit          = 2
innodb_io_capacity_max                  = 90000
innodb_io_capacity                      = 90000
innodb-print-all-deadlocks              = ON
innodb_flush_neighbors                  = 0
innodb_use_native_aio = 0
innodb_change_buffering = 'none'
wsrep_on

Parameters changed due to upgrade (for incompatible changes mentioned in release notes):

31c31
< innodb_adaptive_hash_index = 0
---
> #deprecated in 10.5 innodb_adaptive_hash_index = 0
44,45c44,45
< innodb_read_io_threads = 16
< innodb_write_io_threads = 16
---
> #broken 10.5 innodb_read_io_threads = 16
> #broken 10.5 innodb_write_io_threads = 16
47c47
< innodb_buffer_pool_instances = 32
---
> #deprecated in 10.5 innodb_buffer_pool_instances = 32
51c51
< innodb_log_optimize_ddl = 0
---
> #deprecated in 10.5 innodb_log_optimize_ddl = 0

The issue: InnoDB no longer supports multiple buffer pool instances

innodb_buffer_pool_instances is now a deprecated option. This was known and announced, but the effect of violating the rule innodb_read_io_threads + innodb_write_io_threads = innodb_buffer_pool_instances is dramatic: MariaDB was no longer able to handle the load it used to.
The defaults for innodb_read_io_threads and innodb_write_io_threads are 4 + 4, while there is now only one buffer pool. Even limiting them to 1 + 1 does not solve the issue. This causes numerous cross-query blocks, since the pool instance is always single, even though multiple instances were introduced precisely as a performance measure.

Unfortunately, there is no replacement parameter that could be adjusted to overcome the issue.

The server was not able to handle the desired load of parallel writes, while the same server easily handled the same load before the upgrade and after the changes were reverted.

  • The number of threads running (normally around 7-10) jumped to 100 and sometimes reached 250; the number of threads connected (normally within 60-100) jumped to 300-450.
  • The rate of SELECT queries remained the same (~120 qps), but the rate of INSERT/UPDATE queries dropped from 100 to 20 qps.
  • The number of active queries in the processlist grew from 7-10 (with a longest query time of 40 seconds) to 30-50, with the longest query running about 900 seconds. The reason for the latter was that queries timed out (increasing lock wait timeouts and session timeouts from 20 to 45 had no positive effect), and rollback took ages because the lock could not be obtained.
  • Normally, mariadb fills more than one table cache instance of 120K open tables. After the upgrade, even one table cache instance was not utilized completely (it reached only 110K open tables).
  • Normally, mariadb 10.4 was able to perform frequent or even parallel inserts into the same table (tables up to 100GB in size). MariaDB 10.5 was not able to perform frequent (every 3 seconds) writes into a small table (17MB) even with no parallel writes (queries got stuck, timed out, and one was in trx_state : ROLLING BACK for 860 seconds!)
  • The SHOW ENGINE INNODB STATUS output is full of messages like the following one:

    --Thread 140045441414720 has waited at ha_innodb.cc line 14299 for 0.00 seconds the semaphore:
    Mutex at 0x55c00ead3340, Mutex DICT_SYS created /var/tmp/portage/dev-db/mariadb-10.5.13/work/mysql/storage/innobase/dict/dict0dict.cc:1027, lock
    

    however, this seems to have no relation to the issue.

Studying the innodb-system-variables list for new or not-yet-configured parameters shows that there are lots of deprecations, but no adjustment for handling high parallelism anymore.

The question is: are there any (other) replacement settings to overcome the issue caused by the fixed single buffer pool instance?



 Comments   
Comment by Vladislav Vaintroub [ 2022-01-17 ]

Why is innodb_xxx_io_threads "broken"?

> #broken 10.5 innodb_read_io_threads = 16
> #broken 10.5 innodb_write_io_threads = 16

I do not think it is anything that is either deprecated or supposed to be broken. It has a slightly different meaning now, so what? There will be 16 threads handling I/O completions at the same time, if I/O completions can saturate 16 threads.

Also, I think you are concentrating too much on what you believe was the culprit: the single buffer pool instance. As far as I remember it fared well, better than 10.4 in most if not all performance tests, including specifically those with multiple parallel updates; MDEV-15058 has the discussions, data, and attachments. Perhaps MDEV-15058 is not the issue you are looking for, and perhaps the DICT_SYS mutex you mention in passing is indeed the issue. Before jumping to conclusions, one would have to understand what the server is doing.

"innodb_read_io_threads+innodb_write_io_threads=innodb_buffer_pool_instances" rule does not really make any sense. There should be as many IO threads as there is sufficient to finish IO completion fast, possibly without disturbing the "foreground threads", handling users connections. It has, or had nothing to do with buffer pool instances. innodb_page_cleaners had something to do with buffer pool instance count.

Comment by Eugene [ 2022-01-17 ]
  • Why is innodb_xxx_io_threads "broken"?
    It is commented as "broken" because the value of these variables used to be coordinated with the value of another variable that is now deprecated. Anyway, this is just a comment noting that the parameter is removed (read: the default is used) in the new configuration; it does not matter how the option is marked, it is simply commented out in the configuration file.
  • "innodb_read_io_threads+innodb_write_io_threads=innodb_buffer_pool_instances" rule does not really make any sense...
    You probably know this better, and the experience described may be wrong, but the rule at least made sense in 10.4. It was found by chance in the MySQL (yes, not even MariaDB) documentation, referenced from a third-party site (liquidweb, for example). It is a pity such an important thing is not explained and not referenced from the MariaDB variable list directly, but applying this rule really did have an effect; this was tested several times. Perhaps it was a side effect, but the effect does exist. It was not checked with benchmarks, it was tested in production, believe it or not. The reason is that no performance issue occurs when every pool instance is accessed by one I/O thread at a time. When several threads access a single instance, they simply block each other.
  • The culprit can be anything, not just the removed innodb_buffer_pool_instances.
    But the problem is that this group of parameters (pool instances and I/O threads) is the only one changed between the 10.4 and 10.5 configurations we run. Moreover, the server was initially started with just innodb_buffer_pool_instances commented out; when the io_threads parameters were also commented out of the configuration, performance improved slightly but was still far worse than on 10.4.
  • one would have to understand what the server is doing
    This is a good point, by the way. One could also advise on how to achieve this.
    So, any help or advice is appreciated.
Comment by Eugene [ 2022-01-18 ]

The attached chart illustrates what was written already.
As one can see, before the compilation MariaDB 10.4.22 was running and easily able to handle the load.
Upgrading the server with nothing but the deprecated parameters removed from the configuration rendered it unusable.
The chart has some comments, but please also note the following points:

  • 100% utilization of one CPU core after the upgrade: this means that mariadb 10.5.13 was in fact running on a single thread. It looks like the main thread was blocked for some reason. Previously this could be seen when the total io_threads count was higher than the number of buffer pool instances, so with the removal of the latter the same situation seems to be in effect again; note that this relates exactly to the deprecated parameter. Is this really just a coincidence?
  • decreased I/O activity, as the I/O threads are not doing anything while blocking each other
  • an increased number of open files, meaning tables are opened for processing a query, but the queries are stuck, so the tables are not closed
  • The threads chart requires additional explanation. After restarting into mariadb-10.5.13, every connected thread was in the running state. After commenting out the "io_threads" options the situation improved and some connected but not running threads appeared (meaning they had processed at least something and were not just waiting). However, under increased load (we were not able to reach full load, but increased it to approximately 30% of what 10.4.22 handled) the number of running threads went abnormally high, as did the number of open transactions, so we had to abandon the upgrade and revert to 10.4.22.
  • Note the longest query duration! This was the result of a simple `INSERT INTO` query timing out, while four other queries dealing with the same table also timed out as the first one was waiting for its transaction to roll back.

Comment by Marko Mäkelä [ 2022-02-21 ]

Have you tried a larger innodb_log_file_size? It is currently only 1/8 of the buffer pool size. Using a larger redo log file would make log checkpoints (and the related page writes) much less frequent.

You might also try tweaking the innodb_lru_scan_depth and innodb_lru_flush_size if the working set does not fit in the buffer pool.
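In configuration terms, these two suggestions would look roughly like the sketch below. The values are illustrative assumptions, not recommendations; the 10.5 defaults for innodb_lru_scan_depth and innodb_lru_flush_size are, I believe, 1536 and 32.

```
# my.cnf sketch of the suggestions above (values are assumptions)
innodb_log_file_size   = 32G    # was 8G; buffer pool is 64G
innodb_lru_scan_depth  = 2048   # hypothetical tuning value (default 1536)
innodb_lru_flush_size  = 64     # hypothetical tuning value (default 32)
```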

Comment by Eugene [ 2022-02-21 ]

Thank you for the advice, Marko!

I will try increasing innodb_log_file_size and tuning the innodb_lru* variables on the next attempt. Unfortunately, there is no chance of fitting the working set into memory; it is several times bigger. Most probably the opportunity to test this will appear in a couple of weeks; I will let you know the result.

Comment by Marko Mäkelä [ 2022-03-30 ]

Could this be a duplicate of MDEV-27461? It could have been caused by MDEV-24278 in 10.5.9.

Comment by Eugene [ 2022-05-03 ]

Some more details found on this.
The issue seems to be not just about threads but also about table cache handling.
On 10.4 we used to have the following settings:

table_open_cache			= 120000
table_open_cache_instances              = 16
table_definition_cache                  = 1800000

The total number of files in the dataset is over 4 million, but the number of open files was in fact only slightly over 240000 (two instances used).
Previously we had table_open_cache and table_definition_cache both set to the same value of 1800000. However, on some operations (for example, performing backups with mysqldump) even this huge cache was not enough, and flushing tables to open new ones caused noticeable performance issues, so table_open_cache was shrunk.
After the upgrade to 10.5, performance with the same table cache settings (table_open_cache = 120000 and table_open_cache_instances = 16) is far worse, as reported above. However, with table_open_cache set to 1800000 performance is much better in the beginning, but it quickly turns bad again when the Open_tables value reaches about 800000 and the number of open files about 1000000.
Settings for open files: open_files_limit is 10000000; the system limits are set to a maximum of 10240000.

So the issue seems to lie in the interplay of the number of I/O threads for reading and writing, the number of open files, and the size of the table cache. With a small cache the issue is caused by flushing tables and opening new ones; with a big one, by handling a large set of open files. An interesting point is that this issue was never this bad with 10.4, when it was possible to manually adjust the number of threads.

Comment by Marko Mäkelä [ 2022-05-04 ]

euglorg, thank you for your update. The table_open_cache and table_definition_cache are only loosely related to InnoDB. My understanding is that table_open_cache controls the caching of open table handles (objects of the abstract class handler) for quick reuse. That is, if multiple SQL statements access the same InnoDB table in succession, an already created ha_innobase object could be reused. If I understand correctly, both caches should be emptied by a FLUSH TABLES command. The impact should be that on subsequent access, the tablename.frm file will be loaded (to add a TABLE_SHARE to table_definition_cache) and the table will be opened in InnoDB as well. The InnoDB internal data dictionary cache will not be directly affected by FLUSH TABLES.

There is a LRU eviction mechanism in InnoDB. Each partition, subpartition, or unpartitioned table counts as a separate dict_table_t object in InnoDB. The function dict_make_room_in_cache() will be periodically invoked to try to keep at most table_definition_cache objects allocated inside the InnoDB dict_sys cache. If any handle to a table is open (see table_open_cache), the table definition cannot be evicted from InnoDB. Reloading table definitions to InnoDB should involve accessing InnoDB internal data dictionary tables. I could imagine that on a busy server, those pages would typically not be cached in the buffer pool.

Reloading table definitions should only ‘punish’ the execution of SQL statements that need to access some affected tables. If you are observing reduced throughput for all queries across the server, there is a more likely explanation.
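The cache interactions described above can be observed with standard status variables (a sketch; these counters are server-wide):

```sql
-- Watch the table caches around a FLUSH TABLES (sketch)
SHOW GLOBAL STATUS LIKE 'Open_tables';            -- cached open handles
SHOW GLOBAL STATUS LIKE 'Open_table_definitions'; -- cached TABLE_SHAREs
SHOW GLOBAL STATUS LIKE 'Opened_tables';          -- cumulative cache misses
FLUSH TABLES;  -- empties both caches; next access reloads the .frm and reopens in InnoDB
SHOW GLOBAL STATUS LIKE 'Open_tables';            -- should have dropped
```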

There is one more mechanism that I think is more likely to cause the regression that you observe. There is some logic like this at InnoDB startup, inherited from MySQL 5.6.6:

	if (innobase_open_files < 10) {
		innobase_open_files = 300;
		if (srv_file_per_table && tc_size > 300 && tc_size < open_files_limit) {
			innobase_open_files = tc_size;
		}
	}

The commit message by svoj mentions WL#6216, but that ticket is private. Anyway, you can see that unless you specify innodb_open_files to be larger than 10 (its default value is 0), it will be auto-adjusted to table_open_cache (tc_size), provided that its value is between 300 and open_files_limit. This feels rather arbitrary to me, especially given the fact that each partitioned table likely counts as one in the table caches outside storage engines, and as several tables or files inside storage engines.
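The auto-adjustment in that startup snippet can be mirrored in a few lines (a hypothetical sketch, not InnoDB code; the parameter names follow the C++ above):

```python
def effective_innodb_open_files(innodb_open_files: int,
                                table_open_cache: int,
                                open_files_limit: int,
                                file_per_table: bool = True) -> int:
    """Mirror of the quoted InnoDB startup logic (sketch, not actual code)."""
    if innodb_open_files < 10:  # the default of 0 always takes this branch
        innodb_open_files = 300
        if file_per_table and 300 < table_open_cache < open_files_limit:
            innodb_open_files = table_open_cache
    return innodb_open_files

# With the reporter's settings, the default innodb_open_files=0 silently
# becomes table_open_cache:
print(effective_innodb_open_files(0, 120000, 10000000))  # -> 120000
```

So with table_open_cache=120000 the InnoDB file limit ends up at 120000, and the later table_open_cache=3300000 would be inherited the same way unless innodb_open_files is set explicitly.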

In any case, this logic was not changed recently. What was changed in MariaDB Server 10.5 is the way modified pages are written back to data files. I would recommend checking MDEV-24949 and MDEV-27295. Enforcing the InnoDB open files limit may seriously interfere with page flushing, because closing a data file handle requires waiting for pending data page writes and fdatasync() or fsync() operations to complete, and such operations must be invoked as part of a log checkpoint. You may want to make innodb_log_file_size much larger (it is only 1/8 of your buffer pool size!) to reduce the need for checkpoints. You might also want to enable ‘idle’ or background page flushing, to avoid I/O stalls during ‘furious flushing’. Your configured innodb_io_capacity=90000 feels large to me. Is the raw write speed of your storage really some gigabytes per second?

Comment by Ludovic Monnier [ 2022-05-27 ]

We observed the same issue since MariaDB 10.5, using a CREATE TABLE Test SELECT ... FROM ... GROUP BY ... statement and even a plain SELECT ... FROM ... GROUP BY ... statement.

Performance is very poor with MariaDB 10.5 and later!

I checked the same query with the same data on a few versions of MariaDB:

#Version: 10.3.34-MariaDB-0+deb10u1-log (Debian 10).

  1. Query_time: 724.197647 Lock_time: 0.037806 Rows_sent: 0 Rows_examined: 82160658

#Version: 10.4.24-MariaDB-1:10.4.24+maria~buster-log (mariadb.org binary distribution), second-to-last 10.4

  1. Query_time: 564.029273 Lock_time: 0.044647 Rows_sent: 0 Rows_examined: 82160658

#Version: 10.4.25-MariaDB-1:10.4.25+maria~buster-log (mariadb.org binary distribution)

  1. Query_time: 546.300378 Lock_time: 0.123554 Rows_sent: 0 Rows_examined: 82160658

With 10.5.*, 10.6.* and 10.7.*:

  1. Query_time: 2116.641373 Lock_time: 0.024204 Rows_sent: 0 Rows_examined: 82160658
    => 35 minutes

Since 10.5.* we can observe a large temporary .MAI file during the query; this file was nearly empty (only a few KB) with MariaDB 10.4 and earlier.
Since 10.5.* we can also observe INNODB_BUFFER_POOL_PAGES_FREE going down to zero.

Comment by Eugene [ 2022-06-08 ]

It seems that the workaround for the problem is to let mariadbd keep all the files of the dataset open at the same time.
For example, performance became much better in our setup with the following settings:

open_files_limit                        = 10000000
table_open_cache                        = 3300000
table_definition_cache                  = 3300000
innodb_open_files                       = 3300000

with the corresponding system open-files limit for the process. In our setup this resulted in 2.7 million open files for the mariadbd process. As you can guess, this corresponds to the number of tables in our setup, so all the .ibd files are open. This is valid for MariaDB 10.5 and 10.6. With such an amount of open files, performance became much better, almost the same as it used to be with 10.4.
Just to compare: the same dataset had 240K open files (10 times fewer) while running on MariaDB 10.4.

I also checked what happens inside MariaDB while adjusting the settings. Everything runs quickly as long as another table can be opened without closing an already open one. Once a limit is hit (any of the variables listed above that cap the number of open files), a table has to be flushed and closed. This process is very slow and locks the thread; thus, while one table is being closed, no operation can be performed on another one. On 10.4 this did not happen: there were several I/O reading and writing threads operating on different parts of the buffer pool. On 10.5 and newer there is just one big pool, and once the thread operating on it is locked, nothing can be done until that lock is released.

Anyway: upgrading to 10.5 or newer required having all the tables open at the same time. Thus startup, shutdown and any file-related operations are slow due to the huge number of open files. It is hard to imagine how this would perform on a setup with 100M tables and intensive writing, as the first table to be flushed will freeze the daemon. Moreover, on a Galera cluster this triggers flow control, pauses replication and can stall the whole cluster (it can easily be simulated by flushing the table cache manually, but after that you will either face a long downtime or have to kill the daemon to unblock the other nodes and perform SST to the node you experimented on).
So, the removal of multiple buffer pool instances translated into a huge number of open files and the full dataset being open at the same time; otherwise performance is very, very poor...
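To size these limits, one first needs the tablespace file count. A possible way to obtain it (a sketch; the datadir path and the 20% headroom factor are assumptions):

```python
# Count InnoDB tablespace (.ibd) files to size innodb_open_files /
# table_open_cache so the whole dataset can stay open. Sketch only;
# the datadir path and the headroom factor are assumptions.
import os
from pathlib import Path

def count_ibd_files(datadir: str) -> int:
    """One .ibd per table or partition when innodb_file_per_table=ON."""
    return sum(1 for _ in Path(datadir).rglob("*.ibd"))

def suggested_open_files(n_ibd: int) -> int:
    """Add ~20% headroom for redo logs, binlogs, temp files and sockets."""
    return n_ibd + n_ibd // 5

datadir = "/var/lib/mysql"  # assumption: the reporter's datadir
if os.path.isdir(datadir):
    n = count_ibd_files(datadir)
    print(f"{n} tablespace files; suggested limit: {suggested_open_files(n)}")
```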

Comment by Marko Mäkelä [ 2022-08-01 ]

euglorg, did you try the advice that was suggested in MDEV-27295?

If you want background flushing to be as eager as it was in 10.5.7 and 10.5.8, I believe that the following should come close:

SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.001;

ludovic.monnier, I do not think that much has changed with respect to the Aria storage engine. Your observation could be explained by the change of some query plan, or what type of statistics are being collected and when. I do not know code outside InnoDB very well, but I would think that the Aria engine may be used as a temporary table during query execution, to compute some joins or ORDER BY or GROUP BY. Can you please file a separate bug for that, with enough detail so that the regression can be reproduced and fixed?

Comment by Joris de Leeuw [ 2022-09-22 ]

This issue seems to be fixed by MDEV-25020. I'm currently testing it on one of our systems to be sure.

Comment by Ludovic Monnier [ 2022-09-26 ]

Hi,

I tried updating from 10.3.34 to 10.5.17, but I still have the same problem with the SELECT ... FROM ... GROUP BY ... syntax.

Version 10.3.34 is much faster than 10.5.*:

600 seconds vs 1972.
I observe that with 10.5.* and later, a large .MAI file is created in the temporary directory.

Using version 10.3.34:

  1. Time: 220926 15:07:45
  2. User@Host: exploit[exploit] @ sfr-infra-bastion-1.sfr.mtg [10.20.2.5]
  3. Thread_id: 38 Schema: QC_hit: No
  4. Query_time: 595.161251 Lock_time: 0.059500 Rows_sent: 0 Rows_examined: 82160658
  5. Rows_affected: 1274204 Bytes_sent: 59
  6. Tmp_tables: 2 Tmp_disk_tables: 1 Tmp_table_sizes: 292143104
  7. Full_scan: Yes Full_join: No Tmp_table: Yes Tmp_table_on_disk: Yes
  8. Filesort: Yes Filesort_on_disk: No Merge_passes: 25 Priority_queue: No
    #
  9. explain: id select_type table type possible_keys key key_len ref rows r_rows filtered r_filtered Extra
  10. explain: 1 SIMPLE T ALL NULL NULL NULL NULL 74632078 79612250.00 100.00 100.00 Using temporary; Using filesort

Using version 10.5.17 :

  1. Time: 220926 16:29:36
  2. User@Host: exploit[exploit] @ sfr-infra-bastion-1.sfr.mtg [10.20.2.5]
  3. Thread_id: 31 Schema: QC_hit: No
  4. Query_time: 1972.255781 Lock_time: 0.107774 Rows_sent: 0 Rows_examined: 82160658
  5. Rows_affected: 1274204 Bytes_sent: 59
  6. Tmp_tables: 2 Tmp_disk_tables: 1 Tmp_table_sizes: 582377472
  7. Full_scan: Yes Full_join: No Tmp_table: Yes Tmp_table_on_disk: Yes
  8. Filesort: Yes Filesort_on_disk: No Merge_passes: 5 Priority_queue: No
    #
  9. explain: id select_type table type possible_keys key key_len ref rows r_rows filtered r_filtered Extra
  10. explain: 1 SIMPLE T ALL NULL NULL NULL NULL 74632078 79612250.00 100.00 100.00 Using temporary; Using filesort
Comment by Joris de Leeuw [ 2022-09-26 ]

@Ludovic Monnier. Do you see any positive difference between 10.5.16 and 10.5.17?

Is there any way to reproduce your issue? It seems you might have a different issue.

With the example Sander Hoentjen shared in MDEV-28518 we see very positive differences between 10.5.16 and 10.5.17.
The query that first took 59.3507 seconds now finishes in just 0.1273 seconds.

Comment by Ludovic Monnier [ 2022-09-26 ]

Hi,

No difference between earlier 10.5.* releases and 10.5.17.
Timings are the same for all versions since 10.5.* (they are much faster on versions before 10.5).

It is difficult to reproduce the issue because the table is large (82,160,658 rows).
The query is a "select ... group by" on this table using many columns.

I'll try to send you a sample database to reproduce it.

Comment by Roel Van de Paar [ 2022-09-29 ]

I noticed especially this difference before/after, with the 'after' looking as if some sort of limit was being hit.

Joris, how did your MDEV-25020 test go? Thank you

Comment by Eugene [ 2022-09-29 ]

This was most probably one of those limits I mentioned in an earlier comment:

open_files_limit                        = 10000000
table_open_cache                        = 3300000
table_definition_cache                  = 3300000
innodb_open_files                       = 3300000

Once `innodb_open_files` was changed, the situation dramatically improved. While it remained at its default, adjusting table_open_cache/table_definition_cache had no positive effect.
The performance killer is in fact the file closing process. Once you remove the need to close files, performance improves and becomes almost the same as it used to be with 10.4. Thus, having a huge cache "solves" the problem; in fact it just postpones it until the moment there are more tablespace files to open than the `innodb_open_files` value. In our case this is 3.3M files, but the dataset is really that large and we have that many tables, so we hit the performance problem immediately, even when running the `mysql_upgrade` script that checks all the tables while `innodb_open_files` was still at its default. By the way, we never needed to touch it until upgrading to 10.5.

So, the number of open tables (controlled by `innodb_open_files`, `table_open_cache`, `open_files_limit`) is the limit visible on that chart, and increasing it "solves" the issue. But there are disadvantages. The first is having all the tables open all the time; I suspect it might turn into a consistency problem some day. The second (and not yet investigated) problem is what will happen when you decide to rotate partitions on partitioned tables. As every partition is a file, rotation means creating (and opening, thus using more cache entries) new partitions and closing older ones. During this procedure there may be a performance drop, and it is not clear whether performance will return to normal once all the partitions in all the tables have been rotated.
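For reference, the partition rotation mentioned here boils down to DDL like the following (the table and partition names are invented for illustration; each statement creates or removes one .ibd file and hence one cache entry):

```sql
-- Hypothetical monthly rotation on a RANGE-partitioned table
ALTER TABLE metrics ADD PARTITION
    (PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')));
ALTER TABLE metrics DROP PARTITION p2022_02;  -- closes and deletes its .ibd
```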

Comment by Marko Mäkelä [ 2022-09-30 ]

euglorg, if you collected more data in the style of the graphs in MDEV-23855, you might confirm my finding in MDEV-25215. Quoting a comment from the ticket:

I see that your innodb_log_file_size is … a fraction of the innodb_buffer_pool_size. That will force very frequent log checkpoints, which in turn will cause stalls. That could actually be the root cause of your messages, if those messages always say "0 pending operations and pending fsync". I do not think that there is any need to execute fsync() or fdatasync() outside log checkpoints. Starting with 10.5, thanks to MDEV-19176, recovery should work fine even if you make the redo log as big as the buffer pool, or possibly even larger.

You did not provide any excerpt from your server error log. Do you find messages like this there:

2022-01-11 9:27:33 22 [Note] InnoDB: Cannot close file ./ownbfmprd/stf_stat_availability_office_bench_met.ibd because of 1 pending operations

The "pending operations" or "pending fsync" would be initiated for the log checkpoint.

You can either make the open file limits larger, or you could use a much larger innodb_log_file_size so that checkpoints are less frequent. You might also want to enable background flushing (see MDEV-27295), to avoid larger I/O spikes when a log checkpoint is urgently needed. The parameter innodb_io_capacity throttles the background flushing speed. I see that you already specify a reasonably large value for it.

Would a larger innodb_log_file_size solve the problem for you?
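Combined, these suggestions would look roughly like this (a sketch with placeholder values for this 64G-buffer-pool setup, not recommendations):

```
# sketch; on 10.5+ the redo log may be as large as the buffer pool (MDEV-19176)
innodb_log_file_size           = 64G
# eager background flushing (see MDEV-27295) to avoid checkpoint I/O spikes
innodb_max_dirty_pages_pct_lwm = 0.001
# alternatively, raise the open-file limits so files need not be closed
innodb_open_files              = 3300000
```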

Comment by Eugene [ 2022-09-30 ]

Sorry for not updating the ticket, Marko. Unfortunately, we had to downgrade to 10.4 everywhere, so at the moment it is impossible to check. However, I cannot remember lines like the ones you mentioned being logged by my mariadbd.
As for innodb_log_file_size, on the next upgrade attempt I will try setting it to 64 or 128GB and check the behavior. I cannot provide any ETA, however.

Comment by Marko Mäkelä [ 2022-09-30 ]

euglorg, the checkpoints and page flushing work quite differently before 10.5, and crash recovery may run out of memory if the log file is more than about ⅓ of the buffer pool size.

In 10.9, thanks to MDEV-27812, you no longer need a server restart to change the size of the log.

Comment by Sergei Golubchik [ 2022-10-30 ]

euglorg, as a rule we close tickets that get no feedback for a month. But worry not, we will reopen it if there is new info, for example after your next attempt to upgrade.

Generated at Thu Feb 08 09:53:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.