[MDEV-32050] UNDO logs still growing for write-intensive workloads Created: 2023-08-31 Updated: 2024-01-19 Resolved: 2023-10-25 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 11.1.1, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2, 11.2.0 |
| Fix Version/s: | 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Axel Schwenke | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | performance | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
The issue with growing innodb_history_length (aka growing disk footprint of UNDO logs) that was originally reported in |
| Comments |
| Comment by Julien Fritsch [ 2023-08-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
@axel said in slack to marko:
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2023-09-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I tested the TPC-C workload (sysbench tpcc) on recent versions of MariaDB with three different settings for limiting the UNDO log size. Configuration variants:
throughput goes down when UNDO log growth is limited
total size of UNDO logs after the benchmark run (MB)
MariaDB 10.11 was commit 725bd568346 with the madvise patch applied ( | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I think that the logic that was introduced in | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
wlad pointed out that the history list length is not shrinking after a server bootstrap. thiru suggested a simple mtr test to reproduce this, which I wrote as follows:
On 10.5, this would show 12 to me right after bootstrap. On 10.6, I see 7, thanks to an earlier fix.

I believe that the main motivation for the parameter innodb_purge_rseg_truncate_frequency is to avoid frequent calls to trx_purge_free_segment(). The rest of the operations done in trx_purge_truncate_rseg_history() should hopefully be less disruptive.

Edit: the attempted fix would introduce a resource leak similar to a previously reported one. | ||||||||||||||||||||||||||||||||||||||||||||||||
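To make the trade-off concrete, here is a minimal, hypothetical C++ model (not actual InnoDB code; the struct and field names are invented) of how a truncate-frequency parameter gates the expensive segment-freeing work to every Nth purge batch:

```cpp
#include <cstdint>

// Hypothetical model of innodb_purge_rseg_truncate_frequency gating.
struct PurgeModel {
  uint32_t truncate_frequency;   // innodb_purge_rseg_truncate_frequency
  uint32_t batches_run = 0;
  uint32_t segments_freed = 0;

  // Called once per purge batch; the expensive freeing work
  // (standing in for trx_purge_free_segment()) runs only
  // every truncate_frequency-th batch.
  void run_batch() {
    ++batches_run;
    if (batches_run % truncate_frequency == 0)
      ++segments_freed;
  }
};
```

With the historical default of 128, only one batch in 128 pays the cost of freeing segments; with a setting of 1, every batch does, which is why the history can shrink faster at the price of more frequent disruption.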
| Comment by Marko Mäkelä [ 2023-09-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
wlad correctly pointed out that srv_wake_purge_thread_if_not_active() may issue useless wake-ups of the purge coordinator task. Maybe, if we fix that, the performance penalty of setting innodb_purge_rseg_truncate_frequency=1 will be bearable. That would allow the history list to shrink as fast as possible, at the end of each purge batch. I am working on a revised wake-up condition that would pass tests and result in good performance. | ||||||||||||||||||||||||||||||||||||||||||||||||
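As a sketch of the idea (hypothetical types; the real srv_wake_purge_thread_if_not_active() is more involved), redundant wake-ups can be suppressed while the coordinator is already active by guarding the signal with an atomic flag:

```cpp
#include <atomic>

// Hypothetical sketch: only signal the purge coordinator when it is not
// already running, instead of waking it unconditionally on every commit.
struct PurgeWakeup {
  std::atomic<bool> purge_running{false};
  int wakeups = 0;   // counts actual wake-up signals sent

  // Analogue of srv_wake_purge_thread_if_not_active():
  // the compare-exchange succeeds only if purge was idle.
  void maybe_wake() {
    bool expected = false;
    if (purge_running.compare_exchange_strong(expected, true))
      ++wakeups;     // stands in for signaling the coordinator task
  }

  // Called by the coordinator when its batch finishes.
  void purge_done() { purge_running.store(false); }
};
```

Repeated commits while purge is active then cost only an atomic read-modify-write, not a futile task wake-up.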
| Comment by Marko Mäkelä [ 2023-09-21 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I think that it could make sense to move the time-consuming trx_purge_truncate_rseg_history() to a separate task from the purge coordinator, so that the processed undo pages can be freed concurrently with the next purge batch. I also improved the wakeup. This did not seem to help at all in a 256-connection, 60-second memory-bound oltp_update_index test (5×10000 rows, 100M buffer pool and log file). The throughput would gradually worsen in both cases. Replacing the default innodb_purge_threads=4 with the maximum innodb_purge_threads=32 does not make any noticeable difference for this workload. Using the maximum innodb_purge_batch_size=5000 instead of the default 300 lowered the throughput earlier, but with that setting the throughput would not monotonically decrease during the workload. I think that I must use offcputime from the BPF Compiler Collection to find the bottlenecks. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
To get reasonable stack traces from offcputime (working around https://github.com/iovisor/bcc/issues/1234), I compiled libc and libstdc++ with -fno-omit-frame-pointer as follows:

libc: added to glibc-2.37/debian/sysdeps/amd64.mk the line
and built with dpkg-buildpackage, then copied the libc.so.6* files from the build directory while the tests were running.

libstdc++:
With LD_LIBRARY_PATH pointing to these self-built libc and libstdc++, I executed the workload script and concurrently
to record stacks for 30 seconds (which was shorter than my workload).

In my first run, I am seeing a wait for rseg->latch in trx_undo_assign_low() (during an UPDATE statement, allocating the first undo log) as well as in trx_t::write_serialisation_history(). There are only 128 rollback segments, and I was using 256 concurrent transactions.

This run was with 32 tables of 10000 rows each. I was actually hoping that unlike the previous runs using 5 tables, this one would have shown an improvement when increasing the default innodb_purge_threads=4 to innodb_purge_threads=32, but that was not the case. I think that I must test with fewer concurrent connections to see where the other bottlenecks are. Adding more than 128 rollback segments would require significant changes to the file format. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I made another oltp_update_index test, with 5×10000 rows and a 100MiB buffer pool and log file, but only 128 concurrent connections. I recorded 10 seconds of offcputime starting at around 40 seconds, when the throughput starts to drop.

In the flame graph, I observe 63% of waits in ha_innobase::update_row(). 10% is due to trx_undo_report_row_operation() (waiting for rseg->latch, an undo page latch, or the system tablespace latch), and 51% is due to row_upd_sec_index_entry(), mostly waiting for page latches in btr_cur_t::search_leaf(). Until MDEV-17598 has been implemented, MVCC and lock checks for secondary indexes will suffer heavily from purge lag.

I repeated the test with innodb_undo_tablespaces=32 instead of the 10.6 default innodb_undo_tablespaces=0. This alleviates the bottleneck on the system tablespace latch that can be seen in purge-bottleneck-5-10000-128-40s-50s.svg.

The impact of the purge lag is heavily amplified when using a small buffer pool. After I increased the buffer pool and log to 1GiB, the throughput remained unchanged, even though most history accumulated until the end of the test. Purging the history took 27 seconds (about 1/3 of the time that it took when using 10% of that buffer pool size with the same workload).

It could make sense to monitor the waits during a slow shutdown when the history is being purged, maybe using a small buffer pool. That should highlight the bottlenecks in the purge subsystem itself. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
According to offcputime, during the slow shutdown, most threads are idle. When using a small buffer pool, I see some io_uring threads; with a larger buffer pool they disappear. The waits in purge are a small fraction of the total waits, and those are dominated by the secondary index checks (to be fixed by MDEV-17598). I also checked top during the workload and shutdown. During the workload with 128 concurrent client connections, I got about 3300% CPU usage (33 cores used). During the slow shutdown, that dropped to 100%. For some reason, the purge coordinator appears to be doing all work by itself here. This could be the key. The next step could be to add some diagnostic messages to the purge worker callback, to confirm if this really is the case. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I repeated the oltp_update_index benchmark with 32×1560 rows, innodb_undo_tablespaces=32 and 32 concurrent connections for 60 seconds. I used a 4GiB buffer pool to avoid any page I/O, and I ran it on a RAM disk. In this benchmark, the purge_coordinator_callback will spend 5/6 of its waiting time in TrxUndoRsegsIterator::set_next(), waiting for purge_sys.rseg->latch. Under this workload, the following patch made almost no difference.
I was expecting that this could slow down the transaction start and commit, but the throughput in fact remained almost unchanged. There was slightly less history to purge at the end of the 60-second test. This synchronization primitive used to be a mutex, until | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I am trying to refactor purge_coordinator_callback so that it will continuously hold an exclusive purge_sys.rseg->latch for a little longer time, basically across one call of trx_purge_fetch_next_rec(), that is, choosing the next undo log record to process. That would hopefully reduce the waits in the purge coordinator as well as act as a natural throttle for the producers (transaction start or commit in DML operations). | ||||||||||||||||||||||||||||||||||||||||||||||||
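The intended effect can be illustrated with a small stand-alone sketch (invented names; std::mutex stands in for purge_sys.rseg->latch): locking once per run of consecutive records instead of once per record cuts the number of latch acquisitions proportionally:

```cpp
#include <mutex>

// Hypothetical model comparing per-record latching with holding the
// latch across a run of consecutive undo log records.
struct LatchCounter {
  std::mutex latch;     // stands in for purge_sys.rseg->latch
  int acquisitions = 0; // how many times the latch was taken
};

// Baseline: one latch acquisition per undo record fetched.
int fetch_per_record(LatchCounter &l, int n_records) {
  int fetched = 0;
  for (int i = 0; i < n_records; ++i) {
    std::lock_guard<std::mutex> g(l.latch);
    ++l.acquisitions;
    ++fetched;          // stands in for trx_purge_fetch_next_rec()
  }
  return fetched;
}

// Refactored: one acquisition covers a run of `run_len` records.
int fetch_batched(LatchCounter &l, int n_records, int run_len) {
  int fetched = 0;
  for (int i = 0; i < n_records; i += run_len) {
    std::lock_guard<std::mutex> g(l.latch);
    ++l.acquisitions;
    for (int j = i; j < i + run_len && j < n_records; ++j)
      ++fetched;
  }
  return fetched;
}
```

For a 300-record batch processed in runs of 50, the latch is taken 6 times instead of 300, while each hold is correspondingly longer; the latter also acts as the natural throttle on producers mentioned above.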
| Comment by Marko Mäkelä [ 2023-09-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
My initial refactoring to acquire and release purge_sys.rseg->latch less often improved things a tiny bit in my benchmark. trx_purge() now accounts for 2.58% of all waits, instead of the previous 3.0%, and the majority (2.04%) is due to trx_purge_choose_next_log(). It could make sense to keep holding purge_sys.rseg->latch for several consecutive parsed undo log records, or until a different purge_sys.rseg is chosen. To reduce the impact on DML, we might make transaction start avoid the rollback segment that the purge coordinator task is currently working on. | ||||||||||||||||||||||||||||||||||||||||||||||||
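A toy illustration of the last idea (purely hypothetical; actual rollback segment assignment in InnoDB is different): a round-robin chooser that skips the segment the purge coordinator is currently processing:

```cpp
#include <cstddef>

// Hypothetical sketch: a starting transaction picks a rollback segment
// round-robin, but steps over the one the purge coordinator holds.
struct RsegChooser {
  static constexpr size_t N_RSEGS = 128; // InnoDB has 128 rollback segments
  size_t next = 0;        // round-robin cursor
  size_t purge_busy = 0;  // index purge is currently working on

  size_t choose() {
    size_t r = next++ % N_RSEGS;
    if (r == purge_busy)
      r = next++ % N_RSEGS;  // avoid contending with the purge task
    return r;
  }
};
```

DML would then rarely block on the latch that purge is holding, at the cost of a slightly uneven segment distribution.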
| Comment by Axel Schwenke [ 2023-09-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I ran a benchmark of commit ac29a54908e (HEAD of bb-10.6-

32 thd, innodb_purge_threads=4 (default)
32 thd, innodb_purge_threads=32
Observations:
The encouraging results made me repeat the benchmark with 64 threads: 64 thd, innodb_purge_threads=4 (default)
64 thd, innodb_purge_threads=32
The results are not as good as before; we had some UNDO log growth during the run. But the growth of the UNDO logs was much smaller for innodb_purge_threads=32.

The fact that the innodb_purge_threads setting does not affect throughput but results in smaller UNDO logs is worth noting. We should give that tuning advice to customers. And maybe raise the default (make it relative to nproc?) | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
axel, thank you, this is very promising. Today, I diagnosed the reason why a shutdown with innodb_fast_shutdown=0 quickly degrades to employing no purge_worker_task. At the start of the shutdown, the loop in purge_coordinator_state::do_purge() would invoke trx_purge() with 32 threads, then 31, 30, …, 1, and then keep using 1 thread until the very end.

I made an experiment to revert the throttling logic that had been added in

I see that one of my test scenarios in | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I posted some results from my experiment to | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
The oltp_update_non_index benchmark in | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
In a modified variant of the
Here are some figures after a 5-second workload using 16,000 rows (starting with 128 tables×250 rows/table) and the default innodb_purge_threads=4:
Based on perf record analysis, I suspect that the bottleneck is the waits for dict_sys.latch in a table ID lookup. Maybe each purge task should maintain a larger local cache that maps table IDs to table pointers. Eviction of those tables from dict_sys would be prevented by dict_table_t reference counters. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
A possible enhancement could be that in trx_purge_attach_undo_recs(), which is run by purge_coordinator_task, we would look up the tables. We already associate each unique table_id with a purge worker there:
If only the purge_coordinator_task looks up the tables, there will be no “unhealthy competition” with purge workers. Also, if a table has been dropped, the purge coordinator will be able to immediately discard any undo log records, without adding them to node->undo_recs. We could add another lookup table to purge_node_t and perform the table lookup in the else branch above. At the end of the purge batch, the purge coordinator would go through the lookup tables and release the table references. | ||||||||||||||||||||||||||||||||||||||||||||||||
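The scheme could be sketched as follows (an invented model, not the actual trx_purge_attach_undo_recs() code): the coordinator performs at most one dict_sys lookup per distinct table ID per batch, caches the handle with a reference count, and routes all records of a table to the same worker:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical table handle; refs models the dict_table_t reference
// counter that prevents eviction while the batch is running.
struct TableHandle { uint64_t id; int refs = 0; };

// Hypothetical purge coordinator: the only thread touching dict_sys.
struct Coordinator {
  static constexpr int N_WORKERS = 4;  // innodb_purge_threads stand-in
  std::unordered_map<uint64_t, TableHandle> cache;  // per-batch cache
  int dict_lookups = 0;  // how often dict_sys.latch would be taken

  // Returns the worker index for an undo record of this table;
  // the dict_sys lookup happens only on a cache miss.
  int attach(uint64_t table_id) {
    auto it = cache.find(table_id);
    if (it == cache.end()) {
      ++dict_lookups;  // the single dict_sys access for this table
      it = cache.emplace(table_id, TableHandle{table_id, 0}).first;
    }
    ++it->second.refs;
    return int(table_id % N_WORKERS);  // same table -> same worker
  }

  // End of batch: release all table references and drop the cache.
  void finish_batch() { cache.clear(); }
};
```

Workers then never compete for dict_sys.latch, and a dropped table can be detected once, at attach time, instead of by every worker.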
| Comment by Marko Mäkelä [ 2023-09-29 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I implemented a crude prototype of the idea. This improved the slow shutdown time after a 5-second run of my Some problems related to DDL will need to be addressed:
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
It looks like I was able to remove the copying of undo log records in rollback, online ALTER TABLE and purge. I think that the copying is still needed (or is simplest to retain) for MVCC reads.

I noticed some repeated buf_pool.page_hash lookups in purge. It could be worthwhile to save more references to buffer-fixed undo pages while a purge batch is active, or maybe even to construct a local hash table from page identifiers to buffer-fixed pages. The DDL-related problems have not been addressed yet. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
It looks like I may have sorted out the trouble with DDL operations, at least when it comes to the regression test suite, but something is still causing hangs with some tests on indexed virtual columns. I think that adding a cache for buf_pool.page_hash could be a possible future enhancement. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks to some help from nikitamalyavin and wlad, I think that this now is ready for some performance and stress tests. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I got terrible performance yesterday. In response, I implemented the cache for buf_pool.page_hash lookups. In this way, only the purge_coordinator_task will buffer-fix and buffer-unfix undo log pages. That alone did not help much, but a larger innodb_purge_batch_size did. It gives the purge threads a chance to run for a longer time per batch, reducing the setup time. I think that we need to consider enlarging the maximum value of innodb_purge_batch_size and increasing the default from 300 to 1000. Partly this could be due to different accounting: the new page cache counts each and every accessed undo page, while the old code only counted some of them. The batch size is simply a limit on the number of undo pages that may be processed during a batch. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rick James [ 2023-10-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Is there (or will there be) a GLOBAL STATUS that hints at whether `innodb_purge_batch_size` is too small or too large? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
rjasdfiii, we can’t easily determine the ideal size of each batch, but we might count the number of times when innodb_purge_batch_size is the limiting factor. Provided that purge_sys.view is advancing at a steady rate (there are no long-running transactions or old read views active), the growth rate of such a counter would be a hint that increasing the innodb_purge_batch_size might help. | ||||||||||||||||||||||||||||||||||||||||||||||||
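Such a counter might be modeled like this (a hypothetical sketch, not an actual MariaDB status variable):

```cpp
#include <cstdint>

// Hypothetical status counter: increment whenever a purge batch stops
// because innodb_purge_batch_size was the limiting factor.
struct BatchLimitCounter {
  uint32_t batch_size;            // innodb_purge_batch_size, in undo pages
  uint64_t truncated_batches = 0; // the would-be GLOBAL STATUS counter

  // Returns the number of undo pages actually processed this batch.
  uint32_t run_batch(uint32_t pages_available) {
    if (pages_available >= batch_size) {
      ++truncated_batches;        // batch size capped the work
      return batch_size;
    }
    return pages_available;       // purge drained everything available
  }
};
```

A steadily growing truncated_batches value while purge_sys.view keeps advancing would then be the hint that raising innodb_purge_batch_size could help.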
| Comment by Rick James [ 2023-10-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
@MarkoMakela - Thanks for the comment. I spend a lot of time helping others with problems, but I don't know all the internal details. When there is a [[VARIABLE]] that looks tempting to tune, I am often at a loss as to whether, and by how much, to raise or lower it. Some variables even have multiple hints ([[buffer_pool]], [[table_open_cache]]) to simplify the task. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2023-10-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
This was approved by wlad and vlad.lesin and is now waiting for a final performance test, to see if increasing the default value of innodb_purge_batch_size from 300 to 1000 would be reasonable. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Based on the tests by axel, we will also change the default value of innodb_purge_batch_size from 300 undo log pages to 1000. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2023-11-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Pretty excited about this one and the MDL race condition fix in 10.6.16. Letting it bake for a week or so before landing it on any production instance. I just upgraded from 10.5.13 to 10.6.15 on Monday. Performance went up already by almost a factor of 2 on the NVMe drives, but undo does not flush fast enough. Will check back in in a few weeks. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
TheWitness, I’m looking forward to your results. This was indeed long overdue. I think that the next big thing to reduce purge lag will be MDEV-17598, which will involve a file format change. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2023-11-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I'll keep you posted. The update is scheduled for 11/28. The other ticket is also a big bit of work. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-21 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
It looks like these changes increased the likelihood of some regression test failures. Hopefully after this fixup no such failures will occur. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2023-11-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Marko, I'm planning on upgrading tomorrow. With these regressions, do you think I should wait? I'll be pulling the latest 10.6.16 RPMs. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2023-11-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Ope, just saw it's been moved at the Enterprise level till 12/11. Waiting two more weeks then. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
TheWitness, the fixups that I mentioned involve modifying some regression tests only, no code changes. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2023-11-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks! | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
@Marko Mäkelä, I just upgraded from 10.6.15 to 10.6.16-11 this morning. So far, the history is looking under control. I'm presently watching I/O levels and will track MariaDB's ability to keep it in check over the next few days. If we can keep it under 5M, that'll be good. I've got a few things disabled that I will re-enable once we get good measurements with these specific components disabled. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
TheWitness, the high-level summary is that this change

As soon as | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
@Marko Mäkelä, Yea, I need something, as the undo log is growing pretty consistently, unlike 10.6.15-9 where it was staying under 2M consistently. Here are my current purge settings if you want me to tune anything. Let me know where to go to get the custom binary, as I suspect it's going to grow overnight. Larry | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is the last 12 hours. You can see when I did the upgrade, it started increasing linearly as soon as I applied the patch. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Last two weeks. You can see it's gone up before. Have to watch it over night. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I was running some pretty long analytic queries last week when the undo log was going up. None today though. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I either need that patch, or need to roll back to 10.6.15. Let me know ASAP. Not sure what to download and from where. The history is up to over 35M this morning. Larry | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
https://ci.mariadb.org/42505/ contains the packages for the 10.6 main branch for the fix of | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Okay, thanks Marko. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Marko, The RHEL8 RPMs are missing. It seems they are the only ones missing. Can you get that fixed? Larry | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2024-01-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Take the next build, https://ci.mariadb.org/42507/
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2024-01-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks Sergei. I rolled back to 10.6.15-10 Enterprise last night. I'll wait for another signal from Marko before taking further action. The history list length is slowly winding down towards zero again. It's not shrinking as aggressively as I would like, but at least it's going in the right direction now. Larry |