[MDEV-26603] asynchronous redo log write Created: 2021-09-14 Updated: 2023-12-27

| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 11.5 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Vladislav Vaintroub | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | Preview_10.9 |
| Description |
|
The most important use case is the thread pool: an asynchronous redo log write avoids blocking the thread of the group commit leader.
| Comments |
| Comment by Marko Mäkelä [ 2022-02-22 ] |
|
While working on […], can you try to merge this with the work-in-progress […]?
| Comment by Vladislav Vaintroub [ 2022-02-24 ] |
|
marko, this merges almost cleanly. I can resolve the remaining conflicts once […].
| Comment by Vladislav Vaintroub [ 2022-02-25 ] |
|
mleich, some write-intensive testing would be good.
| Comment by Matthias Leich [ 2022-02-25 ] |
|
The build completed successfully.
| Comment by Vladislav Vaintroub [ 2022-02-25 ] |
|
Pardon, mleich, it was a small glitch related to PMEM. I'd recommend also testing on a normal SSD/HDD, as that is the primary target for the patch.
| Comment by Matthias Leich [ 2022-02-28 ] |
|
[…]
| Comment by Marko Mäkelä [ 2022-03-02 ] |
|
Now that […]
| Comment by Vladislav Vaintroub [ 2022-03-07 ] |
|
Apparently, […]
| Comment by Marko Mäkelä [ 2022-03-08 ] |
|
Some added functions or parameters need to be documented. There is a design conflict with […].
| Comment by Marko Mäkelä [ 2022-03-08 ] |
|
I realized that when the log resizing completes, there is no need to write anything extra to the log files apart from the checkpoint block headers. That simplification is the easiest way to resolve the conflict. The log resizing of course needs to be extensively tested again, along with normal recovery and backup testing. I pushed some suggested cleanup. Please review it and submit a squashed commit for my final review. I have not yet reviewed all of it; what I have read so far looks OK to me.
| Comment by Marko Mäkelä [ 2022-03-09 ] |
|
This looks good. Could you please try to revive the log_flush_task that was removed in […]?
| Comment by Vladislav Vaintroub [ 2022-03-09 ] |
|
I tried to pre-flush the log with the following code, but it did not make any measurable difference in testing: […]
| Comment by Axel Schwenke [ 2022-03-15 ] |
|
Performance testing with sysbench shows no improvement: sysbench.pdf
| Comment by Vladislav Vaintroub [ 2022-03-15 ] |
|
There are some slight signs of improvement, especially at low concurrency. What is missing in the testing, for the full picture, is probably […].
I think it is enough to test only update_index and update_non_index with the above variables. O_DIRECT can be reconsidered in general, though this is up to marko.
| Comment by Marko Mäkelä [ 2022-03-15 ] |
|
Yes, after […]
| Comment by Axel Schwenke [ 2022-03-16 ] |
|
Performance testing with sysbench-tpcc shows no improvement either. For the CPU-bound case, MDEV-26603 is on par with the 10.9 baseline. For the IO-bound case it falls behind: tpcc4.pdf. Using async log flushing has the same problem: the general numbers increase, but the baseline is still faster: tpcc5.pdf. IO-wise, this test was done with the default setting for innodb_flush_method; it should have been O_DIRECT.
| Comment by Axel Schwenke [ 2022-03-16 ] |
|
In numbers: […]
| Comment by Marko Mäkelä [ 2022-03-17 ] |
|
I conducted some benchmarks of my own, comparing preview-10.9-MDEV-26603-async-redo-write to its parent, both with a merge of […]. In a test of LRU eviction flushing (innodb_buffer_pool_size=1G, innodb_log_file_size=4G, innodb_flush_log_at_trx_commit=0, a few gigabytes of working set), I observed an improvement: […]
In all other cases I observed a small performance regression, for all three values of innodb_flush_log_at_trx_commit. This is a dual-socket system, and I tried to use enough threads to generate some cross-NUMA traffic. It could be wise to revise the changes that were made to write_lock and flush_lock.
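For context, write_lock and flush_lock gate the "write log up to LSN" and "flush log up to LSN" steps of group commit. The following is a deliberately simplified sketch of the general idea only (it is not InnoDB's actual group_commit_lock implementation, and the names are mine): a caller either becomes the owner and performs the operation on behalf of everyone queued behind it, or merely waits until the completed LSN covers its request.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified sketch of an LSN-gated group commit lock.
class LsnLock {
public:
  // Returns true if the caller became the owner and must perform the
  // operation up to at least pending(); false if another thread's
  // completed work already covers lsn.
  bool acquire(uint64_t lsn) {
    std::unique_lock<std::mutex> lk(m_);
    if (completed_ >= lsn) return false;       // already durable
    pending_ = std::max(pending_, lsn);
    if (owned_) {                              // another thread is writing
      cv_.wait(lk, [&] { return completed_ >= lsn || !owned_; });
      if (completed_ >= lsn) return false;     // covered while we waited
    }
    owned_ = true;                             // we are the new owner
    return true;
  }
  // The owner calls this after completing the operation up to lsn.
  void release(uint64_t lsn) {
    std::lock_guard<std::mutex> lk(m_);
    completed_ = std::max(completed_, lsn);
    owned_ = false;
    cv_.notify_all();
  }
  uint64_t pending() {
    std::lock_guard<std::mutex> lk(m_);
    return pending_;
  }
private:
  std::mutex m_;
  std::condition_variable cv_;
  uint64_t completed_ = 0, pending_ = 0;
  bool owned_ = false;
};
```

The NUMA-sensitive part in practice is the wakeup pattern: notify_all after every release means every waiter's cache line bounces across sockets, which is one plausible source of the small regression mentioned above.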
| Comment by Marko Mäkelä [ 2022-04-01 ] |
|
I tried to chase down the performance regression for the basic case (not the LRU flushing case, where I was able to reproduce a 10% improvement). I do not know what I am doing wrong this time, but here are my notes from […].
On both NVMe with ext4 and RAM with tmpfs, I consistently saw a performance regression. With libaio, the regression on NVMe was smaller. With io_uring, there were kernel stack traces with schedule_timeout being called by io_wqe_worker when using the async redo log write. To avoid the "fake PMEM" code path of […], I tried innodb_force_recovery=2 to disable the purge of transaction history, and innodb_undo_tablespaces=100 to reduce contention on the fil_space_t::latch of the system tablespace. The regression became somewhat smaller but remained. Nothing in the performance_schema output really stood out.
I tested a simple Sysbench oltp_update_index at 10 threads for 10 or 30 seconds, with innodb_flush_log_at_trx_commit=1 and a buffer pool and log file large enough that no checkpoint flushing would take place. The performance regression was prominent even in such tiny runs.
wlad, unless you are able to figure out the cause of the regression, I am afraid that we must put this on hold.
| Comment by Vladislav Vaintroub [ 2022-04-01 ] |
|
No, I am not able to figure out the cause, but I am attaching the flamegraphs for posterity. The CPU flamegraph shows more of log_write_up_to being outsourced to a background threadpool thread, which is theoretically a good thing.
| Comment by Vladislav Vaintroub [ 2022-04-01 ] |
|
Since nobody is interested in pushing this into 10.9, or in performing more Linux-specific benchmarking and analysis, I am reassigning the priority accordingly.
| Comment by Marko Mäkelä [ 2022-04-14 ] |
|
I think that […]. Part of MDEV-21423 (scalability bottleneck due to trx_sys.rw_trx_hash) was addressed by […].
Note: the 10.9 baseline is 2 commits older than the merge of […].
| Comment by Marko Mäkelä [ 2022-04-14 ] |
|
I realized that, unlike for […]: […]
Except for the lowest concurrency, it is actually looking good. This 30-second benchmark is of course too short to draw any real conclusions, but it does not look too bad. For the 10.9 baseline, the checkpoint flush occurred while the test was running at 160 concurrent users. For 640 concurrent users, I restarted with a fresh sysbench prepare and run, as with all the recent benchmarks; that is why it shows slightly better throughput than the 320-user test.
| Comment by Marko Mäkelä [ 2022-04-20 ] |
|
To assess the impact of […]: […]
We can observe some insignificant improvement at 80 concurrent connections (which is "polluted" by the checkpoint flush that occurred during that test), and otherwise a performance regression or no improvement. One interesting change is that with the […]. This time, no checkpoint flushing occurred during the 80-client run, and we see no significant improvement; the clear regression at 160 clients remained. The counterintuitive performance regression could partly be addressed by […].
Traversal of the entire trx_sys.rw_trx_hash table is necessary not only for checking locks on secondary indexes, but also for read view creation. Let us additionally specify --transaction-isolation=READ-UNCOMMITTED to reduce that activity (purge_sys.view must still be updated), and test it also with the […].
In this scenario with reduced activity around trx_sys.rw_trx_hash, […]. We will need deeper analysis to identify the bottleneck that causes the counterintuitive performance regression; MDEV-21423 may or may not fix it. An artificial benchmark that concurrently updates a very large number of SEQUENCE objects ([…]) […]. Off-CPU analysis (http://www.brendangregg.com/offcpuanalysis.html) could be useful if it did not emit most call frames as "unknown" in my recent tests. I should investigate whether https://github.com/iovisor/bcc/issues/3884 would fix that.
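To see why read view creation contends on a shared registry of active transactions, here is a deliberately simplified sketch (names and visibility rules are simplified for illustration; this is not InnoDB's actual code): every new read view must snapshot the IDs of all uncommitted transactions under a global lock, so the cost grows with both the rate of view creation and the number of concurrent transactions.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <set>
#include <vector>

// Simplified stand-in for trx_sys.rw_trx_hash.
struct TrxRegistry {
  std::mutex m;
  std::set<uint64_t> active;  // IDs of uncommitted read-write transactions

  // Opening a read view copies the whole active set while holding the
  // global lock; with many connections this traversal is a bottleneck.
  std::vector<uint64_t> open_view() {
    std::lock_guard<std::mutex> lk(m);
    return {active.begin(), active.end()};  // sorted, since std::set
  }
  void begin(uint64_t id)  { std::lock_guard<std::mutex> lk(m); active.insert(id); }
  void commit(uint64_t id) { std::lock_guard<std::mutex> lk(m); active.erase(id); }
};

// Simplified visibility rule: a row version written by trx `id` is
// visible to a view iff `id` was not active when the view was opened.
inline bool visible(const std::vector<uint64_t> &view, uint64_t id) {
  return !std::binary_search(view.begin(), view.end(), id);
}
```

READ UNCOMMITTED avoids opening such snapshots entirely, which is why it was used above to isolate how much of the regression is attributable to this traversal.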
| Comment by Marko Mäkelä [ 2022-06-07 ] |
|
As noted in […].
| Comment by Marko Mäkelä [ 2023-04-28 ] |
|
For the record, the redo log checkpoint used to be written asynchronously until the code was simplified in MariaDB Server 10.5.0. I do not think that bringing it back would help much, but I thought I would mention it for the sake of completeness.