[MDEV-26603] asynchronous redo log write Created: 2021-09-14  Updated: 2023-12-27

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 11.5

Type: New Feature Priority: Minor
Reporter: Vladislav Vaintroub Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 1
Labels: Preview_10.9

Attachments: File base_cpu.svg     File base_offcpu.svg     File patch_cpu.svg     File patch_offcpu.svg     PDF File sysbench.pdf     PDF File tpcc4.pdf     PDF File tpcc5.pdf    
Issue Links:
PartOf
is part of MDEV-28112 prepare 10.9.0 preview releases Closed
Relates
relates to MDEV-18115 Remove dummy tablespace for the redo log Closed
relates to MDEV-27848 Remove unused wait/io/file/innodb/inn... Closed
relates to MDEV-18959 Engine transaction recovery through p... Stalled
relates to MDEV-27916 InnoDB ignores log write errors Closed

 Description   

The most important use case is the thread pool: an asynchronous redo log write avoids blocking the thread that acts as the group commit leader.
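
Conceptually, the change replaces a blocking wait inside log_write_up_to() with a completion callback that runs when the write finishes, so a thread pool worker can return to serving other connections in the meantime. Below is a minimal, self-contained sketch of that pattern in generic C++ (an illustration only, not InnoDB code; the names write_log_up_to and log_write_up_to_async are invented for the example):

// Sketch of the asynchronous-write pattern only; not InnoDB code.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <future>

using completion_callback = std::function<void()>;

// Stand-in for the real work: writing the log buffer up to the requested LSN.
static void write_log_up_to(std::uint64_t lsn)
{
  std::printf("log written up to LSN %llu\n",
              static_cast<unsigned long long>(lsn));
}

// Asynchronous variant: the write and the callback are handed to a background
// thread, so the caller (e.g. the group commit leader) does not block.
static std::future<void> log_write_up_to_async(std::uint64_t lsn,
                                               completion_callback cb)
{
  return std::async(std::launch::async, [lsn, cb = std::move(cb)] {
    write_log_up_to(lsn);
    cb();                       // wake up the transactions waiting to commit
  });
}

int main()
{
  auto done = log_write_up_to_async(42, [] { std::puts("commit group woken"); });
  // A thread pool worker would return here and pick up other work.
  done.wait();                  // only so that the demo does not exit early
}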



 Comments   
Comment by Marko Mäkelä [ 2022-02-22 ]

While working on MDEV-27812, I noticed that in MDEV-14425, I accidentally removed checks that the log writes succeed (MDEV-27916). Similar error handling must be implemented also for the asynchronous log write code path.
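
A sketch of the idea, under assumptions: in the synchronous path the caller can check the result of the write directly, whereas the asynchronous path has to perform the check in the completion handler and propagate any error through the callback, since the initiating thread has already moved on. The names below are invented for illustration and are not the actual MDEV-27916 fix.

// Illustration only: report the write result through the completion callback,
// because the initiating thread no longer waits for the write.
#include <cerrno>
#include <cstddef>
#include <functional>
#include <sys/types.h>

using async_write_callback = std::function<void(int /* 0 on success, else errno */)>;

static void on_async_log_write_complete(ssize_t written, size_t expected,
                                        const async_write_callback &cb)
{
  // A short write or a negative return value must not be silently ignored.
  cb(written == static_cast<ssize_t>(expected) ? 0 : (errno ? errno : EIO));
}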

Can you try to merge this with the work-in-progress MDEV-27812 branch?

Comment by Vladislav Vaintroub [ 2022-02-24 ]

marko, this merges almost fine. I can resolve the remaining conflicts once MDEV-27812 is pushed.

Comment by Vladislav Vaintroub [ 2022-02-25 ]

mleich, some write-intensive testing would be good.

Comment by Matthias Leich [ 2022-02-25 ]

The build succeeded.
The first ./mtr run fails during bootstrap with:
mariadbd: /data/Server/bb-10.9-MDEV-26603-async-redo-write/storage/innobase/log/log0log.cc:776: void log_write_up_to(lsn_t, bool, const completion_callback*): Assertion `!callback' failed.

Comment by Vladislav Vaintroub [ 2022-02-25 ]

Pardon mleich, it was a small glitch related to PMEM. I'd recommend also testing on a normal SSD/HDD; that is the primary target for the patch.

Comment by Matthias Leich [ 2022-02-28 ]

origin/bb-10.9-MDEV-26603-async-redo-write b78470cbb0335df848aab3074d0a0bffdd721fb8 2022-02-25T12:43:45+01:00
behaved well in RQG testing.

Comment by Marko Mäkelä [ 2022-03-02 ]

Now that MDEV-27812 has reached 10.9, this work needs to be rebased for the final review and testing.

Comment by Vladislav Vaintroub [ 2022-03-07 ]

Apparently, MDEV-27812 was redone in a way that is incompatible with what was done here with respect to locking. I am reassigning this to marko to figure out what to do to fix the locking error highlighted here: https://buildbot.mariadb.org/#/builders/234/builds/6237

Comment by Marko Mäkelä [ 2022-03-08 ]

Some added functions or parameters need to be documented.

There is a design conflict with MDEV-27812 that has to be solved as part of this task. On a log checkpoint where the log resizing completes, both the current and the resized log file will have to be durably written to the very end (up to the current log_sys.lsn), before any files are closed or renamed. In the current form of the change, this is not the case. There is a race condition, write_lock is being released twice, and the completion of log resizing would not be crash-safe.
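
To illustrate the write_lock hazard mentioned above, here is a schematic in plain C++ (an illustration of the exactly-once hand-over requirement, not the actual InnoDB code; InnoDB's write_lock is a dedicated group commit lock rather than this simple flag):

// Illustration only: model write_lock as a "write in progress" flag whose
// ownership must pass from the initiating thread to the completion callback
// exactly once.  Releasing it in both places (the bug described above) would
// let a concurrent writer start while the previous write is still completing.
#include <atomic>
#include <cassert>

static std::atomic<bool> write_lock_held{false};

void start_async_log_write()
{
  bool expected = false;
  while (!write_lock_held.compare_exchange_weak(expected, true))
    expected = false;           // spin until we own the write slot
  // Submit the asynchronous write here.  Ownership of write_lock has been
  // transferred to the completion callback; the initiator must NOT release it.
}

void on_log_write_complete()
{
  const bool was_held = write_lock_held.exchange(false);  // the only release
  assert(was_held);             // a second release would indicate the race
}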

Comment by Marko Mäkelä [ 2022-03-08 ]

I realized that when the log resizing completes, there is no need to write anything extra to the log files, apart from the checkpoint block headers. That simplification is the easiest way to resolve the conflict. The log resizing of course needs to be extensively tested again, along with normal recovery and backup testing.

I pushed some suggested cleanup. Please review it and submit a squashed commit for my final review. I did not yet review all of it; what I have read so far looks OK to me.

Comment by Marko Mäkelä [ 2022-03-09 ]

This looks good. Could you please try to revive the log_flush_task that was removed in MDEV-25948? That is, check if adding an asynchronous log write between buf_flush_page_cleaner() batches could reduce the latency of checkpoint flushing.

Comment by Vladislav Vaintroub [ 2022-03-09 ]

I tried to pre-flush the log with the code below. It did not make any measurable difference in testing.

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index bbbd26bd8b4..265aff708d9 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1514,6 +1514,10 @@ static ulint buf_flush_list(ulint max_n= ULINT_UNDEFINED, lsn_t lsn= LSN_MAX)
   if (buf_pool.n_flush_list())
     return 0;
+  /* Initiate redo log writing, if not already in progress,
+  to speed up possible synchronous log writes later on.*/
+  log_buffer_flush_to_disk_async();
+
   mysql_mutex_lock(&buf_pool.mutex);
   const bool running= buf_pool.n_flush_list_ != 0;
   /* FIXME: we are performing a dirty read of buf_pool.flush_list.count
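
For context, log_buffer_flush_to_disk_async() is a helper added by this patch. I do not have its body here, but a plausible shape (an assumption based on the log_write_up_to(lsn_t, bool, const completion_callback*) signature quoted earlier in this ticket, not the actual implementation) would be to start a log write up to the current LSN with a no-op completion callback:

// Assumption, not the actual patch.  The declarations are restated only to
// make the sketch self-contained; log_get_lsn() is a hypothetical stand-in
// for reading the current log_sys LSN.
typedef unsigned long long lsn_t;

struct completion_callback
{
  void (*m_callback)(void *);
  void *m_param;
};

void log_write_up_to(lsn_t lsn, bool durable, const completion_callback *cb);
lsn_t log_get_lsn();

static void noop_callback(void *) {}

void log_buffer_flush_to_disk_async()
{
  static const completion_callback dummy{noop_callback, nullptr};
  // Initiate the write without waiting for it; whether the write is also
  // made durable (the second argument) is part of the assumption.
  log_write_up_to(log_get_lsn(), true, &dummy);
}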

Comment by Axel Schwenke [ 2022-03-15 ]

Performance testing with sysbench shows no improvement: sysbench.pdf

Comment by Vladislav Vaintroub [ 2022-03-15 ]

There are some slight signs of improvement, especially at low concurrency.

I think what is missing in testing, for the full picture, is probably:

  • innodb_flush_method=O_DSYNC, or ask marko how to change the code to use O_DIRECT for redo log writes by default; for some unclear/undocumented reason, it is not yet used on Linux. This is especially relevant when using libaio, because according to widespread rumors, it does not work well with cached files.
  • thread_handling=pool-of-threads

I think it is enough to test only update_index and update_non_index with the above variables. Enabling O_DIRECT by default can be reconsidered in general, though this is up to @marko.

Comment by Marko Mäkelä [ 2022-03-15 ]

Yes, after MDEV-27774 removed a bottleneck, it should be worth trying to enable O_DIRECT on the Linux redo log by default.

Comment by Axel Schwenke [ 2022-03-16 ]

Performance testing with sysbench-tpcc shows no improvement either. For the CPU-bound case, MDEV-26603 is on par with the 10.9 baseline. For the IO-bound case, it falls behind: tpcc4.pdf

Using async log flushing has the same problem. The general numbers increase, but the baseline is still faster: tpcc5.pdf

IO-wise, this test was done with the default setting for innodb_flush_method, which should be O_DIRECT.

Comment by Axel Schwenke [ 2022-03-16 ]

In numbers

sysbench-tpcc @ 32 thd, scale 64x10 (cpu bound)
 
version                         TPS
--------------------------------------
mariadb-10.7.3                  3729.4
mariadb-10.8-0366fb5f54a        3799.9
mariadb-10.9-66b5b9214b1        5127.8
MDEV-26603-24541d06140          5150.4
 
 
sysbench-tpcc @ 32 thd, scale 160x10 (io bound)
 
version                         TPS
--------------------------------------
mariadb-10.7.3                  804.88
mariadb-10.8-0366fb5f54a        814.46
mariadb-10.9-66b5b9214b1        860.87
MDEV-26603-24541d06140          772.41
 
 
sysbench-tpcc @ 32 thd, scale 160x10 (io bound)
relaxed flushing (innodb_flush_log_at_trx_commit=0)
 
version                         TPS
--------------------------------------
mariadb-10.9-66b5b9214b1        924.90
MDEV-26603-24541d06140          884.13

Comment by Marko Mäkelä [ 2022-03-17 ]

I conducted some benchmarks on my own, comparing preview-10.9-MDEV-26603-async-redo-write to its parent, both with a merge of MDEV-28111, to enable O_DIRECT writes of the redo log by default. The storage was NVMe with ext4fs and io_uring on Linux 5.16.12.

In a test of LRU eviction flushing (innodb_buffer_pool_size=1G, innodb_log_file_size=4G, innodb_flush_log_at_trx_commit=0, a few gigabytes of working set), I observed an improvement:

version throughput/tps latency/ms
baseline 6357.62 26.20
async 7022.63 23.95

In all other cases, I observed a small performance regression, for all 3 values of innodb_flush_log_at_trx_commit.

This is a dual-socket system, and I tried to use enough threads to have some cross-NUMA traffic. It could be wise to revise the changes that were made to write_lock and flush_lock.

Comment by Marko Mäkelä [ 2022-04-01 ]

I tried to chase down the performance regression for the basic case (not the LRU flushing case where I was able to reproduce a 10% improvement).

I do not know what I am doing wrong this time, but my notes from MDEV-26004 did not quite work: I always got "unknown" symbols reported for mariadbd in the offcputime-bpfcc flame graphs, no matter what I tried. The symbols in my Linux 5.17.0 kernel resolved just fine.

On both NVMe with ext4fs, and RAM with tmpfs, I consistently saw a performance regression.

With libaio the regression on NVMe was smaller. With io_uring, there were kernel stack traces with schedule_timeout being called by io_wqe_worker when using the async redo log write.

To avoid the "fake PMEM" code path of MDEV-14425, which is not affected by these changes, I placed the data directory in /run/user/$UID instead of /dev/shm. With tmpfs and simulated AIO, the off-CPU time flame graphs indicated some increased waiting on a futex. According to the performance schema, the async log write reduced waits on log_sys.latch and somewhat increased waits on some MDL. I tried disabling the backup locks, but the performance regression remained.

I tried innodb_force_recovery=2 to disable the purge of transaction history, and innodb_undo_tablespaces=100 to reduce contention on the fil_space_t::latch of the system tablespace. The regression became somewhat smaller but remained. Nothing in the performance_schema output really stood out.

I tested a simple Sysbench oltp_update_index at 10 threads for 10 or 30 seconds, with innodb_flush_log_at_trx_commit=1 and a large enough buffer pool and log file so that no checkpoint flushing will take place. The performance regression was prominent even in such tiny runs.

wlad, unless you are able to figure out the cause of the regression, I am afraid that we must put this on hold.

Comment by Vladislav Vaintroub [ 2022-04-01 ]

No, I'm not able to figure out the cause, but I am attaching the flame graphs for posterity. The CPU flame graph shows more of log_write_up_to being outsourced to a background thread pool thread, which is theoretically a good thing.

Comment by Vladislav Vaintroub [ 2022-04-01 ]

Since nobody is interested in pushing this into 10.9 or in performing more Linux-specific benchmarking and analysis, I am adjusting the priority accordingly.

Comment by Marko Mäkelä [ 2022-04-14 ]

I think that MDEV-28313 may have addressed the unexpected regression. If wlad and axel can confirm that there no longer is any performance regression, I think that this is OK to push. I did not measure the LRU flushing scenario, but I have no reason to believe that the previously observed performance gain would disappear in that case.

Part of MDEV-21423 (scalability bottleneck due to trx_sys.rw_trx_hash) was addressed by MDEV-28313. I tested the following with a quick benchmark (see MDEV-21423 for the description).

version 20 40 80 160 320 640
patched 171236.52 192544.82 115298.23 151579.70 146904.22 151999.43
10.9+MDEV-28313 169572.38 191460.20 137424.75 137625.91 141308.08 151053.27
10.9 167970.59 190672.82 142246.60 138316.21 136684.18 104582.04

Note: The 10.9 baseline is 2 commits older than the merge of MDEV-28313, but I do not think it matters for this benchmark.
As noted in MDEV-21423, the figures for 80 concurrent threads are unreliable, because a checkpoint flush would have occurred during it.

Comment by Marko Mäkelä [ 2022-04-14 ]

I realized that unlike for MDEV-28313, this work is best tested with innodb_flush_log_at_trx_commit=1. So, I reran the benchmark:

version 20 40 80 160 320 640
patched 39524.83 82090.86 154108.89 152452.54 128073.02 131764.78
10.9+MDEV-28313 43330.10 87049.76 151794.40 151358.43 127131.37 131002.28
10.9 44416.55 86055.97 151811.06 140709.94 128494.92 132801.09

Except for the lowest concurrency, it is actually looking good.

This 30-second benchmark is of course too short to draw any real conclusion, but it does not look too bad. For the 10.9 baseline, the checkpoint flush occurred while the test was running at 160 concurrent users. For the 640 concurrent users, I restarted a new sysbench prepare and run, as with all the recent benchmarks. That is why it is showing slightly better throughput than the 320-user test.

Comment by Marko Mäkelä [ 2022-04-20 ]

To assess the impact of MDEV-28313, I repeated a quick Sysbench 8×100,000-row oltp_update_index test without MDEV-28313 and with innodb_flush_log_at_trx_commit=1.
The version column legend is the same as in the previous comments, except for the introduction of 10.9+merge of async, which is the same as patched but without the MDEV-28313 changes.

version 20 40 80 160 320 640
10.9+merge of async 40062.04 82227.38 154505.53 149740.18 123871.06 131360.35
10.9 42809.14 87178.33 152955.76 151528.31 124043.59 131941.35

We can observe an insignificant improvement at 80 concurrent connections (which is "polluted" by the checkpoint flush that occurred during that test), and otherwise a performance regression or no improvement.
This 30-second benchmark run is too short to draw any definite conclusion. The bottom row ("10.9") of this table is from a setup equivalent to that of the bottom row of the previous table.

One interesting change is that with the MDEV-28313 change included, we saw a slight improvement at 160 concurrent connections, but without it, we can observe a regression.
I reran the test in a different way (prepare+run 30 seconds with 80 clients, prepare+run 30 seconds with 160 clients) to gain some more confidence:

version 80 160
10.9+merge of async 151006.08 154541.20
10.9 150857.26 158157.07

This time, no checkpoint flushing occurred during the 80-client run, and we see no significant improvement. The clear regression at 160 clients remained.

The counterintuitive performance regression could partly be addressed by MDEV-28313. With the test oltp_update_non_index, performance problems related to the lock-free hash table trx_sys.rw_trx_hash (MDEV-21423) should matter less:

version 20 40 80 160 320 640
10.9+merge of async 38514.14 89237.51 167100.82 192394.20 189902.25 193034.80
10.9 42022.65 97957.34 169509.91 187099.23 191413.50 199397.91

Traversal of the entire trx_sys.rw_trx_hash table is necessary not only for checking locks on secondary indexes, but also for read view creation. Let us additionally specify --transaction-isolation=READ-UNCOMMITTED to reduce that activity (purge_sys.view must still be updated), and test it also with the MDEV-28313 improvements:

version 20 40 80 160 320 640
patched 38794.57 89714.06 168784.54 191521.04 189094.02 192025.07
10.9+MDEV-28313 41801.26 97290.21 170614.28 187754.89 196493.73 197833.70
10.9+merge of async 38383.13 89040.30 168254.81 192200.60 195663.06 193661.70
10.9 43503.02 98087.92 169159.82 189859.61 194903.15 199073.90

In this scenario with reduced activity around trx_sys.rw_trx_hash, MDEV-28313 should matter less, that is, the difference between the 2nd and 4th row should be mostly noise. However, we can still observe a consistent performance regression due to the asynchronous log writing.

We will need deeper analysis to identify the bottleneck that causes the counterintuitive performance regression. MDEV-21423 may or may not fix this. An artificial benchmark that concurrently updates a very large number of SEQUENCE objects (MDEV-10139) should completely rule out the InnoDB transaction subsystem, because operations on SEQUENCE objects only generate redo log, no undo log at all.

http://www.brendangregg.com/offcpuanalysis.html could be useful if it did not emit most call frames as "unknown" in my recent tests. I should investigate if https://github.com/iovisor/bcc/issues/3884 would fix that.

Comment by Marko Mäkelä [ 2022-06-07 ]

As noted in MDEV-28766, I repeated a test run after fixing the performance regression MDEV-28708. I still observe up to 10% performance regression at low numbers of concurrent connections after applying MDEV-26603. But, my test is probably way too small to draw any definite conclusion.

Comment by Marko Mäkelä [ 2023-04-28 ]

For the record, the redo log checkpoint used to be written asynchronously until the code was simplified in MariaDB Server 10.5.0.

I do not think that bringing it back would help much, but I thought that I would mention it for the sake of completeness.
