Details

    Description

      The most important use case is the thread pool: this avoids blocking a pool thread for the group commit lead.
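
      A minimal standalone sketch of what this means, with hypothetical names that are not the actual InnoDB declarations (only the rough shape of log_write_up_to() taking a completion_callback pointer is taken from the assertion message quoted later in this issue): instead of the group commit lead sleeping inside the log write, the write takes a callback, so a thread-pool worker can return to other work and the commit is completed from the I/O completion.

      // Sketch only; simplified stand-ins for the real types.
      #include <cstdint>
      #include <functional>
      #include <future>
      #include <iostream>
      #include <thread>

      using lsn_t= std::uint64_t;

      struct completion_callback                     // hypothetical
      {
        std::function<void(bool /*durable*/)> func;
      };

      // Pretend asynchronous write: the I/O runs elsewhere and the callback
      // fires on completion. A real implementation would submit AIO instead.
      static void log_write_up_to(lsn_t lsn, bool durable,
                                  const completion_callback *cb)
      {
        std::thread([=] {
          // ... write and (optionally) durably flush the log up to lsn ...
          if (cb)
            cb->func(durable);                       // the lead thread never blocked
        }).detach();
      }

      int main()
      {
        std::promise<void> done;
        completion_callback cb{[&](bool durable) {
          std::cout << "group commit completed, durable=" << durable << '\n';
          done.set_value();
        }};
        log_write_up_to(/*lsn=*/42, /*durable=*/true, &cb);
        // A thread-pool worker would pick up other work here instead of waiting.
        done.get_future().wait();                    // only for the demo's shutdown
      }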

      Attachments

        1. base_cpu.svg
          606 kB
          Vladislav Vaintroub
        2. base_offcpu.svg
          90 kB
          Vladislav Vaintroub
        3. patch_cpu.svg
          710 kB
          Vladislav Vaintroub
        4. patch_offcpu.svg
          90 kB
          Vladislav Vaintroub
        5. sysbench.pdf
          25 kB
          Axel Schwenke
        6. tpcc4.pdf
          364 kB
          Axel Schwenke
        7. tpcc5.pdf
          192 kB
          Axel Schwenke

        Issue Links

          Activity

            wlad Vladislav Vaintroub created issue -
            wlad Vladislav Vaintroub made changes -
            Field Original Value New Value
            Description The most important use case is for threadpool - this avoid avoid blocking for the group commit lead, too → The most important use case is for threadpool - this avoids blocking the thread, for the group commit lead.
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.7 [ 24805 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 125030 ] MariaDB v4 [ 131554 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            wlad Vladislav Vaintroub made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -
            wlad Vladislav Vaintroub made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -

            While working on MDEV-27812, I noticed that in MDEV-14425, I accidentally removed checks that the log writes succeed (MDEV-27916). Similar error handling must be implemented also for the asynchronous log write code path.

            Can you try to merge this with the work-in-progress MDEV-27812 branch?

            marko Marko Mäkelä added a comment -
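
            A minimal standalone sketch of the error-handling point above (hypothetical names, not the actual InnoDB code): the asynchronous completion path has to carry the I/O result, so that a failed write is reported to the waiter instead of being silently treated as successful.

            #include <cstddef>
            #include <cstdio>
            #include <functional>

            enum class write_result { SUCCESS, FAILED };

            struct completion_callback                    // hypothetical
            {
              std::function<void(write_result)> func;
            };

            // Hypothetical AIO completion handler: compare what was written
            // with what was requested and forward the outcome to the waiter.
            static void on_log_write_complete(std::size_t written,
                                              std::size_t requested,
                                              const completion_callback *cb)
            {
              const write_result r= written == requested
                ? write_result::SUCCESS : write_result::FAILED;
              if (r == write_result::FAILED)
                std::fputs("redo log write failed\n", stderr);
              if (cb)
                cb->func(r);      // the waiter sees the failure, just as the
                                  // synchronous code path must report it
            }

            int main()
            {
              completion_callback cb{[](write_result r) {
                std::printf("commit %s\n",
                            r == write_result::SUCCESS ? "durable" : "not durable");
              }};
              on_log_write_complete(512, 1024, &cb);      // simulate a short write
            }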
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            marko, this merges almost fine. I can resolve the remaining conflicts once MDEV-27812 is pushed.

            wlad Vladislav Vaintroub added a comment -
            wlad Vladislav Vaintroub made changes -
            Status Stalled [ 10000 ] In Testing [ 10301 ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Matthias Leich [ mleich ]

            mleich, some write-intensive testing would be good.

            wlad Vladislav Vaintroub added a comment -

            The build succeeded.
            The first ./mtr run fails during bootstrap with:
            mariadbd: /data/Server/bb-10.9-MDEV-26603-async-redo-write/storage/innobase/log/log0log.cc:776: void log_write_up_to(lsn_t, bool, const completion_callback*): Assertion `!callback' failed.

            mleich Matthias Leich added a comment -
            mleich Matthias Leich made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            mleich Matthias Leich made changes -
            Assignee Matthias Leich [ mleich ] Vladislav Vaintroub [ wlad ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Matthias Leich [ mleich ]

            Pardon mleich, it was a small glitch related to PMEM. I'd recommend also testing on a normal SSD/HDD; that is the primary target for the patch.

            wlad Vladislav Vaintroub added a comment -
            mleich Matthias Leich made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            mleich Matthias Leich made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]

            origin/bb-10.9-MDEV-26603-async-redo-write b78470cbb0335df848aab3074d0a0bffdd721fb8 2022-02-25T12:43:45+01:00
            behaved well in RQG testing.
            

            mleich Matthias Leich added a comment -
            mleich Matthias Leich made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            mleich Matthias Leich made changes -
            Assignee Matthias Leich [ mleich ] Vladislav Vaintroub [ wlad ]

            Now that MDEV-27812 has reached 10.9, this work needs to be rebased for the final review and testing.

            marko Marko Mäkelä added a comment -
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            wlad Vladislav Vaintroub added a comment - - edited

            Apparently, MDEV-27812 was redone in a way that is incompatible with what was done here, with respect to locking. I am reassigning this to marko to figure out what to do to fix the locking error highlighted here: https://buildbot.mariadb.org/#/builders/234/builds/6237


            Some added functions or parameters need to be documented.

            There is a design conflict with MDEV-27812 that has to be solved as part of this task. On a log checkpoint where the log resizing completes, both the current and the resized log file will have to be durably written to the very end (up to the current log_sys.lsn), before any files are closed or renamed. In the current form of the change, this is not the case. There is a race condition, write_lock is being released twice, and the completion of log resizing would not be crash-safe.

            marko Marko Mäkelä added a comment -
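
            The required ordering can be illustrated with a standalone POSIX sketch (hypothetical file names, not the actual resize code): both log files are written and fsync()ed to their end before any rename, and the directory is fsync()ed afterwards so that the rename itself is durable.

            #include <fcntl.h>
            #include <unistd.h>
            #include <cstddef>
            #include <cstdio>

            // Durably write one file to its end in this sketch.
            static bool write_and_sync(const char *name, const char *data, std::size_t len)
            {
              int fd= open(name, O_CREAT | O_WRONLY | O_TRUNC, 0600);
              if (fd < 0)
                return false;
              bool ok= pwrite(fd, data, len, 0) == (ssize_t) len && !fsync(fd);
              close(fd);
              return ok;
            }

            int main()
            {
              // 1. Both the current and the resized log file must be durable up
              //    to the latest LSN before anything is closed or renamed.
              if (!write_and_sync("ib_logfile_current", "log up to lsn\n", 14) ||
                  !write_and_sync("ib_logfile_resized", "log up to lsn\n", 14))
                return 1;
              // 2. Only then may the resized file replace the current one.
              if (rename("ib_logfile_resized", "ib_logfile_current"))
                return 1;
              // 3. Make the rename itself durable by syncing the directory.
              int dir= open(".", O_RDONLY | O_DIRECTORY);
              if (dir >= 0) { fsync(dir); close(dir); }
              unlink("ib_logfile_current");
            }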
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]

            I realized that when the log resizing completes, there is no need to write anything extra to the log files, apart from the checkpoint block headers. That simplification is the easiest way to resolve the conflict. The log resizing of course needs to be extensively tested again, along with normal recovery and backup testing.

            I pushed some suggested cleanup. Please review it and submit a squashed commit for my final review. I did not yet review all of it. What I read so far looks OK to me.

            marko Marko Mäkelä added a comment -

            This looks good. Could you please try to revive the log_flush_task that was removed in MDEV-25948? That is, check if adding an asynchronous log write between buf_flush_page_cleaner() batches could reduce the latency of checkpoint flushing.

            marko Marko Mäkelä added a comment -

            I tried to pre-flush the log with the code below. It did not really make any measurable difference in testing.

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index bbbd26bd8b4..265aff708d9 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1514,6 +1514,10 @@ static ulint buf_flush_list(ulint max_n= ULINT_UNDEFINED, lsn_t lsn= LSN_MAX)
               if (buf_pool.n_flush_list())
                 return 0;
            +  /* Initiate redo log writing, if not already in progress,
            +  to speed up possible synchronous log writes later on.*/
            +  log_buffer_flush_to_disk_async();
            +
               mysql_mutex_lock(&buf_pool.mutex);
               const bool running= buf_pool.n_flush_list_ != 0;
               /* FIXME: we are performing a dirty read of buf_pool.flush_list.count
            

            wlad Vladislav Vaintroub added a comment -
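
            For the record, a standalone sketch of the "initiate, if not already in progress" semantics described by the comment in the diff above (hypothetical names, not the real log_buffer_flush_to_disk_async()): an atomic flag lets the first caller submit the write while concurrent callers return immediately instead of queuing duplicate requests.

            #include <atomic>
            #include <cstdio>

            static std::atomic<bool> write_in_progress{false};

            static void submit_async_log_write()          // hypothetical AIO hand-off
            {
              std::puts("async redo log write submitted");
            }

            static void flush_to_disk_async_sketch()
            {
              bool expected= false;
              // Only the first caller initiates the write; everyone else returns at once.
              if (write_in_progress.compare_exchange_strong(expected, true))
                submit_async_log_write();
            }

            static void on_async_log_write_complete()     // completion path clears the flag
            {
              write_in_progress.store(false);
            }

            int main()
            {
              flush_to_disk_async_sketch();               // initiates
              flush_to_disk_async_sketch();               // no-op: already in progress
              on_async_log_write_complete();
            }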
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Testing [ 10301 ]
            marko Marko Mäkelä made changes -
            Assignee Vladislav Vaintroub [ wlad ] Matthias Leich [ mleich ]
            axel Axel Schwenke made changes -
            Attachment sysbench.pdf [ 62547 ]
            axel Axel Schwenke added a comment -

            Performance testing with sysbench shows no improvement: sysbench.pdf


            There are some slight signs of improvement, especially at low concurrency.

            I think what's missing in testing, for the full picture, is probably:

            • innodb_flush_method=O_DSYNC, or ask marko how to change the code to use O_DIRECT for redo log writes by default; for some unclear/undocumented reason, it is not yet used on Linux. This is especially relevant when using libaio, because according to widespread rumors it does not work well with cached files.
            • thread_handling=pool-of-threads

            I think it is enough to test only update_index and update_non_index with the above variables. The O_DIRECT default can be reconsidered in general, though this is up to @marko.

            wlad Vladislav Vaintroub added a comment -

            Yes, after MDEV-27774 removed a bottleneck, it should be worth trying to enable O_DIRECT on the Linux redo log by default.

            marko Marko Mäkelä added a comment -
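
            For context, a standalone sketch of what enabling O_DIRECT for the redo log implies at the POSIX level on Linux (not MariaDB code): the file is opened with O_DIRECT, and both the buffer address and the I/O size must be suitably aligned (4096 bytes is assumed here), otherwise pwrite() fails with EINVAL.

            #ifndef _GNU_SOURCE
            #define _GNU_SOURCE                            /* for O_DIRECT */
            #endif
            #include <fcntl.h>
            #include <unistd.h>
            #include <cstdio>
            #include <cstdlib>
            #include <cstring>

            int main()
            {
              const size_t block= 4096;                    // assumed alignment
              int fd= open("ib_logfile_demo", O_CREAT | O_WRONLY | O_DIRECT, 0600);
              if (fd < 0) { perror("open"); return 1; }

              void *buf= nullptr;
              if (posix_memalign(&buf, block, block)) { close(fd); return 1; }
              memset(buf, 0, block);                       // one aligned log block

              if (pwrite(fd, buf, block, 0) != (ssize_t) block)
                perror("pwrite");                          // unaligned I/O would fail here

              free(buf);
              close(fd);
              unlink("ib_logfile_demo");
            }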
            axel Axel Schwenke made changes -
            Attachment tpcc4.pdf [ 62560 ]
            Attachment tpcc5.pdf [ 62561 ]
            axel Axel Schwenke added a comment - - edited

            Performance testing with sysbench-tpcc shows no improvement either. For the CPU-bound case, MDEV-26603 is on par with the 10.9 baseline. For the IO-bound case it falls behind: tpcc4.pdf

            Using async log flushing has the same problem. The general numbers increase, but the baseline is still faster: tpcc5.pdf

            IO-wise, this test was done with the default setting for innodb_flush_method, which should be O_DIRECT.

            axel Axel Schwenke added a comment -

            In numbers

            sysbench-tpcc @ 32 thd, scale 64x10 (cpu bound)
             
            version                         TPS
            --------------------------------------
            mariadb-10.7.3                  3729.4
            mariadb-10.8-0366fb5f54a        3799.9
            mariadb-10.9-66b5b9214b1        5127.8
            MDEV-26603-24541d06140          5150.4
             
             
            sysbench-tpcc @ 32 thd, scale 160x10 (io bound)
             
            version                         TPS
            --------------------------------------
            mariadb-10.7.3                  804.88
            mariadb-10.8-0366fb5f54a        814.46
            mariadb-10.9-66b5b9214b1        860.87
            MDEV-26603-24541d06140          772.41
             
             
            sysbench-tpcc @ 32 thd, scale 160x10 (io bound)
            relaxed flushing (innodb_flush_log_at_trx_commit=0)
             
            version                         TPS
            --------------------------------------
            mariadb-10.9-66b5b9214b1        924.90
            MDEV-26603-24541d06140          884.13
            

            serg Sergei Golubchik made changes -

            I conducted some benchmarks on my own, comparing preview-10.9-MDEV-26603-async-redo-write to its parent, both with a merge of MDEV-28111, to enable O_DIRECT writes of the redo log by default. The storage was NVMe with ext4fs and io_uring on Linux 5.16.12.

            In a test of LRU eviction flushing (innodb_buffer_pool_size=1G, innodb_log_file_size=4G, innodb_flush_log_at_trx_commit=0, a few gigabytes of working set), I observed an improvement:

            version   throughput/tps  latency/ms
            baseline  6357.62         26.20
            async     7022.63         23.95

            In all other cases, I observed a small performance regression, for all 3 values of innodb_flush_log_at_trx_commit.

            This is a dual socket system, and I tried to use enough threads to have some cross NUMA traffic. It could be wise to revise the changes that were made to write_lock and flush_lock.

            marko Marko Mäkelä added a comment -

            I tried to chase down the performance regression for the basic case (not the LRU flushing case where I was able to reproduce a 10% improvement).

            I do not know what I am doing wrong this time, but my notes from MDEV-26004 did not quite work: I always got "unknown" symbols reported for mariadbd in the offcputime-bpfcc flame graphs, no matter what I tried. The symbols in my Linux 5.17.0 kernel resolved just fine.

            On both NVMe with ext4fs, and RAM with tmpfs, I consistently saw a performance regression.

            With libaio the regression on NVMe was smaller. With io_uring, there were kernel stack traces with schedule_timeout being called by io_wqe_worker when using the async redo log write.

            To avoid the "fake PMEM" code path of MDEV-14425, which is not affected by these changes, I placed the data directory in /run/user/$UID instead of /dev/shm. With tmpfs and simulated AIO, the offcputime flame graphs indicated some increased waiting on a futex. According to the performance schema, the async log write reduced waits on log_sys.latch but somewhat increased waits on some MDL. I tried disabling the backup locks, but the performance regression remained.

            I tried innodb_force_recovery=2 to disable the purge of transaction history, and innodb_undo_tablespaces=100 to reduce contention on the fil_space_t::latch of the system tablespace. The regression became somewhat smaller but remained. Nothing in the performance_schema output really stood out.

            I tested a simple Sysbench oltp_update_index at 10 threads for 10 or 30 seconds, with innodb_flush_log_at_trx_commit=1 and a buffer pool and log file large enough that no checkpoint flushing would take place. The performance regression was prominent even in such tiny runs.

            wlad, unless you are able to figure out the cause of the regression, I am afraid that we must put this on hold.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee Matthias Leich [ mleich ] Vladislav Vaintroub [ wlad ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            wlad Vladislav Vaintroub made changes -
            Attachment base_cpu.svg [ 62801 ]
            Attachment patch_cpu.svg [ 62802 ]
            wlad Vladislav Vaintroub made changes -
            Attachment base_offcpu.svg [ 62803 ]
            Attachment patch_offcpu.svg [ 62804 ]

            No, I'm not able to figure out the cause, but I attach the flame graphs for posterity. The CPU flame graph shows more of log_write_up_to being outsourced to a background threadpool thread, which is theoretically a good thing.

            wlad Vladislav Vaintroub added a comment -
            wlad Vladislav Vaintroub made changes -
            Priority Critical [ 2 ] Minor [ 4 ]
            wlad Vladislav Vaintroub added a comment - - edited

            Since nobody is interested in pushing this into 10.9 or in performing more Linux-specific benchmarking and analysis, I am adjusting the priority accordingly.

            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.10 [ 27530 ]

            I think that MDEV-28313 may have addressed the unexpected regression. If wlad and axel can confirm that there no longer is any performance regression, I think that this is OK to push. I did not measure the LRU flushing scenario, but I have no reason to believe that the previously observed performance gain would disappear in that case.

            Part of MDEV-21423 (scalability bottleneck due to trx_sys.rw_trx_hash) was addressed by MDEV-28313. I tested the following with a quick benchmark (see MDEV-21423 for the description).

            version          20         40         80         160        320        640
            patched          171236.52  192544.82  115298.23  151579.70  146904.22  151999.43
            10.9+MDEV-28313  169572.38  191460.20  137424.75  137625.91  141308.08  151053.27
            10.9             167970.59  190672.82  142246.60  138316.21  136684.18  104582.04

            Note: The 10.9 baseline is 2 commits older than the merge of MDEV-28313, but I do not think it matters for this benchmark.
            As noted in MDEV-21423, the figures for 80 concurrent threads are unreliable, because a checkpoint flush would have occurred during it.

            marko Marko Mäkelä added a comment -

            I realized that unlike for MDEV-28313, this work is best tested with innodb_flush_log_at_trx_commit=1. So, I reran the benchmark:

            version          20        40        80         160        320        640
            patched          39524.83  82090.86  154108.89  152452.54  128073.02  131764.78
            10.9+MDEV-28313  43330.10  87049.76  151794.40  151358.43  127131.37  131002.28
            10.9             44416.55  86055.97  151811.06  140709.94  128494.92  132801.09

            Except for the lowest concurrency, it is actually looking good.

            This 30-second benchmark is of course too short to draw any real conclusion, but it does not look too bad. For the 10.9 baseline, the checkpoint flush occurred while the test was running at 160 concurrent users. For the 640 concurrent users, I restarted a new sysbench prepare and run, as with all the recent benchmarks. That is why it is showing slightly better throughput than the 320-user test.

            marko Marko Mäkelä added a comment -

            To assess the impact of MDEV-28313, I repeated a quick Sysbench 8×100,000-row oltp_update_index test without MDEV-28313 and with innodb_flush_log_at_trx_commit=1.
            The version column legend is the same as in the previous comments, except for the introduction of 10.9+merge of async, which is the same as patched but without the MDEV-28313 changes.

            version              20        40        80         160        320        640
            10.9+merge of async  40062.04  82227.38  154505.53  149740.18  123871.06  131360.35
            10.9                 42809.14  87178.33  152955.76  151528.31  124043.59  131941.35

            We can observe some insignificant improvement at 80 concurrent connections (which is "polluted" by the checkpoint flush that occurred during that test), and otherwise a performance regression or no improvement.
            This 30-second benchmark run is too short to draw any definite conclusion. The bottom line of the table is actually from a setup equivalent to the bottom line of the previous table.

            One interesting change is that with the MDEV-28313 change included, we saw a slight improvement at 160 concurrent connections, but without it, we can observe a regression.
            I reran the test in a different way (prepare+run 30 seconds with 80 clients, prepare+run 30 seconds with 160 clients) to gain some more confidence:

            version              80         160
            10.9+merge of async  151006.08  154541.20
            10.9                 150857.26  158157.07

            This time, no checkpoint flushing occurred during the 80-client run, and we see no significant improvement. The clear regression at 160 clients remained.

            The counterintuitive performance regression could partly be addressed by MDEV-28313. With the test oltp_update_non_index, performance problems related to the lock-free hash table trx_sys.rw_trx_hash (MDEV-21423) should matter less:

            version              20        40        80         160        320        640
            10.9+merge of async  38514.14  89237.51  167100.82  192394.20  189902.25  193034.80
            10.9                 42022.65  97957.34  169509.91  187099.23  191413.50  199397.91

            Traversal of the entire trx_sys.rw_trx_hash table is necessary not only for checking locks on secondary indexes, but also for read view creation. Let us additionally specify --transaction-isolation=READ-UNCOMMITTED to reduce that activity (purge_sys.view must still be updated), and test it also with the MDEV-28313 improvements:

            version              20        40        80         160        320        640
            patched              38794.57  89714.06  168784.54  191521.04  189094.02  192025.07
            10.9+MDEV-28313      41801.26  97290.21  170614.28  187754.89  196493.73  197833.70
            10.9+merge of async  38383.13  89040.30  168254.81  192200.60  195663.06  193661.70
            10.9                 43503.02  98087.92  169159.82  189859.61  194903.15  199073.90

            In this scenario with reduced activity around trx_sys.rw_trx_hash, MDEV-28313 should matter less, that is, the difference between the 2nd and 4th row should be mostly noise. However, we can still observe a consistent performance regression due to the asynchronous log writing.

            We will need deeper analysis to identify the bottleneck that causes the counterintuitive performance regression. MDEV-21423 may or may not fix this. An artificial benchmark that concurrently updates a very large number of SEQUENCE objects (MDEV-10139) should completely rule out the InnoDB transaction subsystem, because operations on SEQUENCE objects only generate redo log, no undo log at all.

            http://www.brendangregg.com/offcpuanalysis.html could be useful if it did not emit most call frames as "unknown" in my recent tests. I should investigate if https://github.com/iovisor/bcc/issues/3884 would fix that.

            marko Marko Mäkelä added a comment -

            As noted in MDEV-28766, I repeated a test run after fixing the performance regression MDEV-28708. I still observe up to 10% performance regression at low numbers of concurrent connections after applying MDEV-26603. But, my test is probably way too small to draw any definite conclusion.

            marko Marko Mäkelä added a comment -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 10.10 [ 27530 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.12 [ 28320 ]
            Fix Version/s 10.11 [ 27614 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_removed_10.9
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.0 [ 28320 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_removed_10.9 Preview_10.9 Preview_11.1 Preview_removed_10.9
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_10.9 Preview_11.1 Preview_removed_10.9 Preview_10.9 Preview_11.1
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_10.9 Preview_11.1 Preview_10.9
            marko Marko Mäkelä made changes -

            For the record, the redo log checkpoint used to be written asynchronously until the code was simplified in MariaDB Server 10.5.0.

            I do not think that bringing it back would help much, but I thought that I would mention for the sake of completeness.

            marko Marko Mäkelä added a comment -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.3 [ 28565 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.5 [ 29506 ]
            Fix Version/s 11.4 [ 29301 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Issue Type Task [ 3 ] New Feature [ 2 ]

            I spent some time merging the changes from 10.9 to 11.0; I think 10.11 would have been the same in terms of conflicts (many of them due to MDEV-33379, which reminded me of this task). I hit a fundamental conflict:

            ulint buf_flush_LRU(ulint max_n, bool evict)
            {
              mysql_mutex_assert_owner(&buf_pool.mutex);
             
            <<<<<<< HEAD
              flush_counters_t n;
              buf_do_LRU_batch(max_n, evict, &n);
            ||||||| 10d9b890b0f
              log_buffer_flush_to_disk();
            =======
              log_buffer_flush_to_disk_async();
            >>>>>>> fbf8646335280150a6ecf5727effb1a719f26b22
             
              ulint pages= n.flushed;
             
              if (n.evicted)
            

            This was the only invocation of an asynchronous log write if we ignore the rare special case of innodb_undo_log_truncate=ON in mtr_t::commit_shrink(). The call to the synchronous log write had been removed in MDEV-26055 when we made the buf_flush_page_cleaner() thread spend the rest of its innodb_io_capacity per-second ‘budget’ on LRU eviction flushing.

            It does not seem feasible to pursue this further.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2024-02-20 15:35:24.0 2024-02-20 15:35:24.09
            marko Marko Mäkelä made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 11.5 [ 29506 ]
            Resolution Won't Fix [ 2 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved:
