Details

    Description

      The most important use case is the thread pool: an asynchronous log write avoids blocking the thread that acts as the group commit leader.
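      A minimal sketch of the idea, assuming a completion-callback interface; the names RedoLog, log_write_async, and Pending are hypothetical, not MariaDB's actual API. The group commit leader registers the waiting transactions, submits the write, and returns to the thread pool instead of blocking until the log is durable:

      #include <cstdint>
      #include <functional>
      #include <mutex>
      #include <utility>
      #include <vector>

      // Hypothetical sketch, not MariaDB code: a redo log whose write
      // completion is reported through a callback, so the group commit
      // leader never blocks on the write or fsync.
      struct Pending
      {
        uint64_t lsn;                  // commit LSN the waiter needs durable
        std::function<void()> resume;  // resumes the committing connection
      };

      class RedoLog
      {
        std::mutex mu;
        std::vector<Pending> waiters;
      public:
        // Called by the group commit leader: register the waiters and
        // initiate the write, then return to the thread pool immediately.
        void log_write_async(uint64_t lsn, std::function<void()> on_durable)
        {
          {
            std::lock_guard<std::mutex> g(mu);
            waiters.push_back({lsn, std::move(on_durable)});
          }
          submit_io(lsn);
        }

        // I/O completion handler, run on any worker thread: resume every
        // transaction whose commit LSN has become durable.
        void on_write_complete(uint64_t durable_lsn)
        {
          std::vector<Pending> ready;
          {
            std::lock_guard<std::mutex> g(mu);
            for (auto it= waiters.begin(); it != waiters.end(); )
              if (it->lsn <= durable_lsn)
              {
                ready.push_back(std::move(*it));
                it= waiters.erase(it);
              }
              else
                ++it;
          }
          for (auto &p : ready)
            p.resume();
        }
      private:
        void submit_io(uint64_t) {}    // stand-in for an async write+flush
      };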

      Attachments

        1. base_cpu.svg
          606 kB
        2. base_offcpu.svg
          90 kB
        3. patch_cpu.svg
          710 kB
        4. patch_offcpu.svg
          90 kB
        5. sysbench.pdf
          25 kB
        6. tpcc4.pdf
          364 kB
        7. tpcc5.pdf
          192 kB

          Activity

            I realized that unlike for MDEV-28313, this work is best tested with innodb_flush_log_at_trx_commit=1: with that setting, every transaction commit waits for a durable log write, which is exactly the path that an asynchronous log write changes. So, I reran the benchmark:

            version \ clients  20        40        80         160        320        640
            patched            39524.83  82090.86  154108.89  152452.54  128073.02  131764.78
            10.9+MDEV-28313    43330.10  87049.76  151794.40  151358.43  127131.37  131002.28
            10.9               44416.55  86055.97  151811.06  140709.94  128494.92  132801.09

            Except for the lowest concurrency, it is actually looking good.

            This 30-second benchmark is of course too short to draw any real conclusion, but it does not look too bad. For the 10.9 baseline, the checkpoint flush occurred while the test was running at 160 concurrent users. For the 640-user test, I started a fresh sysbench prepare and run, as with all the recent benchmarks; that is why it shows slightly better throughput than the 320-user test.

            marko Marko Mäkelä added a comment

            To assess the impact of MDEV-28313, I repeated a quick Sysbench 8×100,000-row oltp_update_index test without MDEV-28313 and with innodb_flush_log_at_trx_commit=1.
            The version column legend is the same as in the previous comments, except for the introduction of 10.9+merge of async, which is the same as patched but without the MDEV-28313 changes.

            version \ clients    20        40        80         160        320        640
            10.9+merge of async  40062.04  82227.38  154505.53  149740.18  123871.06  131360.35
            10.9                 42809.14  87178.33  152955.76  151528.31  124043.59  131941.35

            We can observe an insignificant improvement at 80 concurrent connections (a result "polluted" by the checkpoint flush that occurred during that test), and otherwise a performance regression or no improvement.
            This 30-second benchmark run is too short to draw any definite conclusion. Note that the bottom row of this table was obtained with a setup equivalent to that of the bottom row of the previous table.

            One interesting difference is that with the MDEV-28313 change included, we saw a slight improvement at 160 concurrent connections, but without it, we observe a regression.
            I reran the test in a different way (prepare+run 30 seconds with 80 clients, then prepare+run 30 seconds with 160 clients) to gain more confidence:

            version \ clients    80         160
            10.9+merge of async  151006.08  154541.20
            10.9                 150857.26  158157.07

            This time, no checkpoint flushing occurred during the 80-client run, and we see no significant improvement. The clear regression at 160 clients remained.

            The counterintuitive performance regression could partly be addressed by MDEV-28313. With the test oltp_update_non_index, performance problems related to the lock-free hash table trx_sys.rw_trx_hash (MDEV-21423) should matter less:

            version \ clients    20        40        80         160        320        640
            10.9+merge of async  38514.14  89237.51  167100.82  192394.20  189902.25  193034.80
            10.9                 42022.65  97957.34  169509.91  187099.23  191413.50  199397.91

            Traversal of the entire trx_sys.rw_trx_hash table is necessary not only for checking locks on secondary indexes, but also for read view creation (see the sketch after the table below). Let us additionally specify --transaction-isolation=READ-UNCOMMITTED to reduce that activity (purge_sys.view must still be updated), and test it also with the MDEV-28313 improvements:

            version \ clients    20        40        80         160        320        640
            patched              38794.57  89714.06  168784.54  191521.04  189094.02  192025.07
            10.9+MDEV-28313      41801.26  97290.21  170614.28  187754.89  196493.73  197833.70
            10.9+merge of async  38383.13  89040.30  168254.81  192200.60  195663.06  193661.70
            10.9                 43503.02  98087.92  169159.82  189859.61  194903.15  199073.90
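            To illustrate the traversal mentioned above, here is a minimal sketch; ActiveTrxSet and make_read_view are hypothetical stand-ins for trx_sys.rw_trx_hash and InnoDB's read view creation, not the actual code. The point is that every new read view must copy the set of active transactions, while READ UNCOMMITTED skips the snapshot entirely:

            #include <cstdint>
            #include <mutex>
            #include <unordered_set>
            #include <vector>

            // Hypothetical stand-in for the lock-free trx_sys.rw_trx_hash.
            struct ActiveTrxSet
            {
              std::mutex mu;                    // the real structure is lock-free
              std::unordered_set<uint64_t> ids; // active read-write transactions
              template <typename F> void for_each(F f)
              {
                std::lock_guard<std::mutex> g(mu);
                for (uint64_t id : ids)
                  f(id);                        // full traversal: O(active trx)
              }
            };

            struct ReadView
            {
              uint64_t up_to;                   // trx IDs below this may be visible
              std::vector<uint64_t> invisible;  // changes of these stay invisible
            };

            // At REPEATABLE READ or READ COMMITTED, every new read view copies
            // the IDs of all concurrently active transactions, traversing the
            // whole set. At READ UNCOMMITTED no snapshot is taken, so this cost
            // (and the implicit-lock checks on secondary index records, which
            // consult the same structure) largely disappears.
            ReadView make_read_view(ActiveTrxSet &trx, uint64_t max_trx_id)
            {
              ReadView v{max_trx_id, {}};
              trx.for_each([&](uint64_t id) { v.invisible.push_back(id); });
              return v;
            }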

            In this scenario with reduced activity around trx_sys.rw_trx_hash, MDEV-28313 should matter less; that is, the difference between the 10.9+MDEV-28313 and plain 10.9 rows should be mostly noise. However, we can still observe a consistent performance regression due to the asynchronous log writing.

            We will need deeper analysis to identify the bottleneck that causes the counterintuitive performance regression. MDEV-21423 may or may not fix this. An artificial benchmark that concurrently updates a very large number of SEQUENCE objects (MDEV-10139) should completely rule out the InnoDB transaction subsystem, because operations on SEQUENCE objects only generate redo log, no undo log at all.

            Off-CPU analysis (http://www.brendangregg.com/offcpuanalysis.html) could be useful, but in my recent tests it emitted most call frames as "unknown". I should investigate whether https://github.com/iovisor/bcc/issues/3884 would fix that.

            marko Marko Mäkelä added a comment

            As noted in MDEV-28766, I repeated a test run after fixing the performance regression MDEV-28708. I still observe up to a 10% performance regression at low numbers of concurrent connections after applying MDEV-26603. But my test is probably way too small to draw any definite conclusion.

            marko Marko Mäkelä added a comment

            For the record, the redo log checkpoint used to be written asynchronously until the code was simplified in MariaDB Server 10.5.0.

            I do not think that bringing it back would help much, but I thought I would mention it for the sake of completeness.

            marko Marko Mäkelä added a comment

            I spent some time merging the changes from 10.9 to 11.0; I think 10.11 would have been the same in terms of conflicts (many of them due to MDEV-33379, which reminded me of this task). I hit a fundamental conflict:

            ulint buf_flush_LRU(ulint max_n, bool evict)
            {
              mysql_mutex_assert_owner(&buf_pool.mutex);

            <<<<<<< HEAD
              flush_counters_t n;
              buf_do_LRU_batch(max_n, evict, &n);
            ||||||| 10d9b890b0f
              log_buffer_flush_to_disk();
            =======
              log_buffer_flush_to_disk_async();
            >>>>>>> fbf8646335280150a6ecf5727effb1a719f26b22

              ulint pages= n.flushed;

              if (n.evicted)

            This was the only invocation of an asynchronous log write if we ignore the rare special case of innodb_undo_log_truncate=ON in mtr_t::commit_shrink(). The call to the synchronous log write had been removed in MDEV-26055 when we made the buf_flush_page_cleaner() thread spend the rest of its innodb_io_capacity per-second ‘budget’ on LRU eviction flushing.
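            For reference, here is a sketch of the conflict resolved in favour of HEAD, with the InnoDB declarations stubbed out so the fragment stands alone; in the resolved code there is simply no call site left for log_buffer_flush_to_disk_async():

            #include <cstddef>

            // Stubs standing in for the real InnoDB declarations.
            typedef size_t ulint;
            struct flush_counters_t { ulint flushed, evicted; };
            static void buf_do_LRU_batch(ulint, bool, flush_counters_t *n)
            { n->flushed= 0; n->evicted= 0; }
            static struct { int mutex; } buf_pool;
            #define mysql_mutex_assert_owner(m) (void) (m)

            ulint buf_flush_LRU(ulint max_n, bool evict)
            {
              mysql_mutex_assert_owner(&buf_pool.mutex);

              flush_counters_t n;
              buf_do_LRU_batch(max_n, evict, &n); // MDEV-26055: no log write here

              ulint pages= n.flushed;

              if (n.evicted)
              { /* LRU eviction accounting, elided */ }
              return pages;
            }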

            It does not seem feasible to pursue this further.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 10

