Details

    Description

      The most important use case is the thread pool: this avoids blocking a pool thread for the group commit lead.
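
      A minimal standalone sketch of what this means, with hypothetical names that are not the actual InnoDB declarations (only the rough shape of log_write_up_to() taking a completion_callback pointer is taken from the assertion message quoted later in this issue): instead of the group commit lead sleeping inside the log write, the write takes a callback, so a thread-pool worker can return to other work and the commit is completed from the I/O completion.

      // Sketch only; simplified stand-ins for the real types.
      #include <cstdint>
      #include <functional>
      #include <future>
      #include <iostream>
      #include <thread>

      using lsn_t= std::uint64_t;

      struct completion_callback                     // hypothetical
      {
        std::function<void(bool /*durable*/)> func;
      };

      // Pretend asynchronous write: the I/O runs elsewhere and the callback
      // fires on completion. A real implementation would submit AIO instead.
      static void log_write_up_to(lsn_t lsn, bool durable,
                                  const completion_callback *cb)
      {
        std::thread([=] {
          // ... write and (optionally) durably flush the log up to lsn ...
          if (cb)
            cb->func(durable);                       // the lead thread never blocked
        }).detach();
      }

      int main()
      {
        std::promise<void> done;
        completion_callback cb{[&](bool durable) {
          std::cout << "group commit completed, durable=" << durable << '\n';
          done.set_value();
        }};
        log_write_up_to(/*lsn=*/42, /*durable=*/true, &cb);
        // A thread-pool worker would pick up other work here instead of waiting.
        done.get_future().wait();                    // only for the demo's shutdown
      }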

      Attachments

        1. base_cpu.svg
          606 kB
          Vladislav Vaintroub
        2. base_offcpu.svg
          90 kB
          Vladislav Vaintroub
        3. patch_cpu.svg
          710 kB
          Vladislav Vaintroub
        4. patch_offcpu.svg
          90 kB
          Vladislav Vaintroub
        5. sysbench.pdf
          25 kB
          Axel Schwenke
        6. tpcc4.pdf
          364 kB
          Axel Schwenke
        7. tpcc5.pdf
          192 kB
          Axel Schwenke

        Issue Links

          Activity

            wlad Vladislav Vaintroub created issue -
            wlad Vladislav Vaintroub made changes -
            Field Original Value New Value
            Description The most important use case is for threadpool - this avoid avoid blocking for the group commit lead, too → The most important use case is for threadpool - this avoids blocking the thread, for the group commit lead.
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.7 [ 24805 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 125030 ] MariaDB v4 [ 131554 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            wlad Vladislav Vaintroub made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -
            wlad Vladislav Vaintroub made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -

            While working on MDEV-27812, I noticed that in MDEV-14425, I accidentally removed checks that the log writes succeed (MDEV-27916). Similar error handling must be implemented also for the asynchronous log write code path.

            Can you try to merge this with the work-in-progress MDEV-27812 branch?

            marko Marko Mäkelä added a comment -
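
            A minimal standalone sketch of the error-handling point above (hypothetical names, not the actual InnoDB code): the asynchronous completion path has to carry the I/O result, so that a failed write is reported to the waiter instead of being silently treated as successful.

            #include <cstddef>
            #include <cstdio>
            #include <functional>

            enum class write_result { SUCCESS, FAILED };

            struct completion_callback                    // hypothetical
            {
              std::function<void(write_result)> func;
            };

            // Hypothetical AIO completion handler: compare what was written
            // with what was requested and forward the outcome to the waiter.
            static void on_log_write_complete(std::size_t written,
                                              std::size_t requested,
                                              const completion_callback *cb)
            {
              const write_result r= written == requested
                ? write_result::SUCCESS : write_result::FAILED;
              if (r == write_result::FAILED)
                std::fputs("redo log write failed\n", stderr);
              if (cb)
                cb->func(r);      // the waiter sees the failure, just as the
                                  // synchronous code path must report it
            }

            int main()
            {
              completion_callback cb{[](write_result r) {
                std::printf("commit %s\n",
                            r == write_result::SUCCESS ? "durable" : "not durable");
              }};
              on_log_write_complete(512, 1024, &cb);      // simulate a short write
            }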
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            marko, this merges almost fine. I can resolve the remaining conflicts once MDEV-27812 is pushed.

            wlad Vladislav Vaintroub added a comment -
            wlad Vladislav Vaintroub made changes -
            Status Stalled [ 10000 ] In Testing [ 10301 ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Matthias Leich [ mleich ]

            mleich, some write-intensive testing would be good.

            wlad Vladislav Vaintroub added a comment -

            The build succeeded.
            The first ./mtr run fails during bootstrap with:
            mariadbd: /data/Server/bb-10.9-MDEV-26603-async-redo-write/storage/innobase/log/log0log.cc:776: void log_write_up_to(lsn_t, bool, const completion_callback*): Assertion `!callback' failed.

            mleich Matthias Leich added a comment -
            mleich Matthias Leich made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            mleich Matthias Leich made changes -
            Assignee Matthias Leich [ mleich ] Vladislav Vaintroub [ wlad ]
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Matthias Leich [ mleich ]

            Pardon mleich, it was a small glitch related to PMEM. I'd recommend also testing on a normal SSD/HDD; that is the primary target for the patch.

            wlad Vladislav Vaintroub added a comment -
            mleich Matthias Leich made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            mleich Matthias Leich made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]

            origin/bb-10.9-MDEV-26603-async-redo-write b78470cbb0335df848aab3074d0a0bffdd721fb8 2022-02-25T12:43:45+01:00
            behaved well in RQG testing.
            

            mleich Matthias Leich added a comment -
            mleich Matthias Leich made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            mleich Matthias Leich made changes -
            Assignee Matthias Leich [ mleich ] Vladislav Vaintroub [ wlad ]

            Now that MDEV-27812 has reached 10.9, this work needs to be rebased for the final review and testing.

            marko Marko Mäkelä added a comment -
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            wlad Vladislav Vaintroub added a comment - - edited

            Apparently, MDEV-27812 was redone in a way that is incompatible with what was done here, with respect to locking. I am reassigning this to marko to figure out what to do to fix the locking error highlighted here: https://buildbot.mariadb.org/#/builders/234/builds/6237


            Some added functions or parameters need to be documented.

            There is a design conflict with MDEV-27812 that has to be solved as part of this task. On a log checkpoint where the log resizing completes, both the current and the resized log file will have to be durably written to the very end (up to the current log_sys.lsn), before any files are closed or renamed. In the current form of the change, this is not the case. There is a race condition, write_lock is being released twice, and the completion of log resizing would not be crash-safe.

            marko Marko Mäkelä added a comment -
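
            The required ordering can be illustrated with a standalone POSIX sketch (hypothetical file names, not the actual resize code): both log files are written and fsync()ed to their end before any rename, and the directory is fsync()ed afterwards so that the rename itself is durable.

            #include <fcntl.h>
            #include <unistd.h>
            #include <cstddef>
            #include <cstdio>

            // Durably write one file to its end in this sketch.
            static bool write_and_sync(const char *name, const char *data, std::size_t len)
            {
              int fd= open(name, O_CREAT | O_WRONLY | O_TRUNC, 0600);
              if (fd < 0)
                return false;
              bool ok= pwrite(fd, data, len, 0) == (ssize_t) len && !fsync(fd);
              close(fd);
              return ok;
            }

            int main()
            {
              // 1. Both the current and the resized log file must be durable up
              //    to the latest LSN before anything is closed or renamed.
              if (!write_and_sync("ib_logfile_current", "log up to lsn\n", 14) ||
                  !write_and_sync("ib_logfile_resized", "log up to lsn\n", 14))
                return 1;
              // 2. Only then may the resized file replace the current one.
              if (rename("ib_logfile_resized", "ib_logfile_current"))
                return 1;
              // 3. Make the rename itself durable by syncing the directory.
              int dir= open(".", O_RDONLY | O_DIRECTORY);
              if (dir >= 0) { fsync(dir); close(dir); }
              unlink("ib_logfile_current");
            }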
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]

            I realized that when the log resizing completes, there is no need to write anything extra to the log files, apart from the checkpoint block headers. That simplification is the easiest way to resolve the conflict. The log resizing of course needs to be extensively tested again, along with normal recovery and backup testing.

            I pushed some suggested cleanup. Please review it and submit a squashed commit for my final review. I did not yet review all of it. What I read so far looks OK to me.

            marko Marko Mäkelä added a comment -

            This looks good. Could you please try to revive the log_flush_task that was removed in MDEV-25948? That is, check if adding an asynchronous log write between buf_flush_page_cleaner() batches could reduce the latency of checkpoint flushing.

            marko Marko Mäkelä added a comment -

            I tried to pre-flush the log with the code below. It did not really make any measurable difference in testing.

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index bbbd26bd8b4..265aff708d9 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -1514,6 +1514,10 @@ static ulint buf_flush_list(ulint max_n= ULINT_UNDEFINED, lsn_t lsn= LSN_MAX)
               if (buf_pool.n_flush_list())
                 return 0;
            +  /* Initiate redo log writing, if not already in progress,
            +  to speed up possible synchronous log writes later on.*/
            +  log_buffer_flush_to_disk_async();
            +
               mysql_mutex_lock(&buf_pool.mutex);
               const bool running= buf_pool.n_flush_list_ != 0;
               /* FIXME: we are performing a dirty read of buf_pool.flush_list.count
            

            wlad Vladislav Vaintroub added a comment -
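
            For the record, a standalone sketch of the "initiate, if not already in progress" semantics described by the comment in the diff above (hypothetical names, not the real log_buffer_flush_to_disk_async()): an atomic flag lets the first caller submit the write while concurrent callers return immediately instead of queuing duplicate requests.

            #include <atomic>
            #include <cstdio>

            static std::atomic<bool> write_in_progress{false};

            static void submit_async_log_write()          // hypothetical AIO hand-off
            {
              std::puts("async redo log write submitted");
            }

            static void flush_to_disk_async_sketch()
            {
              bool expected= false;
              // Only the first caller initiates the write; everyone else returns at once.
              if (write_in_progress.compare_exchange_strong(expected, true))
                submit_async_log_write();
            }

            static void on_async_log_write_complete()     // completion path clears the flag
            {
              write_in_progress.store(false);
            }

            int main()
            {
              flush_to_disk_async_sketch();               // initiates
              flush_to_disk_async_sketch();               // no-op: already in progress
              on_async_log_write_complete();
            }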
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] In Testing [ 10301 ]
            marko Marko Mäkelä made changes -
            Assignee Vladislav Vaintroub [ wlad ] Matthias Leich [ mleich ]
            axel Axel Schwenke made changes -
            Attachment sysbench.pdf [ 62547 ]
            axel Axel Schwenke added a comment -

            Performance testing with sysbench shows no improvement: sysbench.pdf


            There are some slight signs of improvement, especially at low concurrency.

            I think what's missing in testing, for the full picture, is probably:

            • innodb_flush_method=O_DSYNC, or ask marko how to change the code to use O_DIRECT for redo log writes by default; for some unclear/undocumented reason, it is not yet used on Linux. This is especially relevant when using libaio, because according to widespread rumors it does not work well with cached files.
            • thread_handling=pool-of-threads

            I think it is enough to test only update_index and update_non_index with the above variables. The O_DIRECT default can be reconsidered in general, though this is up to @marko.

            wlad Vladislav Vaintroub added a comment -

            Yes, after MDEV-27774 removed a bottleneck, it should be worth trying to enable O_DIRECT on the Linux redo log by default.

            marko Marko Mäkelä added a comment -
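
            For context, a standalone sketch of what enabling O_DIRECT for the redo log implies at the POSIX level on Linux (not MariaDB code): the file is opened with O_DIRECT, and both the buffer address and the I/O size must be suitably aligned (4096 bytes is assumed here), otherwise pwrite() fails with EINVAL.

            #ifndef _GNU_SOURCE
            #define _GNU_SOURCE                            /* for O_DIRECT */
            #endif
            #include <fcntl.h>
            #include <unistd.h>
            #include <cstdio>
            #include <cstdlib>
            #include <cstring>

            int main()
            {
              const size_t block= 4096;                    // assumed alignment
              int fd= open("ib_logfile_demo", O_CREAT | O_WRONLY | O_DIRECT, 0600);
              if (fd < 0) { perror("open"); return 1; }

              void *buf= nullptr;
              if (posix_memalign(&buf, block, block)) { close(fd); return 1; }
              memset(buf, 0, block);                       // one aligned log block

              if (pwrite(fd, buf, block, 0) != (ssize_t) block)
                perror("pwrite");                          // unaligned I/O would fail here

              free(buf);
              close(fd);
              unlink("ib_logfile_demo");
            }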
            axel Axel Schwenke made changes -
            Attachment tpcc4.pdf [ 62560 ]
            Attachment tpcc5.pdf [ 62561 ]
            axel Axel Schwenke added a comment - - edited

            Performance testing with sysbench-tpcc shows no improvement either. For the CPU-bound case, MDEV-26603 is on par with the 10.9 baseline. For the IO-bound case it falls behind: tpcc4.pdf

            Using async log flushing has the same problem. The general numbers increase, but the baseline is still faster: tpcc5.pdf

            IO-wise, this test was done with the default setting for innodb_flush_method, which should be O_DIRECT.

            axel Axel Schwenke added a comment -

            In numbers

            sysbench-tpcc @ 32 thd, scale 64x10 (cpu bound)
             
            version                         TPS
            --------------------------------------
            mariadb-10.7.3                  3729.4
            mariadb-10.8-0366fb5f54a        3799.9
            mariadb-10.9-66b5b9214b1        5127.8
            MDEV-26603-24541d06140          5150.4
             
             
            sysbench-tpcc @ 32 thd, scale 160x10 (io bound)
             
            version                         TPS
            --------------------------------------
            mariadb-10.7.3                  804.88
            mariadb-10.8-0366fb5f54a        814.46
            mariadb-10.9-66b5b9214b1        860.87
            MDEV-26603-24541d06140          772.41
             
             
            sysbench-tpcc @ 32 thd, scale 160x10 (io bound)
            relaxed flushing (innodb_flush_log_at_trx_commit=0)
             
            version                         TPS
            --------------------------------------
            mariadb-10.9-66b5b9214b1        924.90
            MDEV-26603-24541d06140          884.13
            

            serg Sergei Golubchik made changes -

            I conducted some benchmarks on my own, comparing preview-10.9-MDEV-26603-async-redo-write to its parent, both with a merge of MDEV-28111, to enable O_DIRECT writes of the redo log by default. The storage was NVMe with ext4fs and io_uring on Linux 5.16.12.

            In a test of LRU eviction flushing (innodb_buffer_pool_size=1G, innodb_log_file_size=4G, innodb_flush_log_at_trx_commit=0, a few gigabytes of working set), I observed an improvement:

            version   throughput/tps  latency/ms
            baseline  6357.62         26.20
            async     7022.63         23.95

            In all other cases, I observed a small performance regression, for all 3 values of innodb_flush_log_at_trx_commit.

            This is a dual socket system, and I tried to use enough threads to have some cross NUMA traffic. It could be wise to revise the changes that were made to write_lock and flush_lock.

            marko Marko Mäkelä added a comment -

            I tried to chase down the performance regression for the basic case (not the LRU flushing case where I was able to reproduce a 10% improvement).

            I do not know what I am doing wrong this time, but my notes from MDEV-26004 did not quite work: I always got "unknown" symbols reported for mariadbd in the offcputime-bpfcc flame graphs, no matter what I tried. The symbols in my Linux 5.17.0 kernel resolved just fine.

            On both NVMe with ext4fs, and RAM with tmpfs, I consistently saw a performance regression.

            With libaio the regression on NVMe was smaller. With io_uring, there were kernel stack traces with schedule_timeout being called by io_wqe_worker when using the async redo log write.

            To avoid the "fake PMEM" code path of MDEV-14425, which is not affected by these changes, I placed the data directory in /run/user/$UID instead of /dev/shm. With tmpfs and simulated AIO, the offcputime flame graphs indicated some increased waiting on a futex. According to the performance schema, the async log write reduced waits on log_sys.latch but somewhat increased waits on some MDL. I tried disabling the backup locks, but the performance regression remained.

            I tried innodb_force_recovery=2 to disable the purge of transaction history, and innodb_undo_tablespaces=100 to reduce contention on the fil_space_t::latch of the system tablespace. The regression became somewhat smaller but remained. Nothing in the performance_schema output really stood out.

            I tested a simple Sysbench oltp_update_index at 10 threads for 10 or 30 seconds, with innodb_flush_log_at_trx_commit=1 and a buffer pool and log file large enough that no checkpoint flushing would take place. The performance regression was prominent even in such tiny runs.

            wlad, unless you are able to figure out the cause of the regression, I am afraid that we must put this on hold.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee Matthias Leich [ mleich ] Vladislav Vaintroub [ wlad ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            wlad Vladislav Vaintroub made changes -
            Attachment base_cpu.svg [ 62801 ]
            Attachment patch_cpu.svg [ 62802 ]
            wlad Vladislav Vaintroub made changes -
            Attachment base_offcpu.svg [ 62803 ]
            Attachment patch_offcpu.svg [ 62804 ]

            No, I'm not able to figure out the cause, but I attach the flame graphs for posterity. The CPU flame graph shows more of log_write_up_to being outsourced to a background threadpool thread, which is theoretically a good thing.

            wlad Vladislav Vaintroub added a comment -
            wlad Vladislav Vaintroub made changes -
            Priority Critical [ 2 ] Minor [ 4 ]
            wlad Vladislav Vaintroub added a comment - - edited

            Since nobody is interested in pushing this into 10.9 or in performing more Linux-specific benchmarking and analysis, I am adjusting the priority accordingly.

            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.10 [ 27530 ]

            I think that MDEV-28313 may have addressed the unexpected regression. If wlad and axel can confirm that there no longer is any performance regression, I think that this is OK to push. I did not measure the LRU flushing scenario, but I have no reason to believe that the previously observed performance gain would disappear in that case.

            Part of MDEV-21423 (scalability bottleneck due to trx_sys.rw_trx_hash) was addressed by MDEV-28313. I tested the following with a quick benchmark (see MDEV-21423 for the description).

            version          20         40         80         160        320        640
            patched          171236.52  192544.82  115298.23  151579.70  146904.22  151999.43
            10.9+MDEV-28313  169572.38  191460.20  137424.75  137625.91  141308.08  151053.27
            10.9             167970.59  190672.82  142246.60  138316.21  136684.18  104582.04

            Note: The 10.9 baseline is 2 commits older than the merge of MDEV-28313, but I do not think it matters for this benchmark.
            As noted in MDEV-21423, the figures for 80 concurrent threads are unreliable, because a checkpoint flush would have occurred during it.

            marko Marko Mäkelä added a comment -

            I realized that unlike for MDEV-28313, this work is best tested with innodb_flush_log_at_trx_commit=1. So, I reran the benchmark:

            version          20        40        80         160        320        640
            patched          39524.83  82090.86  154108.89  152452.54  128073.02  131764.78
            10.9+MDEV-28313  43330.10  87049.76  151794.40  151358.43  127131.37  131002.28
            10.9             44416.55  86055.97  151811.06  140709.94  128494.92  132801.09

            Except for the lowest concurrency, it is actually looking good.

            This 30-second benchmark is of course too short to draw any real conclusion, but it does not look too bad. For the 10.9 baseline, the checkpoint flush occurred while the test was running at 160 concurrent users. For the 640 concurrent users, I restarted a new sysbench prepare and run, as with all the recent benchmarks. That is why it is showing slightly better throughput than the 320-user test.

            marko Marko Mäkelä added a comment -

            To assess the impact of MDEV-28313, I repeated a quick Sysbench 8×100,000-row oltp_update_index test without MDEV-28313 and with innodb_flush_log_at_trx_commit=1.
            The version column legend is the same as in the previous comments, except for the introduction of 10.9+merge of async, which is the same as patched but without the MDEV-28313 changes.

            version              20        40        80         160        320        640
            10.9+merge of async  40062.04  82227.38  154505.53  149740.18  123871.06  131360.35
            10.9                 42809.14  87178.33  152955.76  151528.31  124043.59  131941.35

            We can observe some insignificant improvement at 80 concurrent connections (which is "polluted" by the checkpoint flush that occurred during that test), and otherwise a performance regression or no improvement.
            This 30-second benchmark run is too short to draw any definite conclusion. The bottom line of the table is actually from a setup equivalent to the bottom line of the previous table.

            One interesting change is that with the MDEV-28313 change included, we saw a slight improvement at 160 concurrent connections, but without it, we can observe a regression.
            I reran the test in a different way (prepare+run 30 seconds with 80 clients, prepare+run 30 seconds with 160 clients) to gain some more confidence:

            version              80         160
            10.9+merge of async  151006.08  154541.20
            10.9                 150857.26  158157.07

            This time, no checkpoint flushing occurred during the 80-client run, and we see no significant improvement. The clear regression at 160 clients remained.

            The counterintuitive performance regression could partly be addressed by MDEV-28313. With the test oltp_update_non_index, performance problems related to the lock-free hash table trx_sys.rw_trx_hash (MDEV-21423) should matter less:

            version              20        40        80         160        320        640
            10.9+merge of async  38514.14  89237.51  167100.82  192394.20  189902.25  193034.80
            10.9                 42022.65  97957.34  169509.91  187099.23  191413.50  199397.91

            Traversal of the entire trx_sys.rw_trx_hash table is necessary not only for checking locks on secondary indexes, but also for read view creation. Let us additionally specify --transaction-isolation=READ-UNCOMMITTED to reduce that activity (purge_sys.view must still be updated), and test it also with the MDEV-28313 improvements:

            version              20        40        80         160        320        640
            patched              38794.57  89714.06  168784.54  191521.04  189094.02  192025.07
            10.9+MDEV-28313      41801.26  97290.21  170614.28  187754.89  196493.73  197833.70
            10.9+merge of async  38383.13  89040.30  168254.81  192200.60  195663.06  193661.70
            10.9                 43503.02  98087.92  169159.82  189859.61  194903.15  199073.90

            In this scenario with reduced activity around trx_sys.rw_trx_hash, MDEV-28313 should matter less, that is, the difference between the 2nd and 4th row should be mostly noise. However, we can still observe a consistent performance regression due to the asynchronous log writing.

            We will need deeper analysis to identify the bottleneck that causes the counterintuitive performance regression. MDEV-21423 may or may not fix this. An artificial benchmark that concurrently updates a very large number of SEQUENCE objects (MDEV-10139) should completely rule out the InnoDB transaction subsystem, because operations on SEQUENCE objects only generate redo log, no undo log at all.

            http://www.brendangregg.com/offcpuanalysis.html could be useful if it did not emit most call frames as "unknown" in my recent tests. I should investigate if https://github.com/iovisor/bcc/issues/3884 would fix that.

            marko Marko Mäkelä added a comment -

            As noted in MDEV-28766, I repeated a test run after fixing the performance regression MDEV-28708. I still observe up to 10% performance regression at low numbers of concurrent connections after applying MDEV-26603. But, my test is probably way too small to draw any definite conclusion.

            marko Marko Mäkelä added a comment -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 10.10 [ 27530 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.12 [ 28320 ]
            Fix Version/s 10.11 [ 27614 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_removed_10.9
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.0 [ 28320 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_removed_10.9 Preview_10.9 Preview_11.1 Preview_removed_10.9
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_10.9 Preview_11.1 Preview_removed_10.9 Preview_10.9 Preview_11.1
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_10.9 Preview_11.1 Preview_10.9
            marko Marko Mäkelä made changes -

            For the record, the redo log checkpoint used to be written asynchronously until the code was simplified in MariaDB Server 10.5.0.

            I do not think that bringing it back would help much, but I thought that I would mention for the sake of completeness.

            marko Marko Mäkelä added a comment -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.3 [ 28565 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.5 [ 29506 ]
            Fix Version/s 11.4 [ 29301 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Issue Type Task [ 3 ] New Feature [ 2 ]

            I spent some time merging the changes from 10.9 to 11.0; I think 10.11 would have been the same in terms of conflicts (many of them due to MDEV-33379, which reminded me of this task). I hit a fundamental conflict:

            ulint buf_flush_LRU(ulint max_n, bool evict)
            {
              mysql_mutex_assert_owner(&buf_pool.mutex);
             
            <<<<<<< HEAD
              flush_counters_t n;
              buf_do_LRU_batch(max_n, evict, &n);
            ||||||| 10d9b890b0f
              log_buffer_flush_to_disk();
            =======
              log_buffer_flush_to_disk_async();
            >>>>>>> fbf8646335280150a6ecf5727effb1a719f26b22
             
              ulint pages= n.flushed;
             
              if (n.evicted)
            

            This was the only invocation of an asynchronous log write if we ignore the rare special case of innodb_undo_log_truncate=ON in mtr_t::commit_shrink(). The call to the synchronous log write had been removed in MDEV-26055 when we made the buf_flush_page_cleaner() thread spend the rest of its innodb_io_capacity per-second ‘budget’ on LRU eviction flushing.

            It does not seem feasible to pursue this further.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2024-02-20 15:35:24.0 2024-02-20 15:35:24.09
            marko Marko Mäkelä made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 11.5 [ 29506 ]
            Resolution Won't Fix [ 2 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved:
