Details

    Description

      The most important use case is the thread pool: an asynchronous log write avoids blocking the thread that acts as the group commit leader.
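      A minimal sketch of the idea, assuming a completion-callback interface; the names RedoLog, log_write_async, and Pending are hypothetical, not MariaDB's actual API. The group commit leader registers the waiting transactions, submits the write, and returns to the thread pool instead of blocking until the log is durable:

      #include <cstdint>
      #include <functional>
      #include <mutex>
      #include <utility>
      #include <vector>

      // Hypothetical sketch, not MariaDB code: a redo log whose write
      // completion is reported through a callback, so the group commit
      // leader never blocks on the write or fsync.
      struct Pending
      {
        uint64_t lsn;                  // commit LSN the waiter needs durable
        std::function<void()> resume;  // resumes the committing connection
      };

      class RedoLog
      {
        std::mutex mu;
        std::vector<Pending> waiters;
      public:
        // Called by the group commit leader: register the waiters and
        // initiate the write, then return to the thread pool immediately.
        void log_write_async(uint64_t lsn, std::function<void()> on_durable)
        {
          {
            std::lock_guard<std::mutex> g(mu);
            waiters.push_back({lsn, std::move(on_durable)});
          }
          submit_io(lsn);
        }

        // I/O completion handler, run on any worker thread: resume every
        // transaction whose commit LSN has become durable.
        void on_write_complete(uint64_t durable_lsn)
        {
          std::vector<Pending> ready;
          {
            std::lock_guard<std::mutex> g(mu);
            for (auto it= waiters.begin(); it != waiters.end(); )
              if (it->lsn <= durable_lsn)
              {
                ready.push_back(std::move(*it));
                it= waiters.erase(it);
              }
              else
                ++it;
          }
          for (auto &p : ready)
            p.resume();
        }
      private:
        void submit_io(uint64_t) {}    // stand-in for an async write+flush
      };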

      Attachments

        1. base_cpu.svg
          606 kB
        2. base_offcpu.svg
          90 kB
        3. patch_cpu.svg
          710 kB
        4. patch_offcpu.svg
          90 kB
        5. sysbench.pdf
          25 kB
        6. tpcc4.pdf
          364 kB
        7. tpcc5.pdf
          192 kB

          Activity

            I realized that unlike for MDEV-28313, this work is best tested with innodb_flush_log_at_trx_commit=1: with that setting, every transaction commit waits for a durable log write, which is exactly the path that an asynchronous log write changes. So, I reran the benchmark:

            version \ clients  20        40        80         160        320        640
            patched            39524.83  82090.86  154108.89  152452.54  128073.02  131764.78
            10.9+MDEV-28313    43330.10  87049.76  151794.40  151358.43  127131.37  131002.28
            10.9               44416.55  86055.97  151811.06  140709.94  128494.92  132801.09

            Except for the lowest concurrency, it is actually looking good.

            This 30-second benchmark is of course too short to draw any real conclusion, but it does not look too bad. For the 10.9 baseline, the checkpoint flush occurred while the test was running at 160 concurrent users. For the 640-user test, I started a fresh sysbench prepare and run, as with all the recent benchmarks; that is why it shows slightly better throughput than the 320-user test.

            marko Marko Mäkelä added a comment

            To assess the impact of MDEV-28313, I repeated a quick Sysbench 8×100,000-row oltp_update_index test without MDEV-28313 and with innodb_flush_log_at_trx_commit=1.
            The version column legend is the same as in the previous comments, except for the introduction of 10.9+merge of async, which is the same as patched but without the MDEV-28313 changes.

            version \ clients    20        40        80         160        320        640
            10.9+merge of async  40062.04  82227.38  154505.53  149740.18  123871.06  131360.35
            10.9                 42809.14  87178.33  152955.76  151528.31  124043.59  131941.35

            We can observe an insignificant improvement at 80 concurrent connections (a result "polluted" by the checkpoint flush that occurred during that test), and otherwise a performance regression or no improvement.
            This 30-second benchmark run is too short to draw any definite conclusion. Note that the bottom row of this table was obtained with a setup equivalent to that of the bottom row of the previous table.

            One interesting difference is that with the MDEV-28313 change included, we saw a slight improvement at 160 concurrent connections, but without it, we observe a regression.
            I reran the test in a different way (prepare+run 30 seconds with 80 clients, then prepare+run 30 seconds with 160 clients) to gain more confidence:

            version \ clients    80         160
            10.9+merge of async  151006.08  154541.20
            10.9                 150857.26  158157.07

            This time, no checkpoint flushing occurred during the 80-client run, and we see no significant improvement. The clear regression at 160 clients remained.

            The counterintuitive performance regression could partly be addressed by MDEV-28313. With the test oltp_update_non_index, performance problems related to the lock-free hash table trx_sys.rw_trx_hash (MDEV-21423) should matter less:

            version \ clients    20        40        80         160        320        640
            10.9+merge of async  38514.14  89237.51  167100.82  192394.20  189902.25  193034.80
            10.9                 42022.65  97957.34  169509.91  187099.23  191413.50  199397.91

            Traversal of the entire trx_sys.rw_trx_hash table is necessary not only for checking locks on secondary indexes, but also for read view creation (see the sketch after the table below). Let us additionally specify --transaction-isolation=READ-UNCOMMITTED to reduce that activity (purge_sys.view must still be updated), and test it also with the MDEV-28313 improvements:

            version \ clients    20        40        80         160        320        640
            patched              38794.57  89714.06  168784.54  191521.04  189094.02  192025.07
            10.9+MDEV-28313      41801.26  97290.21  170614.28  187754.89  196493.73  197833.70
            10.9+merge of async  38383.13  89040.30  168254.81  192200.60  195663.06  193661.70
            10.9                 43503.02  98087.92  169159.82  189859.61  194903.15  199073.90
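            To illustrate the traversal mentioned above, here is a minimal sketch; ActiveTrxSet and make_read_view are hypothetical stand-ins for trx_sys.rw_trx_hash and InnoDB's read view creation, not the actual code. The point is that every new read view must copy the set of active transactions, while READ UNCOMMITTED skips the snapshot entirely:

            #include <cstdint>
            #include <mutex>
            #include <unordered_set>
            #include <vector>

            // Hypothetical stand-in for the lock-free trx_sys.rw_trx_hash.
            struct ActiveTrxSet
            {
              std::mutex mu;                    // the real structure is lock-free
              std::unordered_set<uint64_t> ids; // active read-write transactions
              template <typename F> void for_each(F f)
              {
                std::lock_guard<std::mutex> g(mu);
                for (uint64_t id : ids)
                  f(id);                        // full traversal: O(active trx)
              }
            };

            struct ReadView
            {
              uint64_t up_to;                   // trx IDs below this may be visible
              std::vector<uint64_t> invisible;  // changes of these stay invisible
            };

            // At REPEATABLE READ or READ COMMITTED, every new read view copies
            // the IDs of all concurrently active transactions, traversing the
            // whole set. At READ UNCOMMITTED no snapshot is taken, so this cost
            // (and the implicit-lock checks on secondary index records, which
            // consult the same structure) largely disappears.
            ReadView make_read_view(ActiveTrxSet &trx, uint64_t max_trx_id)
            {
              ReadView v{max_trx_id, {}};
              trx.for_each([&](uint64_t id) { v.invisible.push_back(id); });
              return v;
            }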

            In this scenario with reduced activity around trx_sys.rw_trx_hash, MDEV-28313 should matter less; that is, the difference between the 10.9+MDEV-28313 and plain 10.9 rows should be mostly noise. However, we can still observe a consistent performance regression due to the asynchronous log writing.

            We will need deeper analysis to identify the bottleneck that causes the counterintuitive performance regression. MDEV-21423 may or may not fix this. An artificial benchmark that concurrently updates a very large number of SEQUENCE objects (MDEV-10139) should completely rule out the InnoDB transaction subsystem, because operations on SEQUENCE objects only generate redo log, no undo log at all.

            Off-CPU analysis (http://www.brendangregg.com/offcpuanalysis.html) could be useful, but in my recent tests it emitted most call frames as "unknown". I should investigate whether https://github.com/iovisor/bcc/issues/3884 would fix that.

            marko Marko Mäkelä added a comment

            As noted in MDEV-28766, I repeated a test run after fixing the performance regression MDEV-28708. I still observe up to a 10% performance regression at low numbers of concurrent connections after applying MDEV-26603. But my test is probably way too small to draw any definite conclusion.

            marko Marko Mäkelä added a comment

            For the record, the redo log checkpoint used to be written asynchronously until the code was simplified in MariaDB Server 10.5.0.

            I do not think that bringing it back would help much, but I thought I would mention it for the sake of completeness.

            marko Marko Mäkelä added a comment

            I spent some time merging the changes from 10.9 to 11.0; I think 10.11 would have been the same in terms of conflicts (many of them due to MDEV-33379, which reminded me of this task). I hit a fundamental conflict:

            ulint buf_flush_LRU(ulint max_n, bool evict)
            {
              mysql_mutex_assert_owner(&buf_pool.mutex);

            <<<<<<< HEAD
              flush_counters_t n;
              buf_do_LRU_batch(max_n, evict, &n);
            ||||||| 10d9b890b0f
              log_buffer_flush_to_disk();
            =======
              log_buffer_flush_to_disk_async();
            >>>>>>> fbf8646335280150a6ecf5727effb1a719f26b22

              ulint pages= n.flushed;

              if (n.evicted)

            This was the only invocation of an asynchronous log write if we ignore the rare special case of innodb_undo_log_truncate=ON in mtr_t::commit_shrink(). The call to the synchronous log write had been removed in MDEV-26055 when we made the buf_flush_page_cleaner() thread spend the rest of its innodb_io_capacity per-second ‘budget’ on LRU eviction flushing.
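            For reference, here is a sketch of the conflict resolved in favour of HEAD, with the InnoDB declarations stubbed out so the fragment stands alone; in the resolved code there is simply no call site left for log_buffer_flush_to_disk_async():

            #include <cstddef>

            // Stubs standing in for the real InnoDB declarations.
            typedef size_t ulint;
            struct flush_counters_t { ulint flushed, evicted; };
            static void buf_do_LRU_batch(ulint, bool, flush_counters_t *n)
            { n->flushed= 0; n->evicted= 0; }
            static struct { int mutex; } buf_pool;
            #define mysql_mutex_assert_owner(m) (void) (m)

            ulint buf_flush_LRU(ulint max_n, bool evict)
            {
              mysql_mutex_assert_owner(&buf_pool.mutex);

              flush_counters_t n;
              buf_do_LRU_batch(max_n, evict, &n); // MDEV-26055: no log write here

              ulint pages= n.flushed;

              if (n.evicted)
              { /* LRU eviction accounting, elided */ }
              return pages;
            }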

            It does not seem feasible to pursue this further.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 10

