Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23475

InnoDB performance regression for write-heavy workloads

Details

    Description

      Write-heavy benchmarks show a regression in latency and throughput for latest releases in the 10.2, 10.3 and 10.4 series.

      Example:

      ----------------------------------------------------------------------------
      Test 't_oltp-innodb-multi' - sysbench OLTP read/write
      32 InnoDB tables with 10 mio rows total
      numbers are queries per second
       
      #thread count           1       8       16      32      64      128     256
      mariadb-10.2.32         5262.8  28886   53535   95712   158049  189909  192668
      mariadb-10.2.33         4866.7  22651   42352   78750   133580  173461  191125
       
      mariadb-10.3.23         5126.4  28716   52785   95466   156936  185473  187136
      mariadb-10.3.24         4772.0  22959   42713   80442   133492  171252  186200
       
      mariadb-10.4.13         5015.7  28263   52033   93771   154690  178179  178076
      mariadb-10.4.14         4980.9  23496   43145   79795   128977  166470  179285
      

      I examined this more closely for 10.2, bisecting the commits between 10.2.32 and 10.2.33. The first faulty commit is

      commit fe39d02f51b96536dccca7ff89faf05e13548877
      Author: Thirunarayanan Balathandayuthapani <thiru@mariadb.com>
      Date:   Thu Jul 23 16:23:20 2020 +0530
       
       MDEV-20638 Remove the deadcode from srv_master_thread() and srv_active_wake_master_thread_low()
      

      Attachments

        1. 10.2.bad.svg
          700 kB
        2. 10.2.good.svg
          684 kB
        3. 10.2.pdf
          21 kB
        4. 10.2-2nd.pdf
          22 kB
        5. 10.2-binlog.png
          10.2-binlog.png
          11 kB
        6. 10.2-binlog4.png
          10.2-binlog4.png
          9 kB
        7. 10.2-binlog-revert.png
          10.2-binlog-revert.png
          11 kB
        8. 10.2-revert.pdf
          23 kB
        9. 10.3.pdf
          20 kB
        10. 10.4.pdf
          21 kB
        11. 10.5-2nd.pdf
          20 kB
        12. 10.5-4th.pdf
          21 kB
        13. MDEV-23475.png
          MDEV-23475.png
          6 kB
        14. MDEV-23475-final.pdf
          86 kB

        Issue Links

          Activity

            axel Axel Schwenke added a comment -

            Attached: flame graphs for OLTP read/write @ 32 threads.

            bad=fe39d02f51b96536dccca7ff89faf05e13548877
            good=bad~

            axel Axel Schwenke added a comment - Attached: flame graphs for OLTP read/write @ 32 threads. bad=fe39d02f51b96536dccca7ff89faf05e13548877 good=bad~

            If I remember correctly, MDEV-20638 included also a small functional change.

            Because the master thread is what triggers periodic log checkpoints in InnoDB, a change to how often the master thread is woken up could affect the page flushing activity.

            thiru, please try to revert the parts that affect the way how often the master thread is woken up.

            marko Marko Mäkelä added a comment - If I remember correctly, MDEV-20638 included also a small functional change. Because the master thread is what triggers periodic log checkpoints in InnoDB, a change to how often the master thread is woken up could affect the page flushing activity. thiru , please try to revert the parts that affect the way how often the master thread is woken up.

            MDEV-20638 removed many calls to innobase_active_small() and srv_active_wake_master_thread() without any replacement. The calls to srv_wake_master_thread() were replaced with the equivalent srv_inc_activity_count().

            We must try to see if we can avoid introducing a replacement for the innobase_active_small() calls, and only add something in place of some srv_active_master_thread() calls. I think that a minimal change could be the following:

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 0dbd985b6c3..cd2d381760a 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1804,6 +1804,7 @@ trx_commit_in_memory(
             		}
             
             		trx->commit_lsn = lsn;
            +		srv_inc_activity_count();
             	}
             
             	ut_ad(!trx->rsegs.m_noredo.undo);
            

            marko Marko Mäkelä added a comment - MDEV-20638 removed many calls to innobase_active_small() and srv_active_wake_master_thread() without any replacement. The calls to srv_wake_master_thread() were replaced with the equivalent srv_inc_activity_count() . We must try to see if we can avoid introducing a replacement for the innobase_active_small() calls, and only add something in place of some srv_active_master_thread() calls. I think that a minimal change could be the following: diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 0dbd985b6c3..cd2d381760a 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1804,6 +1804,7 @@ trx_commit_in_memory( } trx->commit_lsn = lsn; + srv_inc_activity_count(); } ut_ad(!trx->rsegs.m_noredo.undo);
            axel Axel Schwenke added a comment -

            marko I just tried that patch ontop of 10.2 HEAD. It restores the original performance. See attached MDEV-23475.png

            axel Axel Schwenke added a comment - marko I just tried that patch ontop of 10.2 HEAD. It restores the original performance. See attached MDEV-23475.png
            axel Axel Schwenke added a comment - - edited

            I just tried that patch ontop of 10.2 HEAD. It restores the original performance.

            Unfortunately not for all tests. It has no, or even the opposite effect for the test with synchronous binlog. All tests run with full durability (innodb_flush_log_at_trx_commit = 1) and have plenty of buffer pool and redo log space. The writes-binlog-multi test additionally enables the binlog with sync-binlog = 1. This probably takes another route in the redo-log flushing.

            Attached results: 10.2.pdf

            axel Axel Schwenke added a comment - - edited I just tried that patch ontop of 10.2 HEAD. It restores the original performance. Unfortunately not for all tests. It has no, or even the opposite effect for the test with synchronous binlog. All tests run with full durability (innodb_flush_log_at_trx_commit = 1) and have plenty of buffer pool and redo log space. The writes-binlog-multi test additionally enables the binlog with sync-binlog = 1 . This probably takes another route in the redo-log flushing. Attached results: 10.2.pdf

            axel, thank you. I think that I will next try removing all other invocations of srv_inc_activity_count() and ensure that we will invoke it once for each commit of a transaction that modifies persistent tables.

            marko Marko Mäkelä added a comment - axel , thank you. I think that I will next try removing all other invocations of srv_inc_activity_count()  and ensure that we will invoke it once for each commit of a transaction that modifies persistent tables.
            axel Axel Schwenke added a comment -

            In 10.4 this area of the code has changed a lot. I'm wondering why we see the same behavior, though the bad commit from 10.2 cannot be in there. Does it make sense to look for the bad commit in 10.4 too?

            axel Axel Schwenke added a comment - In 10.4 this area of the code has changed a lot. I'm wondering why we see the same behavior, though the bad commit from 10.2 cannot be in there. Does it make sense to look for the bad commit in 10.4 too?

            axel, in 10.3 or 10.4 you would probably find a larger merge that included many other changes than the 10.2 commit that caused the regression. Maybe the merge commit is canceling out some effects? I slightly revised the patch. I hope that it will work also with sync_binlog=1, but I did not test it. The patch should apply to 10.2, 10.3, 10.4, 10.5 without problems:

            diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
            index 7c61ad9b45b..7544f120284 100644
            --- a/storage/innobase/row/row0mysql.cc
            +++ b/storage/innobase/row/row0mysql.cc
            @@ -3820,8 +3820,6 @@ row_drop_table_for_mysql(
             
             	trx->op_info = "";
             
            -	srv_inc_activity_count();
            -
             	DBUG_RETURN(err);
             }
             
            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 0dbd985b6c3..117ff64761c 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1537,6 +1537,7 @@ trx_flush_log_if_needed_low(
             	case 1:
             		/* Write the log and optionally flush it to disk */
             		log_write_up_to(lsn, flush);
            +		srv_inc_activity_count();
             		return;
             	case 0:
             		/* Do nothing */
            

            I think that the call during DROP TABLE is unnecessary, because in that function we did invoke a transaction commit a little earlier. Other invocations did not seem to be that safe to remove.

            marko Marko Mäkelä added a comment - axel , in 10.3 or 10.4 you would probably find a larger merge that included many other changes than the 10.2 commit that caused the regression. Maybe the merge commit is canceling out some effects? I slightly revised the patch. I hope that it will work also with sync_binlog=1 , but I did not test it. The patch should apply to 10.2, 10.3, 10.4, 10.5 without problems: diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc index 7c61ad9b45b..7544f120284 100644 --- a/storage/innobase/row/row0mysql.cc +++ b/storage/innobase/row/row0mysql.cc @@ -3820,8 +3820,6 @@ row_drop_table_for_mysql( trx->op_info = ""; - srv_inc_activity_count(); - DBUG_RETURN(err); } diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 0dbd985b6c3..117ff64761c 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1537,6 +1537,7 @@ trx_flush_log_if_needed_low( case 1: /* Write the log and optionally flush it to disk */ log_write_up_to(lsn, flush); + srv_inc_activity_count(); return; case 0: /* Do nothing */ I think that the call during DROP TABLE is unnecessary, because in that function we did invoke a transaction commit a little earlier. Other invocations did not seem to be that safe to remove.
            axel Axel Schwenke added a comment - - edited

            Hi marko. Indeed that new patch applies to both 10.2 and 10.5 (I didn't test the others). For 10.2 it behaves very similar to the first version. See 10.2-2nd.pdf

            On the positive side: the new build (green line) is fastest for all tests except t_writes-binlog-multi. On the other hand it's not better than the first patch for t_writes-binlog-multi and does not match the performance before commit fe39d02f51 (purple line) for medium thread counts.

            The graph shows data points for 1, 8, 16, 32, 64, 128 and 256 threads (on 32-thread capable hardware). The middle 3 data points is where I expect customer systems to run.

            axel Axel Schwenke added a comment - - edited Hi marko . Indeed that new patch applies to both 10.2 and 10.5 (I didn't test the others). For 10.2 it behaves very similar to the first version. See 10.2-2nd.pdf On the positive side: the new build (green line) is fastest for all tests except t_writes-binlog-multi . On the other hand it's not better than the first patch for t_writes-binlog-multi and does not match the performance before commit fe39d02f51 (purple line) for medium thread counts. The graph shows data points for 1, 8, 16, 32, 64, 128 and 256 threads (on 32-thread capable hardware). The middle 3 data points is where I expect customer systems to run.

            I think that the remaining regression should go away if we additionally apply a subset of the following patch:

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 8b58ca57a74..d0d739f4ae5 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -6607,6 +6607,8 @@ ha_innobase::close()
             
             	MONITOR_INC(MONITOR_TABLE_CLOSE);
             
            +	srv_inc_activity_count();
            +
             	DBUG_RETURN(0);
             }
             
            @@ -8302,6 +8304,8 @@ ha_innobase::write_row(
             	error_result = convert_error_code_to_mysql(
             		error, m_prebuilt->table->flags, m_user_thd);
             
            +	srv_inc_activity_count();
            +
             #ifdef WITH_WSREP
             	if (!error_result && trx->is_wsrep()
             	    && wsrep_thd_exec_mode(m_user_thd) == LOCAL_STATE
            @@ -8991,6 +8995,8 @@ ha_innobase::update_row(
             			error, m_prebuilt->table->flags, m_user_thd);
             	}
             
            +	srv_inc_activity_count();
            +
             #ifdef WITH_WSREP
             	if (error == DB_SUCCESS && trx->is_wsrep() &&
             	    wsrep_thd_exec_mode(m_user_thd) == LOCAL_STATE &&
            @@ -9045,6 +9051,7 @@ ha_innobase::delete_row(
             	error = row_update_for_mysql(m_prebuilt);
             
             	innobase_srv_conc_exit_innodb(m_prebuilt);
            +	srv_inc_activity_count();
             
             #ifdef WITH_WSREP
             	if (error == DB_SUCCESS && trx->is_wsrep()
            

            The ha_innobase::close() should be executed rather rarely if at all, because the SQL layer is caching table handles. FLUSH TABLES should empty that cache. I do not know why that call was there in the first place. Maybe it was more meaningful before MySQL 5.5, when the master thread was responsible for more tasks, such as running the purge of history?

            The other functions are invoked by INSERT, UPDATE, DELETE. Maybe we should instrument only those functions, and not add the call to trx_flush_log_if_needed_low(). That is, maybe we should apply only this patch (I would recommend to exclude ha_innobase::close()) and nothing else?

            marko Marko Mäkelä added a comment - I think that the remaining regression should go away if we additionally apply a subset of the following patch: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 8b58ca57a74..d0d739f4ae5 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -6607,6 +6607,8 @@ ha_innobase::close() MONITOR_INC(MONITOR_TABLE_CLOSE); + srv_inc_activity_count(); + DBUG_RETURN(0); } @@ -8302,6 +8304,8 @@ ha_innobase::write_row( error_result = convert_error_code_to_mysql( error, m_prebuilt->table->flags, m_user_thd); + srv_inc_activity_count(); + #ifdef WITH_WSREP if (!error_result && trx->is_wsrep() && wsrep_thd_exec_mode(m_user_thd) == LOCAL_STATE @@ -8991,6 +8995,8 @@ ha_innobase::update_row( error, m_prebuilt->table->flags, m_user_thd); } + srv_inc_activity_count(); + #ifdef WITH_WSREP if (error == DB_SUCCESS && trx->is_wsrep() && wsrep_thd_exec_mode(m_user_thd) == LOCAL_STATE && @@ -9045,6 +9051,7 @@ ha_innobase::delete_row( error = row_update_for_mysql(m_prebuilt); innobase_srv_conc_exit_innodb(m_prebuilt); + srv_inc_activity_count(); #ifdef WITH_WSREP if (error == DB_SUCCESS && trx->is_wsrep() The ha_innobase::close() should be executed rather rarely if at all, because the SQL layer is caching table handles. FLUSH TABLES should empty that cache. I do not know why that call was there in the first place. Maybe it was more meaningful before MySQL 5.5, when the master thread was responsible for more tasks, such as running the purge of history? The other functions are invoked by INSERT , UPDATE , DELETE . Maybe we should instrument only those functions, and not add the call to trx_flush_log_if_needed_low() . That is, maybe we should apply only this patch (I would recommend to exclude ha_innobase::close() ) and nothing else?
            axel Axel Schwenke added a comment -

            The second patch looks good with 10.5: 10.5-2nd.pdf

            axel Axel Schwenke added a comment - The second patch looks good with 10.5: 10.5-2nd.pdf
            axel Axel Schwenke added a comment -

            For 10.2 the second patch seems to give the best performance in the t_writes-binlog-multi test. See 10.2-binlog.png

            axel Axel Schwenke added a comment - For 10.2 the second patch seems to give the best performance in the t_writes-binlog-multi test. See 10.2-binlog.png
            axel Axel Schwenke added a comment -

            Patch 4 does not help at all: 10.2-binlog4.png

            axel Axel Schwenke added a comment - Patch 4 does not help at all: 10.2-binlog4.png
            axel Axel Schwenke added a comment -

            Reverting the bad commit in 10.2 gives good resuts in all tests. And it's the only change that I have tested so far, that restores performance for t_writes-binlog-multi

            10.2-revert.pdf

            axel Axel Schwenke added a comment - Reverting the bad commit in 10.2 gives good resuts in all tests. And it's the only change that I have tested so far, that restores performance for t_writes-binlog-multi 10.2-revert.pdf
            axel Axel Schwenke added a comment -

            Patch 4 looks okeyish on 10.5. It's very similar to patch 2, but for the binlog test patch 2 still gives the best result. I think for 10.5 we should just apply patch 2.

            10.5-4th.pdf

            axel Axel Schwenke added a comment - Patch 4 looks okeyish on 10.5. It's very similar to patch 2, but for the binlog test patch 2 still gives the best result. I think for 10.5 we should just apply patch 2. 10.5-4th.pdf

            For the record, patch 4 was similar to this fix, with 2 differences:

            • In ha_innobase::write_row(), patch 4 added the call innobase_active_small() before and not after the func_exit: label
            • In row_drop_table_for_mysql(), patch 4 (similar to patch 2) removed the call to srv_inc_activity_count(). (This call, executed during DDL operations only, should be redundant due to the call in normal transaction commit, and I would prefer to remove it.)

            On a closer look, there only is one goto func_exit in ha_innobase::write_row(), which can only be executed if calc_row_difference() is returning an error. Errors can only be returned when using explicit FTS_DOC_ID column for FULLTEXT INDEX or when attempting to update a geometry column to length 0. I do not think that either case would apply in the benchmark.

            I maintain that patch 4 should revert the functional changes that were made in MDEV-20638. But, because that patch did not improve performance in the t_writes-binlog-multi benchmark, we must conclude that some of the dead code is helping the performance in the 10.2 series due to some unknown reason.

            I would propose that we revert MDEV-20638 in 10.2 to 10.4, but keep it in 10.5, and apply patch 2 to 10.5.

            marko Marko Mäkelä added a comment - For the record, patch 4 was similar to this fix , with 2 differences: In ha_innobase::write_row() , patch 4 added the call innobase_active_small() before and not after the func_exit: label In row_drop_table_for_mysql() , patch 4 (similar to patch 2) removed the call to srv_inc_activity_count() . (This call, executed during DDL operations only, should be redundant due to the call in normal transaction commit, and I would prefer to remove it.) On a closer look, there only is one goto func_exit in ha_innobase::write_row() , which can only be executed if calc_row_difference() is returning an error. Errors can only be returned when using explicit FTS_DOC_ID  column for FULLTEXT INDEX or when attempting to update a geometry column to length 0. I do not think that either case would apply in the benchmark. I maintain that patch 4 should revert the functional changes that were made in MDEV-20638 . But, because that patch did not improve performance in the t_writes-binlog-multi benchmark, we must conclude that some of the dead code is helping the performance in the 10.2 series due to some unknown reason. I would propose that we revert MDEV-20638 in 10.2 to 10.4, but keep it in 10.5, and apply patch 2 to 10.5.

            I reverted MDEV-20638 in 10.2. For 10.3 and 10.4, I think that it would be good to evaluate whether patch 2 would restore the performance.

            marko Marko Mäkelä added a comment - I reverted MDEV-20638 in 10.2. For 10.3 and 10.4, I think that it would be good to evaluate whether patch 2 would restore the performance.
            axel Axel Schwenke added a comment -

            For 10.3 the situation is very much like for 10.2. Patch 2 restores performance for all tests except t_writes-binlog-multi. If we want to fix performance for that scenario, we would have to rollback MDEV-20638

            10.3.pdf

            axel Axel Schwenke added a comment - For 10.3 the situation is very much like for 10.2. Patch 2 restores performance for all tests except t_writes-binlog-multi . If we want to fix performance for that scenario, we would have to rollback MDEV-20638 10.3.pdf
            axel Axel Schwenke added a comment -

            10.4 is again very similar to 10.2 and 10.3. In addition both patch 1 and 2 fix the 90:10 performance only partly. I suggest to rollback MDEV-20638 for 10.4 as well.

            10.4.pdf

            axel Axel Schwenke added a comment - 10.4 is again very similar to 10.2 and 10.3. In addition both patch 1 and 2 fix the 90:10 performance only partly. I suggest to rollback MDEV-20638 for 10.4 as well. 10.4.pdf

            I reverted MDEV-20638 in 10.2 and merged that up to 10.4. In 10.5, instead of reverting the changes, I applied a simple refinement that was called Patch 2:

            diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
            index 7c61ad9b45b..7544f120284 100644
            --- a/storage/innobase/row/row0mysql.cc
            +++ b/storage/innobase/row/row0mysql.cc
            @@ -3820,8 +3820,6 @@ row_drop_table_for_mysql(
             
             	trx->op_info = "";
             
            -	srv_inc_activity_count();
            -
             	DBUG_RETURN(err);
             }
             
            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 0dbd985b6c3..117ff64761c 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1537,6 +1537,7 @@ trx_flush_log_if_needed_low(
             	case 1:
             		/* Write the log and optionally flush it to disk */
             		log_write_up_to(lsn, flush);
            +		srv_inc_activity_count();
             		return;
             	case 0:
             		/* Do nothing */
            

            marko Marko Mäkelä added a comment - I reverted MDEV-20638 in 10.2 and merged that up to 10.4. In 10.5, instead of reverting the changes, I applied a simple refinement that was called Patch 2: diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc index 7c61ad9b45b..7544f120284 100644 --- a/storage/innobase/row/row0mysql.cc +++ b/storage/innobase/row/row0mysql.cc @@ -3820,8 +3820,6 @@ row_drop_table_for_mysql( trx->op_info = ""; - srv_inc_activity_count(); - DBUG_RETURN(err); } diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc index 0dbd985b6c3..117ff64761c 100644 --- a/storage/innobase/trx/trx0trx.cc +++ b/storage/innobase/trx/trx0trx.cc @@ -1537,6 +1537,7 @@ trx_flush_log_if_needed_low( case 1: /* Write the log and optionally flush it to disk */ log_write_up_to(lsn, flush); + srv_inc_activity_count(); return; case 0: /* Do nothing */
            axel Axel Schwenke added a comment -

            I tested the final fix for this issue on 10.2 .. 10.5. The fix looks very good for most test cases. In 10.2 there is still a slight regression for t_writes-binlog-multi. But all in all the problem is solved. Details: MDEV-23475-final.pdf

            axel Axel Schwenke added a comment - I tested the final fix for this issue on 10.2 .. 10.5. The fix looks very good for most test cases. In 10.2 there is still a slight regression for t_writes-binlog-multi . But all in all the problem is solved. Details: MDEV-23475-final.pdf

            Could that bug occur in 10.3.23?

            jaehyun1148.lee jaehyun1148.lee added a comment - Could that bug occur in 10.3.23?

            People

              marko Marko Mäkelä
              axel Axel Schwenke
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.