MariaDB Server
MDEV-24948

thd_need_wait_reports() hurts performance

Details

    Description

      In MDEV-20612, MDEV-24671 and MDEV-24738, the concurrency of the InnoDB lock subsystem was improved, except for one aspect. Each time the refactored lock_wait() is invoked to handle a lock wait, a predicate must be evaluated. It is not only somewhat time-consuming to evaluate, but the predicate also unnecessarily holds even when replication is not set up, which causes another bottleneck. The code in question is highlighted in the following:

      diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
      --- a/storage/innobase/lock/lock0lock.cc
      +++ b/storage/innobase/lock/lock0lock.cc
      @@ -1758,7 +1758,7 @@ dberr_t lock_wait(que_thr_t *thr)
     is not being used. We want to allow the user to skip
           lock_wait_rpl_report(). */
           const bool rpl= !(type_mode & LOCK_AUTO_INC) && trx->mysql_thd &&
      -      innodb_deadlock_detect && thd_need_wait_reports(trx->mysql_thd);
      +      innodb_deadlock_detect && false && thd_need_wait_reports(trx->mysql_thd);
       #endif
           timespec abstime;
           set_timespec_time_nsec(abstime, suspend_time.val * 1000);
      

      If this condition holds (with the patch it will never hold), then lock_wait_rpl_report() will have to be invoked. That function will not only hold lock_sys.wait_mutex but also hold exclusive lock_sys.latch while traversing the lock graph. This prevents any concurrent registration or removal of locks, even if those locks were not part of any conflict.
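To make the cost concrete, here is a rough sketch of the shape of such a traversal. This is an illustration only: trx_t, waits_for and the latch objects are placeholders and do not match the actual InnoDB data structures. The point is that the expensive part is not the reporting itself but the exclusive latch held across the whole walk.

// Simplified illustration only; not the actual InnoDB code.
#include <mutex>
#include <shared_mutex>

struct THD;

struct trx_t
{
  THD   *mysql_thd;
  trx_t *waits_for;   // the transaction currently blocking this one, if any
};

std::mutex        lock_sys_wait_mutex;  // stands in for lock_sys.wait_mutex
std::shared_mutex lock_sys_latch;       // stands in for lock_sys.latch

// Report every transaction that 'waiter' transitively waits for.  Because the
// latch is held in exclusive mode for the whole traversal, no other thread can
// register or remove any lock anywhere in the system in the meantime.
void report_wait_graph(trx_t *waiter, void (*report)(THD *from, THD *to))
{
  std::lock_guard<std::mutex> wait_guard(lock_sys_wait_mutex);
  std::unique_lock<std::shared_mutex> latch(lock_sys_latch);  // exclusive
  for (trx_t *t= waiter->waits_for; t; t= t->waits_for)
    report(waiter->mysql_thd, t->mysql_thd);
}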

      To highlight the impact of this, I created an extreme benchmark where 20 concurrent connections are accessing a table that contains 10 or 20 rows. To alleviate contention on the buffer pool page latches, I padded the PRIMARY KEY with some extra bytes, so that there should be at most a few records per index page. Here are the results:

      server    rows  log_bin=0/tps  log_bin=1/tps
      plain       10         824.57         819.14
      patched     10         832.04         830.63
      plain       20        3305.20        3324.75
      patched     20        3361.45        3398.25

There is some fluctuation in the numbers due to the random nature of the contention and the short 60-second run time of the test. Most notably, at 20 rows both log_bin=1 runs appeared faster than the corresponding log_bin=0 runs, even on the unmodified (plain) server. There were faster log_bin=0 runs; I chose the slowest ones.

      At 10 rows and 20 threads, the call to thd_need_wait_reports() seems to incur a clear overhead even when --skip-log-bin is in effect.

      I believe that this API should move from 'push' notification to 'pull' notification. That is, the predicate thd_need_wait_reports() must be removed, and instead, whenever the replication layer actually needs to know the waits-for relations, it could invoke a new function handlerton::report_waits(), which InnoDB would basically implement with lock_wait_rpl_report(). We could still invoke thd_rpl_deadlock_check() to report the graph.
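The following is a hypothetical sketch of what such a 'pull' interface could look like. handlerton is reduced to the one new member; report_waits and report_wait_cb are illustrative names, not an existing MariaDB API.

struct THD;

typedef void (*report_wait_cb)(THD *waiter, THD *blocker);

struct handlerton
{
  /* ... existing members ... */

  // Called by the replication layer only when it actually needs the current
  // waits-for relations (instead of InnoDB pushing them on every lock wait).
  // InnoDB would implement this with lock_wait_rpl_report() and invoke the
  // callback (or thd_rpl_deadlock_check()) for each edge it finds.
  void (*report_waits)(struct handlerton *hton, THD *waiter, report_wait_cb cb);
};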

      Attachments

        Issue Links

          Activity

Andrei Elkin added a comment - edited

Marko found out:

commit 30ead1f6966128cbcd32c7b6029ea2170aeef5f9
Author: Jakub Łopuszański <jakub.lopuszanski@oracle.com>
Date:   Thu Mar 25 12:53:07 2021 +0100

    Bug #32618301 INNODB SHOULD REPORT CHANGES IN WAIT-FOR GRAPH TO SERVER LAYER

    InnoDB's lock system only informed the server about the first t2 -waits-for-> t1 edge
    which caused t2 to go to sleep, but when t1 released the lock and t2 still had to wait
    for another transaction, say t2 -waits-for-> t3, InnoDB did not inform the server
    about this.
    This could impact replication, as the coordinator was not aware that t2 waits for t3,
    and thus could not detect a conflict with the intended commit order (t1, t2, t3).


Marko Mäkelä added a comment

MySQL Bug #32618301 INNODB SHOULD REPORT CHANGES IN WAIT-FOR GRAPH TO SERVER LAYER builds upon an earlier change, Bug #29882690 UNDETECTED DEADLOCK WITH GAP AND INSERT INTENTION LOCKS IS POSSIBLE. The ‘push’ notification interface thd_report_row_lock_wait() (which I think is a performance problem) appears to be unaffected by those changes.


Marko Mäkelä added a comment

MDEV-31655 has been filed for the inadequate lock conflict avoidance.

Kristian Nielsen added a comment - edited

            Thanks Marko for doing the benchmark!

            Following up on the discussion on the developers list, here is some further background on this, and ideas for how we can improve.

            The thd_rpl_deadlock_check() is essential for optimistic and aggressive parallel replication. Here is how it works:

            We take a list of transactions to replicate T1, T2, T3, ...
            We start the transactions in any order in parallel, but we ensure that they commit in the same sequence. This ensures correctness, because then they binlog in the same sequence on the slave as on the master; and MySQL/MariaDB replication has the property that applying the binlog in sequence results in the same data as originally on the master.

            If T1 and T2 modify the same table row, we may end up with a deadlock where T1 is waiting for a row lock held by T2, and T2 is waiting for T1 to commit. To break this deadlock, thd_rpl_deadlock_check() is used to report all cases where one THD waits on another. Whenever there's a T_i that waits on T_j, i<j, we have a deadlock, and we resolve it by rolling back T_j so that T_i can commit and T_j can later be retried.
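To illustrate the resolution rule, here is a minimal sketch of the check. commit_seq and request_rollback_and_retry() are placeholders, not actual server names.

struct rpl_trx
{
  unsigned long commit_seq;   // position in the master's commit order
};

void request_rollback_and_retry(rpl_trx *victim);  // kill, roll back, retry later

// Invoked (via thd_rpl_deadlock_check()-style reporting) whenever 'waiter'
// starts waiting on a row lock held by 'holder'.
void on_row_lock_wait(rpl_trx *waiter, rpl_trx *holder)
{
  // T_i waits on T_j with i < j: T_j cannot release its locks before it
  // commits, yet it must commit after T_i, so this wait can never resolve.
  if (waiter->commit_seq < holder->commit_seq)
    request_rollback_and_retry(holder);
}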

This requires a push interface, as it is important for scalability to break the deadlock as soon as possible and allow the earlier transactions to complete.

            The thd_need_wait_reports() function can be used to know which THDs need the call to thd_rpl_deadlock_check(). Ideally this could be used to only have the overhead for parallel replication threads where it is required. Note that thd_need_wait_reports() is a function of the THD, not of the transaction. It need only be called once for a THD and the result can then be cached across transactions.

But in the current code, thd_need_wait_reports() also returns true whenever the binlog is enabled, which then applies to all THDs. I think the assumption was that if one thread needs to wait for another anyway, the wait is already an expensive operation, and so the overhead of reporting it to the server layer would be negligible. If this is not true, maybe the additional use cases should be reconsidered, as they are not essential in the way they are for optimistic parallel replication. They are as follows:

            1. When a transaction commits (to the binlog), it checks if it ever waited on a row lock. If so, a flag FL_WAITED is logged in the GTID event. This is used as a heuristic on the slave in optimistic (but not aggressive) mode to not apply that transaction in parallel, hoping to avoid a conflict.

This heuristic was never really tested to see how beneficial it is, to my knowledge. I only remember the extensive benchmarks of Jean-François Gagné, which I think found that aggressive mode was faster anyway. It might be an idea to question whether this FL_WAITED flag is still useful, or whether we could simply remove the functionality.

Even if we keep it, the information whether a transaction waited during execution is only needed at binlog commit time. So this could be replaced by a pull interface, if InnoDB set a flag in the trx whenever a lock wait occurs.
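A sketch of that pull alternative, assuming InnoDB merely records the fact of a wait in its transaction object; the member lock_waited and the accessor thd_trx_lock_waited() are hypothetical names:

struct trx_t
{
  bool lock_waited= false;   // set in lock_wait() instead of pushing a report
  /* ... */
};

// New pull-style call, used by the server layer only at binlog commit time
// to decide whether to set FL_WAITED in the GTID event of the transaction.
inline bool thd_trx_lock_waited(const trx_t *trx)
{
  return trx->lock_waited;
}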

            2. The second use relates to conservative parallel replication, but in the binlog code on the master. The master marks in the GTID events in the binlog all transactions that group-committed together. The conservative mode slave runs in parallel (only) transactions that committed together and thus are (probably) without conflicts.

The --binlog-commit-wait-usec option can be used on the master to make transactions wait a bit before committing, hoping to get more transactions to group-commit and improve conservative mode slave scalability. A heuristic then exists so that if some transaction T1 is ready to commit but waiting for --binlog-commit-wait-usec, and another transaction T2 needs to wait for T1, then we cut the wait short and commit T1 immediately, hoping to reduce the scalability loss on the master from --binlog-commit-wait-usec.
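A rough sketch of the shape of that "cut the wait short" mechanism; the structure and names are illustrative, not the actual binlog group commit code in MariaDB.

#include <chrono>
#include <condition_variable>
#include <mutex>

struct group_commit_wait
{
  std::mutex mtx;
  std::condition_variable cond;
  bool wake_early= false;

  // T1 is ready to commit and waits up to binlog_commit_wait_usec for more
  // transactions to join its group commit.
  template <class HaveEnough>
  void wait_for_group(long usec, HaveEnough have_enough)
  {
    std::unique_lock<std::mutex> lk(mtx);
    cond.wait_for(lk, std::chrono::microseconds(usec),
                  [&]{ return wake_early || have_enough(); });
  }

  // Called when a lock wait report shows that some T2 is blocked on a
  // transaction already queued here: stop collecting a larger group and let
  // T1 commit immediately, so T2 can proceed.
  void cut_wait_short()
  {
    {
      std::lock_guard<std::mutex> lk(mtx);
      wake_early= true;
    }
    cond.notify_all();
  }
};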

Again, I am not sure whether this heuristic has ever been tested for how effective it is. Conservative mode is also less important after optimistic mode was introduced, even more so the --binlog-commit-wait-usec option, which must be hard to tune without negatively affecting master performance.

So it could also be reconsidered whether this use case is useful to have. Even if we keep it, we should at least make thd_need_wait_reports() return true only if --binlog-commit-wait-usec is non-zero, which will rarely be the case.

            So in summary:

            • thd_rpl_deadlock_check() really is needed for parallel replication threads, and there we must live with the overhead.
            • thd_need_wait_reports() can be called once per THD to know if thd_rpl_deadlock_check() is required.
• If we introduce a pull interface to ask InnoDB if a transaction ever did a lock wait, we can make thd_need_wait_reports() return true only for parallel replication threads and for the special use case of a non-zero --binlog-commit-wait-usec (see the sketch after this list).
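A sketch of the narrowed predicate suggested in the summary above. The helper thd_is_parallel_replication_worker() is a placeholder, and the exact server variable name for --binlog-commit-wait-usec and the simplified signature are assumptions here.

struct THD;

extern unsigned long opt_binlog_commit_wait_usec;        // assumed server global
bool thd_is_parallel_replication_worker(const THD *thd); // placeholder check

extern "C" int thd_need_wait_reports(const THD *thd)
{
  // Parallel replication workers really need thd_rpl_deadlock_check().
  if (thd_is_parallel_replication_worker(thd))
    return 1;
  // Otherwise only the rare, non-default --binlog-commit-wait-usec case.
  return opt_binlog_commit_wait_usec != 0;
}

// InnoDB side: since the answer depends only on the THD, it can be evaluated
// once per connection (or per transaction) and cached, e.g. in the trx object,
// instead of being re-evaluated on every lock wait.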

            (As an aside, one thing InnoDB could do to improve concurrency is to buffer the list of THDs waited for in an array (while holding lock_sys.wait_mutex and lock_sys.latch), and then pass them to thd_rpl_deadlock_check() only after releasing those mutexes. But better if we can avoid thd_rpl_deadlock_check() completely in the common case. Also ensuring THDs do not go away during thd_rpl_deadlock_check() might be tricky.)
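For completeness, a sketch of the buffering idea from the aside above. It shows only the shape: collect_blockers_under_latch() is hypothetical, the thd_rpl_deadlock_check() declaration is simplified, and the THD-lifetime problem mentioned above is deliberately ignored.

#include <vector>

struct THD;
extern "C" int thd_rpl_deadlock_check(THD *waiter, THD *blocker);

// Walk the wait graph while holding lock_sys.wait_mutex and lock_sys.latch,
// but only to collect pointers; do not call into the server layer here.
std::vector<THD *> collect_blockers_under_latch(THD *waiter)
{
  std::vector<THD *> blockers;
  /* ... acquire latches, traverse the lock queue, append blocking THDs,
     release latches ... */
  return blockers;
}

void report_waits_buffered(THD *waiter)
{
  // The latches have already been released before the potentially slow
  // server-layer calls below are made.
  for (THD *blocker : collect_blockers_under_latch(waiter))
    thd_rpl_deadlock_check(waiter, blocker);
}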


Marko Mäkelä added a comment

knielsen, thank you very much for the explanation. For what it is worth, I was wondering whether fixing MDEV-32096 would have any impact on this. The answer is no: if I disable the calls to thd_need_wait_reports(), there will be test failures, including the following:

            Too many failed: Failed 10/5416 tests, 99.82% were successful.
             
            Failing test(s): binlog.binlog_commit_wait rpl.rpl_parallel_optimistic_xa_lsu_off rpl.rpl_parallel_optimistic_nobinlog rpl.rpl_parallel_autoinc rpl.rpl_parallel_conflicts
            

Kristian Nielsen added a comment - edited

            I implemented a draft patch for this as described in my earlier comment: https://github.com/MariaDB/server/pull/3823

This should eliminate the overhead of thd_rpl_deadlock_check() on non-parallel-replication threads in most cases.

            This patch changes some behaviour of optimistic parallel replication and (non-default) option --binlog-commit-wait-count. These changes in behaviour need to be considered carefully (and may not be appropriate for old GA 10.6 release), but I think they are overall good.

Another idea (that I did not implement) could be to make thd_rpl_deadlock_check() a batch interface. InnoDB could collect waited-for THDs in a local array and pass that array to the server layer to process multiple waits in one call. This would be more efficient by making fewer calls (only one in the common case where the number of waits fits in a local array). But maybe this patch is good enough.

            Andrei, I've re-assigned this to me now that I have created a patch, hope that is ok.


Kristian Nielsen added a comment

I ran Marko's benchmark script before and after my draft patch. Like Marko, I see a small speedup; I got the most stable numbers with log_bin=1:

server    rows  log_bin=1/tps
plain       20        3305.20
patched     20        3319.71

            People

              knielsen Kristian Nielsen
              marko Marko Mäkelä