
MDEV-20065: parallel replication for galera slave


    Description

      A Galera node acting as a regular replication slave is limited to a single-threaded
      slave applier, because the pre_commit transaction ordering imposed by Galera may be
      set up to violate the binlog group commit (BGC) ordering, so that two transactions
      from the same binlog group end up in a deadlock, like:

      Gtid_seq_no= 2
      Thread 34 (Thread 0x7fcd966d2700 (LWP 23891)):
      #0  0x00007fcda6d56415 in pthread_cond_wait@@GLIBC_2.3.2 () from
      /usr/lib/libpthread.so.0
      #1  0x00005569d607d380 in safe_cond_wait (cond=0x7fcd854078e8,
      mp=0x7fcd85407838, file=0x5569d6240360
      "/home/sachin/10.1/server/include/mysql/psi/mysql_thread.h",
      line=1154) at /home/sachin/10.1/server/mysys/thr_mutex.c:493
      #2  0x00005569d5aec4d0 in inline_mysql_cond_wait (that=0x7fcd854078e8,
      mutex=0x7fcd85407838, src_file=0x5569d6240cb8
      "/home/sachin/10.1/server/sql/log.cc", src_line=7387) at
      /home/sachin/10.1/server/include/mysql/psi/mysql_thread.h:1154
      #3  0x00005569d5afeee5 in MYSQL_BIN_LOG::queue_for_group_commit
      (this=0x5569d692d7c0 <mysql_bin_log>, orig_entry=0x7fcd966cf440) at
      /home/sachin/10.1/server/sql/log.cc:7387
      #4  0x00005569d5aff5c9 in
      MYSQL_BIN_LOG::write_transaction_to_binlog_events (this=0x5569d692d7c0
      <mysql_bin_log>, entry=0x7fcd966cf440) at
      /home/sachin/10.1/server/sql/log.cc:7607
      #5  0x00005569d5afecff in MYSQL_BIN_LOG::write_transaction_to_binlog
      (this=0x5569d692d7c0 <mysql_bin_log>, thd=0x7fcd84c068b0,
      cache_mngr=0x7fcd84c72c70, end_ev=0x7fcd966cf5e0, all=true,
      using_stmt_cache=true, using_trx_cache=true) at
      /home/sachin/10.1/server/sql/log.cc:7290
      #6  0x00005569d5af0ce6 in binlog_flush_cache (thd=0x7fcd84c068b0,
      cache_mngr=0x7fcd84c72c70, end_ev=0x7fcd966cf5e0, all=true,
      using_stmt=true, using_trx=true) at
      /home/sachin/10.1/server/sql/log.cc:1751
      #7  0x00005569d5af11bb in binlog_commit_flush_xid_caches
      (thd=0x7fcd84c068b0, cache_mngr=0x7fcd84c72c70, all=true, xid=2) at
      /home/sachin/10.1/server/sql/log.cc:1859
      #8  0x00005569d5b045c8 in MYSQL_BIN_LOG::log_and_order
      (this=0x5569d692d7c0 <mysql_bin_log>, thd=0x7fcd84c068b0, xid=2,
      all=true, need_prepare_ordered=false, need_commit_ordered=true) at
      /home/sachin/10.1/server/sql/log.cc:9575
      #9  0x00005569d5a1ec0d in ha_commit_trans (thd=0x7fcd84c068b0,
      all=true) at /home/sachin/10.1/server/sql/handler.cc:1497
      #10 0x00005569d5925e7e in trans_commit (thd=0x7fcd84c068b0) at
      /home/sachin/10.1/server/sql/transaction.cc:235
      #11 0x00005569d5b1b1fa in Xid_log_event::do_apply_event
      (this=0x7fcd8542a770, rgi=0x7fcd85407800) at
      /home/sachin/10.1/server/sql/log_event.cc:7720
      #12 0x00005569d5743fa1 in Log_event::apply_event (this=0x7fcd8542a770,
      rgi=0x7fcd85407800) at /home/sachin/10.1/server/sql/log_event.h:1343
      #13 0x00005569d573987e in apply_event_and_update_pos_apply
      (ev=0x7fcd8542a770, thd=0x7fcd84c068b0, rgi=0x7fcd85407800, reason=0)
      at /home/sachin/10.1/server/sql/slave.cc:3479
      #14 0x00005569d5739deb in apply_event_and_update_pos_for_parallel
      (ev=0x7fcd8542a770, thd=0x7fcd84c068b0, rgi=0x7fcd85407800) at
      /home/sachin/10.1/server/sql/slave.cc:3623
      #15 0x00005569d597bfbe in rpt_handle_event (qev=0x7fcd85424770,
      rpt=0x7fcd85421c88) at /home/sachin/10.1/server/sql/rpl_parallel.cc:50
      #16 0x00005569d597ed57 in handle_rpl_parallel_thread
      (arg=0x7fcd85421c88) at
      /home/sachin/10.1/server/sql/rpl_parallel.cc:1258
       
      Gtid_seq_no= 1
      Thread 33 (Thread 0x7fcd9671d700 (LWP 23890)):
      #0  0x00007fcda6d56415 in pthread_cond_wait@@GLIBC_2.3.2 () from
      /usr/lib/libpthread.so.0
      #1  0x00007fcd9e7778ab in gu::Lock::wait (this=0x7fcd9671a0c0,
      cond=...) at galerautils/src/gu_mutex.hpp:40
      #2  galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter
      (this=this@entry=0x7fcda12d5da0, obj=...) at
      galera/src/monitor.hpp:124
      #3  0x00007fcd9e771f28 in galera::ReplicatorSMM::pre_commit
      (this=0x7fcda12d5000, trx=0x7fcd8507e000, meta=<optimized out>) at
      galera/src/replicator_smm.cpp:796
      #5  0x00005569d59864d0 in wsrep_run_wsrep_commit (thd=0x7fcd85006a70,
      all=true) at /home/sachin/10.1/server/sql/wsrep_hton.cc:492
      #6  0x00005569d5984d6a in wsrep_prepare (hton=0x7fcda583e270,
      thd=0x7fcd85006a70, all=true) at
      /home/sachin/10.1/server/sql/wsrep_hton.cc:208
      #7  0x00005569d5a1e1b0 in prepare_or_error (ht=0x7fcda583e270,
      thd=0x7fcd85006a70, all=true) at
      /home/sachin/10.1/server/sql/handler.cc:1196
      #8  0x00005569d5a1ea1c in ha_commit_trans (thd=0x7fcd85006a70,
      all=true) at /home/sachin/10.1/server/sql/handler.cc:1475
      #9  0x00005569d5925e7e in trans_commit (thd=0x7fcd85006a70) at
      /home/sachin/10.1/server/sql/transaction.cc:235
      #10 0x00005569d5b1b1fa in Xid_log_event::do_apply_event
      (this=0x7fcd8542a570, rgi=0x7fcd85407000) at
      /home/sachin/10.1/server/sql/log_event.cc:7720
      #11 0x00005569d5743fa1 in Log_event::apply_event (this=0x7fcd8542a570,
      rgi=0x7fcd85407000) at /home/sachin/10.1/server/sql/log_event.h:1343
      

      In the traces above, the second of the two transactions (Gtid_seq_no= 1), though BGC-ordered earlier, gets to commit only after the first (Gtid_seq_no= 2), which is BGC-ordered later. The first holds a Galera resource (the commit-order monitor) while attempting to yield BGC ordering control to the second, which is locked out of that same Galera resource.
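
      To make the cycle concrete, below is a minimal, self-contained model of the inversion (plain C++, not server code; OrderGate is an invented stand-in for both the BGC queue and Galera's commit-order monitor). Each transaction must pass both ordering gates, and the two gates rank the transactions in opposite orders; the timed wait is only there so the demo terminates and reports, instead of hanging forever as the real server does:

        #include <chrono>
        #include <condition_variable>
        #include <cstdio>
        #include <mutex>
        #include <thread>

        // Lets waiters through strictly in ticket order. The timeout makes
        // the demo report the deadlock instead of hanging.
        struct OrderGate {
          std::mutex m;
          std::condition_variable cv;
          int next = 0;
          bool enter(int ticket) {
            std::unique_lock<std::mutex> lk(m);
            return cv.wait_for(lk, std::chrono::seconds(2),
                               [&] { return next == ticket; });
          }
          void leave() {
            { std::lock_guard<std::mutex> lk(m); ++next; }
            cv.notify_all();
          }
        };

        int main() {
          OrderGate bgc;    // BGC order: seq_no 1 first, then seq_no 2
          OrderGate galera; // Galera commit order, inverted: seq_no 2 first

          std::thread t1([&] {      // transaction with Gtid_seq_no= 1
            if (!galera.enter(1))   // blocked: Galera ranked it second
              std::puts("seq_no 1: stuck in Galera commit-order monitor");
          });
          std::thread t2([&] {      // transaction with Gtid_seq_no= 2
            galera.enter(0);        // succeeds: holds the Galera commit-order slot
            if (!bgc.enter(1))      // blocked: BGC wants seq_no 1 to commit first
              std::puts("seq_no 2: stuck in queue_for_group_commit");
          });
          t1.join();
          t2.join();
          return 0;
        }

      Neither thread can make progress: t2 will not release the Galera slot until it passes BGC, and t1 cannot reach its BGC turn until it gets the Galera slot.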

      There are a few ideas for how to open up the parallel slave applier for a Galera slave. In one, we consider introducing an interface for carrying out an arbitrary action in a specified BGC order; such an action could be wsrep_prepare as well.
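
      As an illustration of what such an interface could look like, here is a standalone sketch; every name in it is hypothetical rather than existing server API, and the action passed in would be the wsrep_prepare-like certification step:

        #include <condition_variable>
        #include <functional>
        #include <mutex>

        // Hypothetical interface: run each transaction's action strictly in
        // BGC order, identified here by the Gtid sequence number.
        class BGCOrderer {
          std::mutex m;
          std::condition_variable cv;
          long next_seq = 1;   // next Gtid_seq_no allowed to run its action
        public:
          // Blocks until every transaction with a smaller seq_no has run its
          // action, runs `action` (e.g. certification), then releases the
          // next transaction in the sequence.
          int run_in_order(long seq_no, const std::function<int()> &action) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return next_seq == seq_no; });
            lk.unlock();
            int res = action();
            lk.lock();
            ++next_seq;        // pass the ordering token to the successor
            cv.notify_all();
            return res;
          }
        };

      A parallel worker applying Gtid_seq_no= N would then run only its prepare/certification step through run_in_order(N, ...), while the row events before that point still apply concurrently.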


          Activity

            jplindst Jan Lindström (Inactive) added a comment -

            seppo Is this even a valid request now that we have the wsrep_slave_threads parameter?
            ltning Eirik Øverby added a comment -

            This is very much an issue on 10.11, and it is a business-critical showstopper for us. We would greatly appreciate it if this issue were revived and prioritised accordingly. Our migration from MySQL to MariaDB on a 100+ TB cluster is currently frozen because of this, as synchronisation between our two main clusters cannot keep up with transaction volumes.

            To be clear, this was not an issue with MySQL 5.7 + Galera.

            knielsen Kristian Nielsen added a comment -

            I was asked to give my analysis of this issue.

            The underlying issue is well described in the original bug report MDEV-6860. MariaDB replication has a pre-determined commit order, fixed on the original master and recorded in the GTID sequence numbers. Galera certification likewise determines a specific commit order that must be consistent across all nodes in the cluster. These two commit orders need to match.

            The stack trace given in the description is from a pre-10.4 version of the code. The corresponding code since 10.4 is substantially changed, though the basic structure seems to be the same: the Galera certification happens in the prepare phase of the MariaDB two-phase commit. In particular, these two commits seem to address the specific issue described here with a hang between two parallel replication worker threads:

              commit 8d12dd8f503282179a078f2f883b88f6ccee5ebd
              Author: Daniele Sciascia <daniele.sciascia@galeracluster.com>
              Date:   Wed May 11 14:33:20 2022 +0200
             
                  MDEV-28053 Sysbench data load crashes Galera secondary node in async master slave setup
              ...
                To correct this behavior we now wait_for_prior_commit() before
                replicating changes though galera. As a consequence, parallel appliers
                may apply events in parallel until the galera replication step, which
                is now serialized.
            

              commit 304f75c97311a1b746d9bb6bc94de415b5daa21c
              Author: mkaruza <mario.karuza@galeracluster.com>
              Date:   Wed Feb 16 15:05:58 2022 +0100
             
                  MDEV-27568 Parallel async replication hangs on a Galera node
            

            The point of these commits is to do a wait_for_prior_commit() before certifying each transaction, thus ensuring that the replication commit order (enforced by wait_for_prior_commit()) matches the galera commit order.
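
            Schematically, the fixed scheme can be modelled as below (a simplified standalone sketch, not the actual server code; apply_row_events(), replicate_and_certify() and commit() are stand-in stubs, and CommitOrder stands in for the real wait_for_prior_commit()/wakeup_subsequent_commits() machinery). The whole Galera replication step sits inside the critical section delimited by the prior-commit wait, so at most one worker is in it at a time:

              #include <condition_variable>
              #include <mutex>

              // Stand-in for MariaDB's commit-order primitives.
              struct CommitOrder {
                std::mutex m;
                std::condition_variable cv;
                long committed = 0;   // highest seq_no that has committed
                void wait_for_prior_commit(long seq_no) {
                  std::unique_lock<std::mutex> lk(m);
                  cv.wait(lk, [&] { return committed == seq_no - 1; });
                }
                void wakeup_subsequent_commits(long seq_no) {
                  { std::lock_guard<std::mutex> lk(m); committed = seq_no; }
                  cv.notify_all();
                }
              };

              // Stubs standing in for the real work; bodies elided.
              void apply_row_events(long) {}
              void replicate_and_certify(long) {}  // one Galera round-trip
              void commit(long) {}

              // One parallel worker applying the transaction `seq_no`.
              void apply_transaction(CommitOrder &order, long seq_no) {
                apply_row_events(seq_no);            // still fully parallel
                order.wait_for_prior_commit(seq_no); // critical section begins
                replicate_and_certify(seq_no);       // serialised across workers
                commit(seq_no);
                order.wakeup_subsequent_commits(seq_no); // section ends
              }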

            This fix is included from 10.4.25, 10.5.16, 10.6.8, and 10.11.1 onwards. Eirik mentions still having problems in 10.11.1, but we need updated information about exactly what problem is being seen there. The stack traces in the description are no longer representative of the code, nor does the description match the current use of wait_for_prior_commit() in the Galera code.


            knielsen Kristian Nielsen added a comment -

            Another remark on this, that was also mentioned in the original MDEV-6860:

            Galera should be integrated properly in the MariaDB replication architecture. The internal transaction coordinator interface was specifically designed to facilitate this, by Galera implementing the TC interface, replacing the normal binlog implementation.

            The wait_for_prior_commit() would seem to cause Galera certification to become serialised, which would limit scalability according to the latency between nodes. With a proper TC implementation, Galera could implement log_and_order(), and in particular something like queue_for_group_commit(). This would allow Galera to receive a list of replication transactions in their specified commit order; their certification could be done as a group, sharing any certification network latency among all those transactions. This may be less of an issue if the Galera cluster is configured with a single write node only, so that certification can be done locally on that node.
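
            As a thought experiment only (no such code exists today in the server or in Galera; certify_batch() and all other names here are invented), a Galera-side queue_for_group_commit() could look roughly like this: workers enqueue in commit order, the first one becomes the batch leader, and a single provider round-trip certifies the whole group:

              #include <algorithm>
              #include <condition_variable>
              #include <mutex>
              #include <vector>

              class GroupCertifier {
                std::mutex m;
                std::condition_variable cv;
                std::vector<long> pending;  // write sets queued in commit order
                bool leader_active = false;
                long queued_up_to = 0;      // enforces in-order queueing
                long certified_up_to = 0;   // highest seq_no certified so far
              public:
                void queue_and_certify(long seq_no) {
                  std::unique_lock<std::mutex> lk(m);
                  // Queueing happens in BGC commit order; this wait is cheap
                  // next to the certification round-trip being amortised.
                  cv.wait(lk, [&] { return queued_up_to == seq_no - 1; });
                  queued_up_to = seq_no;
                  pending.push_back(seq_no);
                  cv.notify_all();
                  while (certified_up_to < seq_no) {
                    if (leader_active) {    // someone is certifying a batch
                      cv.wait(lk, [&] {
                        return !leader_active || certified_up_to >= seq_no;
                      });
                      continue;
                    }
                    leader_active = true;   // become leader, drain the batch
                    std::vector<long> batch;
                    batch.swap(pending);
                    lk.unlock();
                    certify_batch(batch);   // ONE provider round-trip for all
                    lk.lock();
                    certified_up_to = std::max(certified_up_to, batch.back());
                    leader_active = false;
                    cv.notify_all();
                  }
                }
              private:
                // Invented stub: one call into the provider carrying the whole
                // ordered batch of write sets, sharing the network latency.
                void certify_batch(const std::vector<long> &) {}
              };

            The point of the model is only the amortisation: N transactions waiting behind one round-trip pay the inter-node latency once rather than N times.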

            teemu.ollakka Teemu Ollakka added a comment -

            As a follow-up to the analysis from knielsen in https://jira.mariadb.org/browse/MDEV-20065?focusedCommentId=293438&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-293438

            The call to wait_for_prior_commit() before certification ensures that Galera respects the commit order of the original master. However, it makes certification of transactions happen one by one regardless of the number of parallel slave workers, so the maximum commit rate is limited by the Galera replication latency; for example, with 2 ms of replication latency a serialised certification step caps the slave at roughly 500 commits per second, no matter how many workers are configured. In other words, the wait_for_prior_commit() that happens inside the wsrep_before_prepare() call starts a critical section which is released only some time after the transaction has been replicated and certified.

            In order to allow more concurrency during the replication/certification phase, we can rely on the fact that, in addition to the total order for commits, the Galera provider ensures sequential consistency for write sets originating from the same node. To ensure that the master's commit order is maintained, it is enough to release the critical section once the provider has queued the write set for replication. This way, write sets from parallel slave workers are batched/pipelined inside Galera in the same way as locally executed transactions, and total throughput should improve significantly as the number of slave workers grows.
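
            A standalone model of this narrower critical section, for contrast with the serialised scheme sketched earlier (again a sketch, not the actual patch; queue_write_set() and wait_certified_and_commit() are invented stand-ins): the ordering token is handed over as soon as the provider has queued the write set, so the replication/certification round-trips of consecutive transactions overlap, while the provider's sequential consistency for same-node write sets preserves the master's commit order:

              #include <condition_variable>
              #include <mutex>

              // The commit-order token now covers only the queueing step.
              struct QueueOrder {
                std::mutex m;
                std::condition_variable cv;
                long queued = 0;   // highest seq_no whose write set is queued
                void wait_for_prior_queued(long seq_no) {
                  std::unique_lock<std::mutex> lk(m);
                  cv.wait(lk, [&] { return queued == seq_no - 1; });
                }
                void mark_queued(long seq_no) {
                  { std::lock_guard<std::mutex> lk(m); queued = seq_no; }
                  cv.notify_all();
                }
              };

              // Stubs for the real work; bodies elided.
              void apply_row_events(long) {}
              void queue_write_set(long) {}  // hand the write set to the provider
              void wait_certified_and_commit(long) {}

              void apply_transaction(QueueOrder &order, long seq_no) {
                apply_row_events(seq_no);            // parallel, as before
                order.wait_for_prior_queued(seq_no); // short critical section
                queue_write_set(seq_no);             // provider fixes total order here
                order.mark_queued(seq_no);           // token released immediately
                wait_certified_and_commit(seq_no);   // now pipelined across workers
              }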

            ltning Eirik Øverby added a comment -

            Does any of this explain why we currently get a full stop in replication if we raise the replication worker thread count above 1? Or is that a different problem?

            seppo Seppo Jaakola added a comment -

            ltning Replication stopping completely could be due to https://jira.mariadb.org/browse/MDEV-35465 . teemu.ollakka fixed this as part of the actual work to enable parallel replication worker execution in a Galera node.

            seppo Seppo Jaakola added a comment -

            Assigned this to teemu.ollakka as he has worked on this and prepared a PR for Codership-side review and testing. Extensive testing will be needed for this; so far, mtr feature testing and sysbench stress testing have been done.

            The fix involves changes in the wsrep patch, wsrep-lib, and the galera library, but luckily no protocol changes. The Galera-side changes are already pushed in Codership's galera branch and will be in the next release.


            People

              Assignee: teemu.ollakka Teemu Ollakka
              Reporter: Elkin Andrei Elkin

              Votes: 6
              Watchers: 13
