[MDEV-15423] Reduce contention on commit order monitor Created: 2018-02-26  Updated: 2019-11-04  Resolved: 2019-11-04

Status: Closed
Project: MariaDB Server
Component/s: Galera
Fix Version/s: N/A

Type: Task Priority: Minor
Reporter: Alexander Krizhanovsky (Inactive) Assignee: Julius Goryavsky
Resolution: Won't Fix Votes: 0
Labels: galera, galera_4


 Description   

The optimization is available at https://github.com/percona/percona-xtradb-cluster/commit/c01c75da8ec67197e5b2f86edbdd9541df717faa , but massive changes are required due to differences between PXC and MariaDB code affected by the patch. The value of the optimization is described in https://www.percona.com/blog/2017/04/19/how-we-made-percona-xtradb-cluster-scale/ and https://www.percona.com/blog/2017/04/19/performance-improvements-percona-xtradb-cluster-5-7-17/



 Comments   
Comment by Julius Goryavsky [ 2018-06-14 ]

In order to realize the changes needed to speed up the group commits, we needed to bring Galera into line with the latest changes from Codership and to supplement it with a whole series of commit-related changes that were originally made in percona xtradb server. There is two commits with intended to synchronize Maria's version of Galera subsystem with the latest version of Galera 3.26 from Codership and latest changes from pxc:

https://github.com/MariaDB/galera/commit/be6b061a106e337e616f72dc628c3e3a892b41d1
https://github.com/MariaDB/galera/commit/d86e6967da5455cfec0ef05236614694d0b80d59

Next, I will list the major changes and bug fixes that are introduced into this version of Galera in addition to the changes that add new stages to speed up processing of the group commits:

epoll_reactor.ipp:

Ffixed ASIO-related bug: we should read an 8-byte timeout expiration counter (using the "timerfd" descriptor) every time when the EPOLLIN event occurs. Otherwise, the timer stops to notifying the application about new events, and all activity associated with this timer will be frozen on some systems.

ist.cpp:

Introduced "interrupted_" flag to better error handling after interruption of the IST thread. Also it is necessary to push the loop in the run() method ahead - if it awaiting the signal.

replicator_smm.cpp:

Introduction of new applier_pre_commit(), applier_interim_commit and applier_post_commit() callbacks for fast commit handling.

replicator_str.cpp:

1) ReplicatorSMM::sst_received():

We need to check the state only after we signalized about completion of the SST - otherwise the request_state_transfer() function will be infinitely wait on the sst_cond_ condition variable, for which no one will call the signal() function.

2) ReplicatorSMM::process_state_req():

When new node joining the cluster, it may trying to avoid unnecessary SST request. However, the heuristic algorithm, which selects the donor node, does not give us a 100% guarantee that seqno will not move forward while new
node sending its request (to joining the cluster). Therefore, if seqno had gone forward, and if we have only the IST request (without the SST part), then we need to inform new node that it should prepare to receive full state and re-send the SST request (if the server supports it).

3) ReplicatorSMM::send_state_request():

Although the current state has lagged behind state of the group, we can save it for the next attempt of the joining cluster, because we do not know how other nodes will finish their work.

If in the future someone will change the code above (and the error handling at the GCS level), then the ENODATA error will no longer be fatal. Therefore, we will need here additional test for "ret != ENODATA". Since it is rare event associated with error handling, then the presence here of one additional comparison is not an issue for system performance.

4) ReplicatorSMM::request_state_transfer():

We must mark the state as the "unsafe" before SST because the current state may be changed during execution of the SST and it will no longer match the stored seqno (state becomes "unsafe" after the first modification of data during the SST, but unfortunately, we do not have callback or wsrep API to notify about the first data modification). On the other hand, in cases where the full SST is not required and we want to use IST, we need to save the urrent state - to prevent unnecessary SST after node restart (if IST fails before it starts applying transaction). Therefore, we need to check whether the full state transfer (SST) is required or not, before marking state as unsafe.

Also, we should not wait for completion of the SST or to handle it results if an error has occurred when sending the request.

5) ReplicatorSMM::apply_trx():

Added new commit monitor bypass: TOI action are fully serialized so it is make sense to enforce commit ordering at this stage. For non-TOI action commit ordering is delayed to take advantage of full parallelism.

6) ReplicatorSMM::sst_sent():

If sst-sent fails node should restore itself back to joined state. sst-sent can fail commonly due to n/w error where-in DONOR may loose connectivity to JOINER (or existing cluster) but on re-join it should restore the original state (DONOR->JOINER->JOINED->SYNCED) without waiting for JOINER. sst-failure on JOINER will gracefully shutdown the joiner.

Comment by Julius Goryavsky [ 2018-06-14 ]

replicator_smm.cpp:

7) ReplicatorSMM::process_conf_change():

If SST operation was canceled, we shall immediately return from the function to avoid hang-up in the monitor drain code and avoid restart of the SST.

8) ReplicatorSMM::process_conf_change():

Also, if the IST/SST request was canceled due to error at the GCS level or if request was canceled by another thread (by initiative of the server), and if the node remain in the S_JOINING state, then we must return it to the S_CONNECTED state (to the original state, which exist before the request_state_transfer started). In other words, if state transfer failed, then we need to move node back to the original state, because joining was canceled.

9) ReplicatorSMM::process_conf_change():

We should not try to joining the cluster at the GCS level in any state other than SST_WAIT (f.e. in the SST_NONE or SST_CANCELED).

galera/src/replicator_smm_stats.cpp:

Added some IST- and flow-control related statistics, last-commited reporting, gcache pool size statistics, certification pool size statistics.

galera/src/replicator_str.cpp:

1) ReplicatorSMM::state_transfer_required():

If local state sequence number is greater than group sequence number: states diverged on SST. We cannot move server forward (with local_seqno > group_seqno) to avoid potential data loss, and hence will have to shut it down. User must to remove state file and then restart server, if he/she wish to continue.

2) ReplicatorSMM::sst_received():

We need to check the state only after we signalized about completion of the SST - otherwise the request_state_transfer() function will be infinitely wait on the sst_cond_ condition variable, for which no one will call the signal() function. S_CONNECTED also valid here if sst_received() called just after send_state_request(), when the state yet not shifted to S_JOINING.

3) ReplicatorSMM::process_state_req():

When new node joining the cluster, it may trying to avoid unnecessary SST request. However, the heuristic algorithm, which selects the donor node, does not give us a 100% guarantee that seqno will not move forward while new node sending its request (to joining the cluster). Therefore, if seqno had gone forward, and if we have only the IST request (without the SST part), then we need to inform new node that it should prepare to receive full
state and re-send the SST request (if the server supports it).

4) ReplicatorSMM::send_state_request():

Although the current state has lagged behind state of the group, we can save it for the next attempt of the joining cluster, because we do not know how other nodes will finish their work.

5) ReplicatorSMM::send_state_request():

If in the future someone will change the code above (and the error handling at the GCS level), then the ENODATA error will no longer be fatal. Therefore, we will need here additional test for "ret != ENODATA". Since it is rare event associated with error handling, then the presence here of one additional comparison is not an issue for system performance.

6) ReplicatorSMM::request_state_transfer():

We must mark the state as the "unsafe" before SST because the current state may be changed during execution of the SST and it will no longer match the stored seqno (state becomes unsafe" after the first modification of data during the SST, but unfortunately, we do not have callback or wsrep API to notify about the first data modification). On the other hand, in cases where the full SST is not required and we want to use IST, we need to save the current state - to prevent unnecessary SST after node restart (if IST fails before it starts applying transaction). Therefore, we need to check whether the full state transfer (SST) is required or not, before marking state as unsafe.

7) ReplicatorSMM::request_state_transfer():

We should not do the IST when we left S_JOINING state (for example, if we have lost the connection to the network or we were evicted from the cluster) or when SST was failed or cancelled.

8) ReplicatorSMM::recv_IST():

If the current position is defined (for example, when there were no SST before IST), then we need to change it to an undefined position before applying the first transaction, since during the application of transactions (or after the IST) server may fail.

Comment by Julius Goryavsky [ 2018-06-14 ]

galera/src/wsdb.cpp and galera/src/wsdb.hpp:

Fixed problems caused by TrxMap structure, which doesn't take into consideration presence of two trx objects with same trx_id (2^64 - 1 which is default trx_id) belonging to 2 different connections. This eventually causes same trx object to get shared among 2 different unrelated connections which causes state in-consistency leading to crash (RACE CONDITION).

galera/src/key_set.cpp, galera/src/key_set.hpp and many other files:

Enforced 8-byte alignment in the writesets structures.

Comment by Julius Goryavsky [ 2018-07-29 ]

Current status: IN PROGRESS. During the debugging of this performance improvement, I removed a few changes that are not related to locking on the commit monitor, but require large changes in the server code (mysqld), since they are not relevant to this task.

Comment by Julius Goryavsky [ 2018-08-06 ]

Current status: IN PROGESS: made a similar cleanup of the server code so that it contains only those changes that are necessary to avoid the delay on the commit monitor, in particular added new stages to the transaction processing in the WSREP API.

Comment by Julius Goryavsky [ 2018-08-14 ]

Current status: IN PROGESS: Continuation of previous work - trying to make branch (based on the 10.3 branch of the server), which compiles without errors and which bypass the locking on the commit monitor (using updated WSREP API with new commit stages).

Comment by Julius Goryavsky [ 2018-08-19 ]

Current status: IN PROGRESS: Splitting wsrep_run_wsrep_commit() into two separate functions: wsrep_replicate() and wsrep_pre_commit() and other related changes, mainly affecting the wsrep_hton.cc file.

Comment by Julius Goryavsky [ 2018-08-27 ]

Current status: IN PROGESS: finished the changes in the source text, now doing merge with the head revision to go to the debugging stage (with Galera)

Comment by Julius Goryavsky [ 2018-09-10 ]

Current status: DEBUGGING: proceeded to complex debugging of Galera + improved mysql. Eliminated two regressions in the tests from the "parts" set. Now I'm eliminating the regression associated with auto-increment, which does not yet work with the delayed capture of the commit monitor.

Comment by Julius Goryavsky [ 2018-09-24 ]

Current status: Finished debugging changes to Galera without including a breakdown of the commit/replication callbacks into two stages and posted fixes for Galera on github (for now, without publishing to upstream). Currently, I'm testing the changes in mysqld - now with the new commit logic is enabled.

Comment by Julius Goryavsky [ 2018-09-30 ]

Current status: I study the organization of the transaction queue for group commit in MariaDB and the interaction between handler.cc and log.cc. In MariaDB group commit is implemented quite differently in comparison with oracle mysql and PXC. Therefore, I need to select a new point for inserting new replication and interim_commit callback, which are needed to implement accelerated transaction processing in accordance with the new Galera interface (new wsrep API).
+ merged Galera with latest ASIO release and latest codership/PXC patches related to communication layer bugs, certification bugs and group commit processing.

Comment by Julien Fritsch [ 2019-05-17 ]

sysprg please indicate the fixversion targeted

Comment by Jan Lindström (Inactive) [ 2019-11-04 ]

Decided not to do massive changes to GA-releases and similar improvements are on newer release.

Generated at Thu Feb 08 08:21:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.