[MDEV-17565] Sporadic Galera failures when testing MariaDB with mtr Created: 2018-10-30  Updated: 2021-06-24  Resolved: 2021-06-24

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.1.36, 10.2.18, 10.3
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Julius Goryavsky Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: Galera, galera


 Description   

Several bugs in Galera cause sporadic failures when testing MariaDB Server with mtr, due to spurious error messages (or warnings) that are not related to the fixes being tested:

1) Some mtr tests sometimes fail due to

[Warning] WSREP: gcs_caused() returned -1

warnings.

2) Some mtr tests sometimes fail due to

[Warning] WSREP: Failed to report last committed <number>

warnings.

3) Some mtr tests sometimes fail when a node is evicted from the cluster in the middle of an SST.

4) If SST fails due to a network error, the node that acted as a donor sometimes does not return to its original state, which leads to failure due to the inability to continue the test execution (due to a timeout).



 Comments   
Comment by Julius Goryavsky [ 2018-10-30 ]

https://github.com/MariaDB/galera/pull/4

The patch includes some changes taken from the latest
versions of Galera from Codership, some changes taken from
the PXC version of Galera, and some corrections made by me.

1) Some mtr tests sometimes fail due to
"[Warning] WSREP: gcs_caused() returned -1" warnings:

Currently the gcs_.caused() function works only when the group
is primary, and fails if the group is non-primary or even if
the group is in a transient state (during configuration changes).

Instead of failing immediately, this patch changes gcs_.caused()
to return the EAGAIN error code when it is called while the
group is in a transient state. On receiving EAGAIN,
ReplicatorSMM::causal_read() retries obtaining a new seqno
(by calling gcs_.caused() again).
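
The retry described above can be modelled roughly as follows. This is
a minimal Python sketch, not the Galera code: the callable `caused`,
the attempt limit and the delay are assumptions made for illustration,
and it assumes failures are reported as negative errno values.

```python
import errno
import os
import time


def causal_read(caused, max_attempts=10, delay_s=0.0):
    """Return a seqno from `caused`, retrying while it reports EAGAIN.

    `caused` is a hypothetical stand-in for gcs_.caused(): it returns a
    non-negative seqno on success or a negative errno value on failure.
    """
    for _ in range(max_attempts):
        ret = caused()
        if ret >= 0:
            return ret  # got a seqno
        if ret != -errno.EAGAIN:
            # Any other error is treated as final, as before the patch.
            raise OSError(-ret, os.strerror(-ret))
        # Transient state (configuration change in progress): try again.
        time.sleep(delay_s)
    raise TimeoutError("group stayed in a transient state")
```

Bounding the number of attempts is a choice of this sketch; the ticket
does not say whether the real loop is bounded.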

2) Some mtr tests sometimes fail due to
"[Warning] WSREP: Failed to report last committed <number>"
warnings:

This happens because, when processing cluster configuration
changes, the GCS layer does not always update the
group->last_applied variable in time.

To correct this, I added an additional call to the
group_redo_last_applied() function. In addition, to protect
against other similar situations, I added a loop that calls
gcs_.set_last_applied() again if it fails because internal
operations were interrupted in the current Galera implementation.
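
To illustrate why a stale group->last_applied matters, here is a
small Python model. It is only a sketch under stated assumptions: the
`Node` and `Group` classes and the `redo` flag are hypothetical, and it
assumes the group-wide value is the minimum of the members' values.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_applied: int


class Group:
    """Toy model of a group that tracks a group-wide last_applied."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.last_applied = 0
        self.redo_last_applied()

    def redo_last_applied(self):
        # Assumption: the group value is the minimum over all members.
        self.last_applied = min(
            (n.last_applied for n in self.nodes), default=0
        )

    def handle_config_change(self, nodes, redo=True):
        self.nodes = list(nodes)
        if redo:
            # Recompute right away so the value reflects the new
            # membership instead of the previous one.
            self.redo_last_applied()
```

With `redo=False` the group keeps the value computed for the old
membership, which is the kind of lag described above.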

3) Some mtr tests sometimes fail when a node is evicted from
the cluster in the middle of an SST.

Even when the node has been evicted, the SST script may complete
normally. After this, the node calls the gcs_join() function and
tries to join the cluster. However, this is impossible, because
the node has already been evicted. Therefore, the _join() function
(which is called from gcs_join()) fails. Then the node attempts IST
(which also fails), after or during which it is aborted.

To fix this, we should avoid joining the cluster through the
gcs_join() function if the node has been evicted. To do this, we
check the current connection state in gcs_join() and return from
it immediately if the node's communication channel has been closed.
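
The early return can be sketched like this in Python. The `Connection`
class, its states and the ENOTCONN return value are assumptions chosen
for illustration; the ticket does not state which error code the real
gcs_join() returns.

```python
import errno
from enum import Enum, auto


class ConnState(Enum):
    OPEN = auto()
    CLOSED = auto()  # e.g. after the node was evicted


class Connection:
    """Toy model of a group connection with a guarded join()."""

    def __init__(self):
        self.state = ConnState.OPEN
        self.join_attempts = 0

    def _join(self, seqno):
        # Stand-in for the internal _join(); here it always succeeds.
        self.join_attempts += 1
        return 0

    def join(self, seqno):
        # Check the connection state first and return immediately if
        # the communication channel has been closed.
        if self.state is ConnState.CLOSED:
            return -errno.ENOTCONN  # assumed error code
        return self._join(seqno)
```

The point of the guard is that `_join()` is never reached for an
evicted node, so the failing join and the follow-up IST attempt
described above are not triggered.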

4) If SST fails due to a network error, the node that acted
as a donor sometimes does not return to its original state,
which leads to failure due to the inability to continue
the test execution (due to a timeout).

If sst_sent() fails, the node should restore itself to the joined
state. The sst_sent() function can fail, commonly due to network
errors, where the DONOR may lose connectivity to the JOINER (or
the existing cluster). But on re-join it should restore the original
state without waiting for a transition to the JOINER state. An SST
failure on the JOINER will gracefully shut down the joiner.
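
As a rough illustration of the intended donor behaviour, the following
Python sketch models only the state change. The `DonorNode` class and
the convention that a negative `rcode` means failure are assumptions,
not the Galera implementation.

```python
from enum import Enum


class NodeState(Enum):
    JOINED = "JOINED"
    DONOR = "DONOR"


class DonorNode:
    """Toy model of a node that temporarily acts as an SST donor."""

    def __init__(self):
        self.state = NodeState.JOINED

    def begin_donation(self):
        self.state = NodeState.DONOR

    def sst_sent(self, rcode):
        """Finish an SST; rcode < 0 is assumed to mean failure."""
        if self.state is not NodeState.DONOR:
            raise RuntimeError("sst_sent() called while not donating")
        # Whether the transfer succeeded or failed, the donor must not
        # remain in DONOR, otherwise it never becomes usable again.
        self.state = NodeState.JOINED
        return rcode >= 0
```

In this model a failed transfer leaves the donor in JOINED, so a test
that waits for the donor to become usable again does not time out.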

Comment by Jan Lindström (Inactive) [ 2019-12-05 ]

Yurchenko, can you also review the changes, please?

Comment by Julius Goryavsky [ 2020-01-21 ]

julien.fritsch I have ported these changes to the current versions (after review). I am now checking for regressions and will then commit the changes on GitHub.

Comment by Jan Lindström (Inactive) [ 2021-06-24 ]

SST issues were fixed in the major script cleanup.
