[MDEV-20996] Maxscale auto-failover with semi-sync replication is not providing a true HA solution - Jira

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Duplicate
Fix Version/s: N/A
Component/s: Replication
Labels: None

Description

We have been using maxscale-2.3 with the mariadbmon monitor and auto-failover as our HA solution with 3 database nodes (Master/Slave/Slave). Under realistic traffic you MUST use semi-sync replication to make this viable; otherwise, nearly 100% of the time a failed Master will not come back as a Slave without a 1236 error, because transactions were committed to the storage engine before they had been replicated to any slave.

Therefore, we use semi-sync replication with wait_point AFTER_SYNC (see https://mariadb.com/kb/en/library/semisynchronous-replication/#configuring-the-master-wait-point). Even so, there are known issues with semi-sync replication after a master failure or crash that lead to the same problem: the Master does not come back as a Slave because a prepared transaction is committed by automatic crash recovery. We tried working around this by performing an automatic "Manual heuristic recovery rollback", but that did not prevent the transaction from going through after the failed master came back, and we still got the 1236 replication error.
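To make the failure sequence concrete, here is a minimal toy model of it. It is only a sketch, not MariaDB or MaxScale code: the function name `rejoin_after_crash`, the `(server_id, sequence)` tuples standing in for GTIDs, and the "history must be a prefix" rejoin rule are all simplifying assumptions made for illustration.

```python
# Toy model, not MariaDB code. A "GTID" here is just (server_id, sequence).

def rejoin_after_crash(slave_acked: bool) -> str:
    """Model a master crash at the AFTER_SYNC wait point, then its rejoin."""
    old_master = [(1, 1), (1, 2)]   # binlog of the original master (server 1)
    slave = list(old_master)        # a slave that is fully caught up

    # AFTER_SYNC: the master writes the transaction to its binlog first,
    # then waits for a slave acknowledgement before the engine commit.
    old_master.append((1, 3))
    if slave_acked:
        slave.append((1, 3))

    # The master crashes while waiting. The monitor promotes the slave
    # (server 2), which then accepts a new write of its own.
    new_master = slave + [(2, len(slave) + 1)]

    # On restart, automatic crash recovery commits the prepared transaction
    # because it is present in the binlog, so (1, 3) stays on the old master.
    # Simplified rejoin rule (assumption): the old master can replicate only
    # if its history is a prefix of the new master's history.
    if new_master[:len(old_master)] == old_master:
        return "rejoin ok"
    return "error 1236"


if __name__ == "__main__":
    print(rejoin_after_crash(slave_acked=True))
    print(rejoin_after_crash(slave_acked=False))
```

In this model the rejoin only fails when the crash happens before any slave has acknowledged the transaction, which is the window described above.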

I am aware of MENT-203 (resulting from MDEV-19733), but it is queued as a feature request, which may have been fine before maxscale started supporting auto-failover as an HA solution. Now that maxscale supports auto-failover for HA, this is a bug, and it prevents maxscale with auto-failover from truly being a robust HA solution.

Maybe a short-term solution would be to allow the user to disable automatic crash recovery? I am not sure this would be viable in the long term, but we are looking for a way to make this more reliable until a true solution is provided.

Issue Links

causes: MXS-2775 Document that a crashed master can break auto_rejoin with semisynchronous replication (Closed)

duplicates: MDEV-21117 refine the server binlog-based recovery for semisync (Closed)

relates to: MXS-2542 Add rebuild server to MariaDB Monitor (Closed)

People

Assignee: Unassigned

Reporter: Richard Lane

Votes: 2

Watchers: 13

Dates

Created: 2019-11-06 15:42

Updated: 2024-07-07 22:33

Resolved: 2021-04-08 13:39
