[MDEV-20996] Maxscale auto-failover with semi-sync replication is not providing a true HA solution Created: 2019-11-06  Updated: 2021-09-24  Resolved: 2021-04-08

Status: Closed
Project: MariaDB Server
Component/s: Replication
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Richard Lane Assignee: Unassigned
Resolution: Duplicate Votes: 2
Labels: None

Issue Links:
Duplicate
duplicates MDEV-21117 refine the server binlog-based recove... Closed
Problem/Incident
causes MXS-2775 Document that a crashed master can br... Closed
Relates
relates to MXS-2542 Add rebuild server to MariaDB Monitor Closed

 Description   

We have been using maxscale-2.3 with the mariadbmon monitor and auto-failover as our HA solution, with 3 database nodes (Master/Slave/Slave). With real traffic, you realistically MUST use semi-sync replication to make this viable; otherwise, nearly 100% of the time a failed master will not come back as a slave without a 1236 error, due to transactions committed to the storage engine that had not yet been replicated to any slave.
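For reference, a minimal mariadbmon monitor section of the kind described above might look roughly like this (server names, credentials, and values are illustrative, not taken from our deployment):

```ini
# maxscale.cnf - MariaDB Monitor with auto-failover (MaxScale 2.3)
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2,server3
user=maxscale_monitor
password=secret
auto_failover=true      # promote a slave when the master goes down
auto_rejoin=true        # try to rejoin the old master as a slave
```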

Therefore, we use semi-sync replication with wait_point AFTER_SYNC. Given this, see https://mariadb.com/kb/en/library/semisynchronous-replication/#configuring-the-master-wait-point. There are known issues with semi-sync replication after a master failure/crash which result in the same problem: the master does not come back as a slave, due to a prepared transaction that is committed by automatic crash recovery. We tried working around this by performing an automatic "manual heuristic recovery rollback", but that did not prevent the transaction from going through after the failed master came back, and we still got the 1236 replication error.
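The semi-sync settings in question look roughly like this (a sketch for MariaDB 10.3+, where semi-sync support is built into the server; the timeout value is illustrative):

```ini
# my.cnf on the master
[mariadb]
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_wait_point = AFTER_SYNC   # ack after binlog sync, before engine commit
rpl_semi_sync_master_timeout = 10000           # ms before degrading to async

# my.cnf on the slaves
rpl_semi_sync_slave_enabled = ON
```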

I am aware of MENT-203 (resulting from MDEV-19733), but it is in the queue as a feature request, which may have been fine before maxscale started supporting auto-failover as an HA solution. Now that maxscale does support auto-failover, however, this is a bug, and it prevents maxscale with auto-failover from truly being a robust HA solution.

Maybe a short-term solution would be to allow the user to disable automatic crash recovery? I am not sure this would be viable long term, but we are also looking for a way to make this more reliable before a true solution is provided.



 Comments   
Comment by Andrei Elkin [ 2019-11-20 ]

rvlane (ralf.gebhardt@mariadb.com): I am sorry, I scrapped my text as it implied something like sending to the slaves beyond prior binlogging (feasible, but not something we should do). I am thinking over this matter..

Comment by Richard Lane [ 2019-11-20 ]

I understand that binlogging after the slave ack is something you may not want to do, but that type of solution is exactly what I am looking for in this ticket (not logging on the failed master when the transaction will never reach a semi-sync slave). I'll wait for further discussion.

Comment by Andrei Elkin [ 2019-11-20 ]

rvlane, (ralf.gebhardt@mariadb.com):

The slaves in your setup, as I understand it, go ahead and elect a new master which will not have some sequence of transactions that were "lost" in the network. Those transactions were never committed, so I gather it is fine (though not perfect) that their users may find the transactions have to be retried against the new master.

The old master then needs a mode of recovery with two things combined:

  • --tc-heuristic-recover=rollback and
  • wipe out the binlogged-but-not-committed range from its binlog

Upon recovery the restarted old master will be consistent with the new one.

How does this look to you?
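The two combined steps might be sketched roughly as follows (a sketch only; the binlog file name and position are placeholders, and today step 2 has no supported tooling, which is the gap this ticket is about):

```shell
# 1. Heuristically roll back transactions that were prepared but not committed
mysqld --tc-heuristic-recover=ROLLBACK

# 2. Wipe the binlogged-but-not-committed range before restarting;
#    inspect with mysqlbinlog to find where the committed events end
mysqlbinlog --start-position=<last-committed-pos> mysql-bin.NNNNNN
```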

Comment by Andrei Elkin [ 2019-11-20 ]

rvlane, (ralf.gebhardt@mariadb.com): there could be another option: store the acknowledgment in the engine. InnoDB already provides a similar service on the binlog-less slave to maintain the file:pos of the last committed trx. I need to speak to marko about whether it is realistic to squeeze one more such pair into the engine's recoverable area.

Comment by Richard Lane [ 2019-11-20 ]

Yes, that sounds like it would work. We actually always run --tc-heuristic-recover=rollback before mysqld is started, to try to force a rollback of uncommitted transactions; we have found that without this there is a very high chance that the master will not come back to re-join as a slave. However, this does not cover all cases (I believe your point #2).

Our setup is maxscale with mariadbmon and auto-failover (master/slave/slave). What we need is an option to perform the rollback automatically on mysqld start (without us having to run it first before starting the service), plus whatever else is necessary to ensure that transactions that never made it to a semi-sync slave are rolled back, so that the failed master can automatically re-join as a slave (via maxscale).

Comment by Marko Mäkelä [ 2019-11-20 ]

Elkin, we’d have to discuss this some time next week. Before MariaDB Server 10.3 implemented MDEV-15158, InnoDB stored the latest binlog position in the TRX_SYS page in the system tablespace. Since then, it is being written to the rollback segment header page on transaction commit.

I don’t know if it is feasible to store the acknowledgements inside InnoDB rollback segment or undo pages. At the very least, the semantics should be clarified. And in any case, this should be tested with innodb_undo_log_truncate=ON.

Comment by Geoff Montee (Inactive) [ 2019-11-20 ]

If MXS-2542 were implemented, then this could also be fixed in MaxScale by automatically rebuilding the crashed master.

Comment by Andrei Elkin [ 2019-11-21 ]

The suggestion in https://jira.mariadb.org/browse/MDEV-20996?focusedCommentId=138318&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-138318 is, in other words, to make --tc-heuristic-recover=rollback replication-safe. Its current behaviour is not, as a rolled-back transaction may remain in the binlog.
See MDEV-21117 for more.

Comment by Max Mether [ 2020-04-28 ]

For a truly lossless HA solution you need MDEV-19140.
The current async or semi-sync replication solutions cannot provide this.

Comment by Ralf Gebhardt [ 2021-04-08 ]

This issue is addressed as part of MDEV-21117.

Generated at Thu Feb 08 09:03:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.