[MDEV-31239] Second-level slave hangs when master crashes between START and COMMIT ALTER

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.8.1
Fix Version/s: None
Component/s: Replication
Labels:
- replication

Description

Consider a replication topology A -> B -> C.

Master A runs ALTER TABLE with @@binlog_alter_two_phase=ON but crashes between START and COMMIT ALTER. The binlog of A thus contains START ALTER without a matching COMMIT ALTER.

Server B replicates START ALTER in an SA thread, which waits for COMMIT ALTER to be replicated. No COMMIT ALTER arrives, but I think B will abort the START ALTER because it detects from the FORMAT_DESCRIPTION_EVENT (or something like that) that A crashed. I need to look at that code a bit more to fully understand it.

But server C cannot see what happened on A; it just sees a dangling START ALTER from B without a matching COMMIT ALTER. Thus it will hang forever, waiting for a signal from a COMMIT ALTER event that never arrives.
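
To make the failure condition concrete, here is a toy model of the event stream as C might see it. It is only a sketch: the event names, the `(kind, alter_id)` tuple layout, and the function `dangling_start_alters` are assumptions made up for illustration and do not reflect MariaDB's actual binlog format or code.

```python
# Toy model only: event kinds and the (kind, alter_id) layout are hypothetical,
# chosen to illustrate the scenario, not taken from the server source.
def dangling_start_alters(events):
    """Return ids of START ALTER events that are never closed in the stream."""
    pending = []
    for kind, alter_id in events:
        if kind == "START_ALTER":
            pending.append(alter_id)
        elif kind in ("COMMIT_ALTER", "ROLLBACK_ALTER") and alter_id in pending:
            pending.remove(alter_id)
    return pending


# Stream as C would receive it from B: ALTER 1 completed normally,
# ALTER 2 was started on A just before A crashed.
stream_seen_by_c = [("START_ALTER", 1), ("COMMIT_ALTER", 1), ("START_ALTER", 2)]
print(dangling_start_alters(stream_seen_by_c))  # prints [2]
```

Any id left in the result corresponds to an SA thread on C that would keep waiting with its locks held.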

In my branch knielsen_start_alter, I have extended the test case rpl_start_alter_restart_master with a third-level slave, and this reproduces the hang described above.

How to fix? I think it's not trivial. Before START ALTER (and perhaps the new XA PREPARE/COMMIT replication semantics), it was a central assumption that every transaction is logged as a single event group and never spans multiple binlog files. Changing this requires carefully considering every corner of replication to see how it is affected by the change and how to make it work. It looks like this has not been done with respect to START ALTER.

It seems quite fragile how START ALTER will leave a worker thread ("SA thread") waiting with locks held until COMMIT ALTER is replicated. If anything happens so that the COMMIT ALTER event is not received, or something ends up unexpectedly waiting for the locks held by the SA thread, the slave will hang.

A couple of ideas come to my mind:

1. Somehow have B log an event in its binlog stream recording that the replication connection A -> B experienced a crash or restart, and replicate that downstream to C (but which GTID should be associated with this event?).

2. Maybe we need a mechanism to detect lock conflicts on DDL locks held by ALTER TABLE, similar to how row lock conflicts are detected in InnoDB for optimistic parallel replication? Then if an unexpected conflict occurs with something waiting for the START ALTER locks, we can conservatively abort the SA thread and revert to direct_commit_alter.

But these are just rough ideas at this point; they need more thought.
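
As a rough illustration of idea 1, the following toy sketch shows B closing any open START ALTER when it notices that A restarted. Everything here is hypothetical: the `MASTER_RESTART` marker, the `(kind, alter_id)` tuples, and the function `relay_with_restart_marker` are invented for the example, and the open question of which GTID the extra event would carry is not addressed.

```python
# Toy model only: not MariaDB code. "MASTER_RESTART" stands in for whatever
# signal B uses to detect that A crashed or restarted.
def relay_with_restart_marker(events_from_a):
    """Return the stream B would log for C, aborting alters orphaned by a restart of A."""
    pending = []
    out = []
    for kind, alter_id in events_from_a:
        if kind == "MASTER_RESTART":
            # Hypothetical step: B logs an explicit abort for every open START ALTER.
            out.extend(("ROLLBACK_ALTER", open_id) for open_id in pending)
            pending.clear()
            continue
        if kind == "START_ALTER":
            pending.append(alter_id)
        elif kind in ("COMMIT_ALTER", "ROLLBACK_ALTER") and alter_id in pending:
            pending.remove(alter_id)
        out.append((kind, alter_id))
    return out


print(relay_with_restart_marker([("START_ALTER", 7), ("MASTER_RESTART", None)]))
```

With such a marker in B's stream, C would see every START ALTER paired with either a commit or an abort, so its SA thread would have something to wake up on.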

- Kristian.

Activity

Jimmy Hú added a comment - 2025-03-23 04:09 - edited

1. Somehow in B log in the binlog stream an event that the replication connection A -> B experienced a crash or restart, and replicate that downstream to C (but which GTID to associate with this event?).

Log an explicit ROLLBACK in B? This extra statement might violate GTID Strict Mode though.
The same goes for A explicitly logging ROLLBACK when it reboots (like with MEMORY tables, MDEV-29796) if there’s a server @ where @ -> A.

Then if an unexpected conflict occurs with something waiting for the START ALTER locks, we can conservatively abort the SA thread and revert to direct_commit_alter.

Add a force mode to STOP REPLICA SQL_THREAD?

People

Assignee: Unassigned

Reporter: Kristian Nielsen

Votes: 0

Watchers: 2

Dates

Created: 2023-05-10 18:41

Updated: 2025-03-23 04:15

MariaDB Server