A user testing parallel replication is seeing the server hang.
From debugging, it appears that the following happens.
One batch of group-committed transactions from the master all reach their
commit phase, say T1 and T2.
This causes the transactions in the following batch, say T3, T4, and T5, to
wake up and start running.
At this point, T2 (say) gets a deadlock or other temporary error, and needs to
be rolled back and retried. This involves unmark_start_commit(), decrementing
the count of transactions that have already reached their commit step.
Now when T3, T4, and T5 reach their commit step, they do mark_start_commit().
But a following T6 is not woken up, because T2 has not yet done
mark_start_commit().
Then when T2 has been successfully retried and done mark_start_commit(), the
bug is that T6 is not woken up. The wakeup is lost. This is because T2 only
considers the batch with T3-T5 for wakeup, not following batches.
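
The sequence above can be replayed in a small single-threaded model. This is only a sketch of the scenario as described, not the server code: the Batch and Entry classes, the wait_count field and replay_scenario() are hypothetical names introduced for illustration. It assumes each batch may start once a shared counter of committing transactions reaches that batch's threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Batch:
    """One group-committed batch (hypothetical model, not server code)."""
    name: str
    wait_count: int  # commits that must have started before this batch may run
    woken: bool = False
    next: Optional["Batch"] = None


class Entry:
    """Holds the shared count of transactions that reached their commit step."""

    def __init__(self) -> None:
        self.count_committing = 0

    def mark_start_commit(self, batch: Batch) -> None:
        self.count_committing += 1
        nxt = batch.next
        # Behaviour described in the report: only the batch immediately
        # following the caller's own batch is considered for wakeup.
        if nxt is not None and nxt.wait_count <= self.count_committing:
            nxt.woken = True

    def unmark_start_commit(self) -> None:
        self.count_committing -= 1


def replay_scenario() -> dict:
    b3 = Batch("T6", wait_count=5)
    b2 = Batch("T3-T5", wait_count=2, next=b3)
    b1 = Batch("T1-T2", wait_count=0, woken=True, next=b2)
    entry = Entry()

    entry.mark_start_commit(b1)      # T1 reaches commit
    entry.mark_start_commit(b1)      # T2 reaches commit, T3-T5 are woken
    entry.unmark_start_commit()      # T2 hits a temporary error, rolls back
    for _ in range(3):
        entry.mark_start_commit(b2)  # T3, T4, T5: count is too low to wake T6
    entry.mark_start_commit(b1)      # T2 retried: only looks at the T3-T5 batch

    return {
        "count_committing": entry.count_committing,
        "t3_t5_woken": b2.woken,
        "t6_woken": b3.woken,
    }
```

In this model, replay_scenario() ends with the counter at 5, which is enough for T6 to run, yet the T6 batch was never woken.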
It seems fairly certain that this is the scenario that the user
experienced. It is however unknown at this point how it is possible for T2 to
get a deadlock error after it has run all of its queries and has
started the commit step. (The idea is that if there was a deadlock with T1,
then T1 will be blocked from proceeding to mark_start_commit() until T2 has
done rollback; and T2 does unmark_start_commit() before its rollback).
The fix should be to make sure that this case, of T2 retrying after T3-T5 have
started running, is handled correctly: When T2 completes its retry, all
following and possibly waiting transactions should be considered, so the
wakeup is not lost.
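
One possible shape of such a fix, again as a hedged single-threaded sketch rather than the actual patch: when a transaction marks its commit start, walk every following batch and wake each one whose threshold has been reached. The Batch and CommitTracker classes, the wait_count field and replay_with_fix() are hypothetical names, and the model assumes thresholds never decrease from one batch to the next.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Batch:
    # Hypothetical model: commits that must have started before this batch runs.
    wait_count: int
    woken: bool = False


class CommitTracker:
    """Toy stand-in for the shared commit counter (not server code)."""

    def __init__(self, batches: List[Batch]) -> None:
        self.batches = batches  # in commit order, wait_count non-decreasing
        self.count_committing = 0

    def unmark_start_commit(self) -> None:
        self.count_committing -= 1

    def mark_start_commit(self, batch_index: int) -> None:
        self.count_committing += 1
        # Consider every following batch, not only the next one.
        for later in self.batches[batch_index + 1:]:
            if later.wait_count > self.count_committing:
                break  # batches after this one need even more commits
            later.woken = True


def replay_with_fix() -> Tuple[bool, bool]:
    # Batches: T1-T2 (already running), T3-T5, T6.
    batches = [Batch(0, woken=True), Batch(2), Batch(5)]
    tracker = CommitTracker(batches)

    tracker.mark_start_commit(0)      # T1
    tracker.mark_start_commit(0)      # T2, wakes T3-T5
    tracker.unmark_start_commit()     # T2 rolls back
    for _ in range(3):
        tracker.mark_start_commit(1)  # T3, T4, T5
    t6_woken_before_retry = batches[2].woken
    tracker.mark_start_commit(0)      # T2 retried, now also reaches T6
    return t6_woken_before_retry, batches[2].woken
```

In this model T6 stays asleep while T2 is being retried, and is woken as soon as T2 completes its retry.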
The user-visible effect of this hang is that at least one replication worker
thread is stuck in state "Waiting for prior transaction to start commit
before starting next transaction", and all other worker threads are stuck in
this state or in the state "Waiting for prior transaction to commit" (as seen
in SHOW PROCESSLIST). Killing the worker threads will stop replication, and it
can then be re-started successfully.
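
The symptom can be expressed as a simple check over the thread states of the replication worker threads. This is an illustrative sketch only: matches_hang_signature() is a hypothetical helper, and it assumes the caller has already collected the state strings of the worker threads (for example from the SHOW PROCESSLIST output). Both states can presumably also be seen briefly on a healthy server, so a single matching snapshot is a hint rather than proof of the hang.

```python
from typing import Iterable

START_COMMIT_WAIT = (
    "Waiting for prior transaction to start commit "
    "before starting next transaction"
)
COMMIT_WAIT = "Waiting for prior transaction to commit"


def matches_hang_signature(worker_states: Iterable[str]) -> bool:
    """Return True if the worker thread states match the reported symptom."""
    states = list(worker_states)
    if not states:
        return False
    # At least one worker waits for a prior transaction to start its commit...
    has_start_commit_waiter = START_COMMIT_WAIT in states
    # ...and every worker is in one of the two waiting states.
    all_waiting = all(s in (START_COMMIT_WAIT, COMMIT_WAIT) for s in states)
    return has_start_commit_waiter and all_waiting
```

If the check keeps returning True across several snapshots taken some time apart while the replicated position does not advance, that would be consistent with the lost wakeup described here.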