There appears to be a regression in parallel replication in 10.1, caused by a
race condition introduced with the patch for optimistic parallel replication.
The gco->installed flag was changed to be a bit in the gco->flags field. This,
however, is a serious error, as gco->flags is modified by the SQL driver thread
without locking. This can race with worker threads updating the INSTALLED bit.
The result is that modifications to gco->flags can be lost, corrupting the
internal state. From the user's point of view, this could show up as various
problems, depending on the exact timing.
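To illustrate the kind of lost update described above, here is a minimal
sketch that replays one unlucky interleaving step by step in a single thread,
so the outcome is deterministic. It is not the server code: the bit values,
the OTHER_FLAG name, and the helper functions are assumptions made up for this
illustration, and keeping the installed state in its own field is shown only
as one possible way to avoid the shared read-modify-write.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit values, chosen only for this illustration.
constexpr uint8_t OTHER_FLAG = 0x01;  // stands in for any bit set by the SQL driver thread
constexpr uint8_t INSTALLED  = 0x02;  // bit updated by a worker thread

// Replays two unlocked read-modify-write sequences on one shared flags
// byte, interleaved the way two racing threads could execute them.
uint8_t replay_shared_flags()
{
  uint8_t flags = 0;

  uint8_t driver_copy = flags;   // driver thread loads flags
  uint8_t worker_copy = flags;   // worker thread loads flags
  worker_copy |= INSTALLED;
  flags = worker_copy;           // worker stores: INSTALLED is now set
  driver_copy |= OTHER_FLAG;
  flags = driver_copy;           // driver stores its stale copy: INSTALLED is lost

  return flags;
}

// The same interleaving when the installed state has its own variable.
struct SeparateState
{
  bool installed;
  uint8_t flags;
};

SeparateState replay_separate_fields()
{
  SeparateState s = {false, 0};

  uint8_t driver_copy = s.flags; // driver thread loads flags
  s.installed = true;            // worker thread writes only its own field
  driver_copy |= OTHER_FLAG;
  s.flags = driver_copy;         // driver store cannot overwrite installed

  return s;
}

// Checks both outcomes of the replay.
void self_check()
{
  assert((replay_shared_flags() & INSTALLED) == 0);  // update was lost
  assert(replay_separate_fields().installed);        // update survived
}
```

In the first function the worker's update disappears because the driver writes
back a value it read before the worker's store. In the second, the two threads
would be writing distinct memory locations, so neither store can undo the
other; any cross-thread visibility or ordering guarantees would still need to
be provided separately.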
One user reported that the maximum number of replication retries was exceeded,
after which the slave hung. The error log showed something similar to the
following, and then all slave worker threads were stuck, with replication no
longer progressing:
I believe (though I cannot know for sure) that this was caused by this bug. It
is also possible that the bug could manifest itself in other ways, probably
related to transactions running in the wrong order and possibly conflicting,
or to slave worker threads hanging waiting for the wrong transaction.