This bug happens on the slave when a binlog from the master ends with a
partially written event group (for example, one that has BEGIN but is
missing COMMIT). Such a partial event group occurs if the master crashes in
the middle of writing to the binlog.
The slave detects this when the restart format description event in the
following binlog file is received. A worker thread that is in the middle of
replicating the partial event group must be notified so that it can roll back.
The bug was that this notification could be lost, depending on thread
scheduling. If lost, the worker thread would then wait indefinitely for the
rest of the transaction to arrive, and the SQL thread in turn would wait for
the worker thread to complete the rollback, deadlocking the slave.
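
The general pattern for a notification that cannot be lost can be sketched
as follows. This is an illustration only, not the server code: the class
WorkerState, its methods, and the helper replicate() are hypothetical names
invented for this sketch. The idea shown is that the notification is
recorded as a flag under the same mutex that the worker holds when it
decides to wait, so the worker sees it no matter how the threads are
scheduled.

```python
import threading


class WorkerState:
    """Hypothetical stand-in for the state shared with one worker thread."""

    def __init__(self):
        self.cond = threading.Condition()
        self.events = []          # queued events of the current event group
        self.abort_group = False  # True once the group can never complete

    def queue_event(self, event):
        with self.cond:
            self.events.append(event)
            self.cond.notify_all()

    def notify_partial_group(self):
        # Record the notification as state under the lock, then wake the
        # worker. A bare wakeup with no flag would be lost if the worker
        # were not yet waiting.
        with self.cond:
            self.abort_group = True
            self.cond.notify_all()

    def run_group(self):
        while True:
            with self.cond:
                # The predicate is checked under the lock before sleeping,
                # so a flag set earlier is always observed.
                self.cond.wait_for(
                    lambda: bool(self.events) or self.abort_group)
                if not self.events:
                    return "rolled back"
                event = self.events.pop(0)
            if event == "COMMIT":
                return "committed"


def replicate(events, partial):
    """Run one event group through a worker and return the outcome."""
    state = WorkerState()
    if partial:
        # Worst-case scheduling: the notification arrives before the
        # worker thread has even started.
        state.notify_partial_group()
    for event in events:
        state.queue_event(event)
    result = []
    worker = threading.Thread(
        target=lambda: result.append(state.run_group()), daemon=True)
    worker.start()
    worker.join(timeout=5)
    return result[0] if result else "hung"
```

In this sketch, replicate(["BEGIN", "INSERT"], partial=True) ends in a
rollback even though the notification was sent before the worker began
waiting, which is the scheduling order under which a bare wakeup would be
missed.
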
This bug is likely what was seen by a user in a hard-to-reproduce hang.
It is also the cause of the sporadic failure in Buildbot in