Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 10.8.1
Description
When a START ALTER event is applied, it is registered in mi->start_alter_list.
However, mi is the wrong place to have this list. There is no guarantee that
the matching COMMIT ALTER event will be applied in the context of the same
mi.
In multi-source replication A<->B, A->C, B->C with --gtid-ignore-duplicates,
C will receive duplicates of all events on the A and B master connections,
and it is random which copy will be applied and which one ignored. If START
ALTER runs in the context of A, but COMMIT ALTER runs in the context of B, then COMMIT ALTER
will not find the start_alter_info, and will try to do the full ALTER TABLE.
But this deadlocks, because the SA thread of START ALTER is holding the locks on
the table, waiting to be signalled from COMMIT ALTER.
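The failed lookup can be illustrated with a small model. This is only a sketch: `MasterInfo`, `StartAlterInfo`, `register_start_alter` and `find_start_alter` are simplified stand-ins invented for illustration, not the actual server definitions, and the locking and SA thread are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>

// Simplified stand-ins for the server structures. All names and fields here
// are illustrative assumptions, not the real MariaDB definitions.
struct StartAlterInfo {
    std::uint32_t domain_id;
    std::uint64_t sa_seq_no;  // sequence number of the START ALTER GTID
};

struct MasterInfo {
    std::string connection_name;
    std::list<StartAlterInfo> start_alter_list;  // one list per connection
};

// Models applying START ALTER in the context of `mi`.
void register_start_alter(MasterInfo &mi, std::uint32_t domain_id,
                          std::uint64_t seq_no) {
    mi.start_alter_list.push_back({domain_id, seq_no});
}

// Models the lookup done when COMMIT ALTER is applied in the context of `mi`.
// Returns nullptr when this connection has no matching START ALTER.
const StartAlterInfo *find_start_alter(const MasterInfo &mi,
                                       std::uint32_t domain_id,
                                       std::uint64_t seq_no) {
    for (const StartAlterInfo &info : mi.start_alter_list)
        if (info.domain_id == domain_id && info.sa_seq_no == seq_no)
            return &info;
    return nullptr;
}

// START ALTER is applied via connection A, COMMIT ALTER via connection B.
void example_mismatch() {
    MasterInfo a{"A", {}};
    MasterInfo b{"B", {}};
    register_start_alter(a, 0, 100);
    assert(find_start_alter(a, 0, 100) != nullptr);  // A knows about it
    assert(find_start_alter(b, 0, 100) == nullptr);  // B does not
}
```

In the real server, the second lookup failing is what sends COMMIT ALTER down the full ALTER TABLE path while the SA thread still holds the table locks.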
I have a testcase on my knielsen_start_alter branch on GitHub (Jira<->GitHub
integration will hopefully keep it referenced over rebases). It currently
needs some more work, but does manage to reproduce the issue if run
a sufficient number of times.
Suggested fix: The start_alter_list needs to be global, shared between all
mi's (and rli's). Then some thought needs to be given to which pending START
ALTERs (and SA threads) to abort when one multi-source slave connection is
stopped. I think it is reasonable to stop those START ALTERs that originated
from the master connection that is stopped. This is overly conservative in
scenarios like the one described above, but that's probably ok.
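A rough sketch of what such a shared list could look like, assuming entries are keyed by (domain_id, sequence number) and remember the connection they originated from. `GlobalStartAlterRegistry` and its methods are hypothetical names for illustration only; the real fix would also have to signal and clean up the SA threads, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical server-wide registry of pending START ALTERs. This is a
// sketch of the idea only, not the actual MariaDB implementation.
class GlobalStartAlterRegistry {
public:
    using Key = std::pair<std::uint32_t, std::uint64_t>;  // (domain_id, sa_seq_no)

    // START ALTER applied via `origin`. Returns false if already registered.
    bool register_start(std::uint32_t domain_id, std::uint64_t seq_no,
                        const std::string &origin) {
        std::lock_guard<std::mutex> guard(lock_);
        return pending_.emplace(Key{domain_id, seq_no}, origin).second;
    }

    // COMMIT ALTER applied via any connection. Returns true if the matching
    // START ALTER was pending, and removes it from the registry.
    bool complete(std::uint32_t domain_id, std::uint64_t seq_no) {
        std::lock_guard<std::mutex> guard(lock_);
        return pending_.erase(Key{domain_id, seq_no}) == 1;
    }

    // A connection was stopped: drop the pending START ALTERs that
    // originated from it. Returns how many were aborted.
    std::size_t abort_for_connection(const std::string &origin) {
        std::lock_guard<std::mutex> guard(lock_);
        std::size_t aborted = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (it->second == origin) {
                it = pending_.erase(it);
                ++aborted;
            } else {
                ++it;
            }
        }
        return aborted;
    }

    std::size_t pending_count() const {
        std::lock_guard<std::mutex> guard(lock_);
        return pending_.size();
    }

private:
    mutable std::mutex lock_;
    std::map<Key, std::string> pending_;  // key -> originating connection
};

// START ALTER arrives via A; the COMMIT ALTER lookup no longer depends on
// which connection it is applied in.
void example_shared_lookup() {
    GlobalStartAlterRegistry registry;
    registry.register_start(0, 100, "A");
    assert(registry.complete(0, 100));
}
```

With this shape, stopping connection A aborts a START ALTER registered via A even if B could still have delivered the matching COMMIT ALTER, which is the conservative behaviour described above.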
I am working on a fix; it might require some time as it's not entirely trivial.
- Kristian.