stephane@skysql.com, my test is intended to cover exactly the scenario you describe.
rdem, this is the setup I am trying to cover; I just omitted srv2 because it is not involved in the failover. Using named slave->master connections should not matter, I think. The test uses domain_id=1 for srv1 and domain_id=0 for srv3/srv4.
My understanding is the issue is with CHANGE MASTER on srv3 to replicate from srv4?
In my test, GTID 1-1-7 is filtered out. So srv4's binlog contains 1-1-6 and 1-1-8 but is missing GTID 1-1-7. srv3 has gtid_slave_pos="1-1-7", and it gets this error:
'Error: connecting slave requested to start from GTID 1-1-7, which is not in the master's binlog'
If this is not the error you are describing, let me know which error it is.
In sql/sql_repl.cc, there is code to disable exactly this error:
if (info->slave_gtid_ignore_duplicates && domain_gtid.seq_no < slave_gtid->seq_no) {
  continue;
}
That is why setting --gtid-ignore-duplicates=1 is needed. With this setting, your scenario is valid and should work. The errors are only to help users with incorrect domain_id configuration.
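To make the scenario concrete, here is a minimal C++17 sketch of the connect-time decision as I understand it from the description above. It is a toy model, not the logic in sql/sql_repl.cc: the `Gtid` struct, `find_start()` and `check_thread_scenario()` are hypothetical names made up for illustration, and a vector of GTIDs stands in for the master's binlog.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Simplified GTID in domain_id-server_id-seq_no form.
struct Gtid {
  std::uint32_t domain_id;
  std::uint32_t server_id;
  std::uint64_t seq_no;
};

// Hypothetical helper (not a server function). Returns the index in `binlog`
// of the first event to send to the slave, or std::nullopt to model the
// "not in the master's binlog" error.
std::optional<std::size_t> find_start(const std::vector<Gtid>& binlog,
                                      const Gtid& slave_pos,
                                      bool ignore_duplicates) {
  // Exact match: resume just after the requested GTID.
  for (std::size_t i = 0; i < binlog.size(); ++i) {
    const Gtid& g = binlog[i];
    if (g.domain_id == slave_pos.domain_id &&
        g.server_id == slave_pos.server_id && g.seq_no == slave_pos.seq_no) {
      return i + 1;
    }
  }
  // Requested GTID is missing and the user has not vouched for the domains.
  if (!ignore_duplicates) {
    return std::nullopt;
  }
  // Assumption: seq_no is strictly increasing within the domain, so it is
  // safe to resume at the first event with a larger seq_no in that domain.
  for (std::size_t i = 0; i < binlog.size(); ++i) {
    if (binlog[i].domain_id == slave_pos.domain_id &&
        binlog[i].seq_no > slave_pos.seq_no) {
      return i;
    }
  }
  return binlog.size();  // nothing newer in this domain yet
}

// The scenario from this thread: srv4's binlog lacks the filtered 1-1-7.
void check_thread_scenario() {
  const std::vector<Gtid> srv4_binlog = {{1, 1, 6}, {1, 1, 8}};
  const Gtid srv3_pos = {1, 1, 7};
  // Without the option, the connect is refused.
  assert(!find_start(srv4_binlog, srv3_pos, false).has_value());
  // With the option, replication resumes at 1-1-8 (index 1).
  assert(*find_start(srv4_binlog, srv3_pos, true) == 1u);
}
```

Calling `check_thread_scenario()` from a `main()` exercises both cases. The real server works from its binlog state rather than a flat list, so this only illustrates the intent of the check.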
When domain_id is configured correctly, --gtid-ignore-duplicates=1 is safe to use and should not lead to events being lost. It only ignores events that have the same domain_id as the previous event but a smaller seq_no.
To explain the misconfiguration these errors are there to prevent, imagine a user with your setup who did not configure distinct domain IDs (maybe after an upgrade from 5.5 to 10.0). The events from srv1 and srv3 would then duplicate each other's seq_no within the same domain, e.g.:
0-1-10, 0-1-11, 0-3-9, 0-3-12, 0-1-12, 0-3-13, ...
Now imagine that srv3 and srv4 filter out event 0-1-12. There is no way for srv3 and srv4 to know whether 0-1-12 should come before or after 0-3-12. Therefore the server code plays it safe and raises an error.
Setting --gtid-ignore-duplicates=1 means the user did configure domains correctly, and sequence numbers will always be strictly increasing within each domain_id. Then this problem cannot occur, and the error can be safely silenced.
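The "strictly increasing in each domain_id" condition can be checked mechanically. The sketch below is only an illustration: `Gtid` and `strictly_increasing_per_domain()` are hypothetical names, not server code. It applies the condition to the misconfigured stream from the example above and to the same events with srv1 moved into its own domain.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified GTID in domain_id-server_id-seq_no form.
struct Gtid {
  std::uint32_t domain_id;
  std::uint32_t server_id;
  std::uint64_t seq_no;
};

// Hypothetical helper: returns true if seq_no is strictly increasing within
// every domain_id, in the order the events appear in `stream`.
bool strictly_increasing_per_domain(const std::vector<Gtid>& stream) {
  std::map<std::uint32_t, std::uint64_t> last_seq;  // domain_id -> last seq_no
  for (const Gtid& g : stream) {
    auto it = last_seq.find(g.domain_id);
    if (it != last_seq.end() && g.seq_no <= it->second) {
      return false;  // out of order or duplicated within this domain
    }
    last_seq[g.domain_id] = g.seq_no;
  }
  return true;
}

void check_domain_examples() {
  // Single shared domain 0, as in the misconfigured example above.
  const std::vector<Gtid> shared_domain = {
      {0, 1, 10}, {0, 1, 11}, {0, 3, 9}, {0, 3, 12}, {0, 1, 12}, {0, 3, 13}};
  assert(!strictly_increasing_per_domain(shared_domain));

  // Same events, but srv1 writes in domain 1 and srv3 in domain 0.
  const std::vector<Gtid> separate_domains = {
      {1, 1, 10}, {1, 1, 11}, {0, 3, 9}, {0, 3, 12}, {1, 1, 12}, {0, 3, 13}};
  assert(strictly_increasing_per_domain(separate_domains));
}
```

In the shared-domain stream, 0-3-9 arrives after 0-1-11, so the ordering inside domain 0 is already ambiguous before any event is filtered. With separate domains, each domain's sequence is unambiguous on its own.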
I'm not sure this is documented anywhere outside the server source code, so your question/concerns are very valid.
Also, I'm not sure whether --gtid-ignore-duplicates will allow a slave to connect at a missing GTID if there is a switchover from srv1 to srv2 at the same time (e.g. if 1-1-7 is followed by 1-4-8 in my test, not by 1-1-8). If not, that may be a bug that should be fixed.
I'd agree with such a concept. Filtered-out events should not affect the slave state. We need a clearer policy for that. Thank you for pointing out this issue!