Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
10.0.2
None
None
Description
I've attached the test exposing the problem.
What the test does: it sets up the replication topology 1->2, 1->3, executes some statements, shuts down server 3, performs a failover to make the topology 2->1, executes some more statements, rotates the binlogs, and connects server 3 as a slave to server 2. At this point server 2 hits the assertion at sql/sql_repl.cc:1001.
The problem is that the GTID implementation has an inconsistent view of what a replication domain is and what data constitutes the replication domain state. For example, contains_all_slave_gtid() assumes that Gtid_list contains only one GTID per domain, yet it compares seq_no only when the GTID has the same server_id as the one requested by the slave, which is wrong. I wrote the test to catch exactly this, but discovered that Gtid_list actually contains one GTID for each domain/server pair, which seems even more wrong: how long will an old server_id that no longer exists be kept and passed from one Gtid_list to the next on each binlog rotation?
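To illustrate the accumulation described above, here is a minimal C++ sketch under simplifying assumptions: the `Gtid` struct and the functions `list_per_pair()` and `list_per_domain()` are hypothetical and do not correspond to MariaDB's `rpl_gtid` or binlog state code. Each function builds the list for the next binlog file from the previous file's list plus the new events, once keyed by (domain_id, server_id) and once keyed by domain_id alone.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical simplified GTID (domain_id-server_id-seq_no); not MariaDB's rpl_gtid.
struct Gtid {
  uint32_t domain_id;
  uint32_t server_id;
  uint64_t seq_no;
};

// Keeps the last GTID for every (domain_id, server_id) pair, the shape of
// Gtid_list observed in this report. An entry for a server that stopped
// writing to the domain is copied into every following list and never dropped.
std::vector<Gtid> list_per_pair(const std::vector<Gtid>& prev_list,
                                const std::vector<Gtid>& events) {
  std::map<std::pair<uint32_t, uint32_t>, Gtid> state;
  for (const Gtid& g : prev_list) state[{g.domain_id, g.server_id}] = g;
  for (const Gtid& g : events) state[{g.domain_id, g.server_id}] = g;
  std::vector<Gtid> out;
  for (const auto& kv : state) out.push_back(kv.second);
  return out;
}

// Keeps only the last GTID per domain_id, as proposed in this report.
std::vector<Gtid> list_per_domain(const std::vector<Gtid>& prev_list,
                                  const std::vector<Gtid>& events) {
  std::map<uint32_t, Gtid> state;
  for (const Gtid& g : prev_list) state[g.domain_id] = g;
  for (const Gtid& g : events) state[g.domain_id] = g;
  std::vector<Gtid> out;
  for (const auto& kv : state) out.push_back(kv.second);
  return out;
}
```

For example, if server 1 writes 0-1-1 and 0-1-2, a failover happens, and server 2 then writes 0-2-3 and 0-2-4 across two more rotations, the per-pair list still carries 0-1-2 next to 0-2-4, while the per-domain list holds only 0-2-4.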
mysqld hits the assertion in the test because gtid_find_binlog_file() makes the same assumptions as contains_all_slave_gtid(), but is written in a way that prevents it from doing any work if a domain_id is repeated in Gtid_list.
I think the fix should be to remove the server_id comparisons from places like gtid_find_binlog_file() and contains_all_slave_gtid(), and to make sure that Gtid_list contains only one GTID per domain. After all, server_id exists only to de-duplicate events with the same seq_no that belong to alternate futures.
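The proposed comparison can be sketched as a per-domain check that ignores server_id. This is a hypothetical, simplified predicate (`slave_has_all_before_file()` and `Gtid` are illustrative names, not MariaDB's `contains_all_slave_gtid()`), assuming the Gtid_list holds at most one GTID per domain and that a domain missing from the slave's request means the search must continue into older binlog files.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical simplified GTID; not MariaDB's rpl_gtid.
struct Gtid {
  uint32_t domain_id;
  uint32_t server_id;  // present, but deliberately not consulted below
  uint64_t seq_no;
};

// Returns true if, for every domain in the Gtid_list written at the start of a
// binlog file, the slave already has everything up to that GTID, so it could
// start reading from this file. Only seq_no is compared within a domain.
bool slave_has_all_before_file(
    const std::vector<Gtid>& file_gtid_list,
    const std::unordered_map<uint32_t, Gtid>& slave_pos) {
  for (const Gtid& in_list : file_gtid_list) {
    auto it = slave_pos.find(in_list.domain_id);
    if (it == slave_pos.end())
      return false;  // assumption: slave has nothing from this domain, look further back
    if (it->second.seq_no < in_list.seq_no)
      return false;  // slave is missing events from before this file in this domain
  }
  return true;
}
```

For instance, with a Gtid_list of 0-2-12 and a slave requesting 0-1-10, the function returns false because the slave is missing seq_no 11 and 12. A check that compares seq_no only when the server_ids match would skip the comparison here, since 1 != 2.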
Hm, that assert is wrong:
const rpl_gtid *gtid= state->find(glev->list[i].domain_id);
if (!gtid)
{
  /* contains_all_slave_gtid() would have returned false if so. */
  DBUG_ASSERT(0);
  continue;
}
It is correct in the first iteration of the loop, but as we delete from the
`state' hash in the loop, the condition may no longer hold in subsequent
iterations of the loop. So the assert should just be removed.
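The failure mode can be reproduced in a simplified model. The sketch below is hypothetical (`count_missing_lookups()` and `Gtid` are illustrative names, and the real loop does more than erase each matched domain); it assumes, as observed in the description, that the Gtid_list may carry two entries for the same domain. The first entry finds the domain and erases it, so the second lookup fails at the point where `DBUG_ASSERT(0)` sat.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical simplified GTID; not MariaDB's rpl_gtid.
struct Gtid {
  uint32_t domain_id;
  uint32_t server_id;
  uint64_t seq_no;
};

// Looks up each Gtid_list entry's domain in a copy of the slave state and
// erases the domain once handled. Returns how many lookups found nothing,
// i.e. how many times the removed DBUG_ASSERT(0) would have fired.
int count_missing_lookups(const std::vector<Gtid>& gtid_list,
                          std::unordered_map<uint32_t, Gtid> state) {
  int missing = 0;
  for (const Gtid& entry : gtid_list) {
    auto it = state.find(entry.domain_id);
    if (it == state.end()) {
      // Original code asserted here; this model counts the miss and skips it.
      ++missing;
      continue;
    }
    state.erase(it);  // erasing inside the loop is what breaks the assumption
  }
  return missing;
}
```

With a list of 0-1-2 and 0-2-4 and a slave state containing only domain 0, the function returns 1; with a single entry per domain it returns 0.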
I will check that everything else is ok with the supplied test case (thanks
for supplying that, made it very easy to reproduce the problem).