[MDEV-4473] GTID doesn't work in face of master failovers Created: 2013-05-03 Updated: 2013-05-27 Resolved: 2013-05-03 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | 10.0.2 |
| Fix Version/s: | 10.0.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Pavel Ivanov | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
I've attached the test exposing the problem. What test does: sets up rpl topology 1->2,1->3, executes some statements, shuts down server 3, performs failover to make topology 2->1, executes some more statements, rotates binlogs, connects server 3 as a slave to 2. At this point server 2 hits ASSERT at sql/sql_repl.cc:1001. The problem is GTID implementation has an inconsistent view of what is replication domain and what data constitutes the replication domain state. E.g. function contains_all_slave_gtid() assumes that Gtid_list contains only one gtid by domain yet compares seq_no only if gtid contains the same server_id as requested by slave which is wrong. And this "wrong" is what I wanted to catch with my test but discovered that Gtid_list actually contains one gtid for each domain-server pair which seems to be even more wrong – for how long non-existent old server_id will be stored and passed from one Gtid_list to another on each binlog rotation? mysqld hits ASSERT in the test because gtid_find_binlog_file() makes the same assumptions as contains_all_slave_gtid() but is written in a way prohibiting any work if domain_id is repeated in Gtid_list. I think the fix should be to remove server_id comparisons from the places like gtid_find_binlog_file() and contains_all_slave_gtid() and make sure that Gtid_list contains only one gtid per domain. After all server_id exists only for de-duplication of events with the same seq_no belonging to alternate futures. |
| Comments |
| Comment by Kristian Nielsen [ 2013-05-03 ] |
|
Hm, that assert is wrong: const rpl_gtid *gtid= state->find(glev->list[i].domain_id); It is correct in the first iteration of the loop, but as we delete from the I will check that everything else is ok with the supplied test case (thanks |
| Comment by Pavel Ivanov [ 2013-05-03 ] |
|
Note though contains_all_slave_gtid() returns true for slave_log.00003 when server 3 tries to connect to server 2 which is wrong because some events become skipped in slave_log.00002. |
| Comment by Kristian Nielsen [ 2013-05-03 ] |
|
Yes, I've noticed, thanks. I'll fix. |
| Comment by Kristian Nielsen [ 2013-05-03 ] |
|
BTW, I noticed that your test has --sync_with_master in server_3, but not server_2 (for the initial events on server 1). I assume this is just an oversight, because without it the test becomes non-deterministic (it will fail if server 2 slave threads happen to stop before they replicated everything from server 1) |
| Comment by Kristian Nielsen [ 2013-05-03 ] |
|
Ok, so there was a deeper issue here. Suppose binlog file X has in its Gtid_list_event: 0-1-3,0-2-5, and suppose the In this case the bug was that master would start sending events from the start On the other hand, if the slave requested 0-2-5, then it is correct to start Fixed by checking if the gtid requested by slave that matches a gtid in the I pushed the fix to 10.0-base. |
| Comment by Pavel Ivanov [ 2013-05-03 ] |
|
Kristian, And am I right that if some server was master, then failed over to another master and then ceased to exist altogether gtid with its server_id will exist indefinitely in the gtid_list in the binlogs of every server? Don't you think it's a problem and a way to accumulate a lot of unnecessary garbage? How will everything behave if gtid_list contains 1000 or 10,000 different gtid records? |
| Comment by Pavel Ivanov [ 2013-05-03 ] |
|
OK, I see your response on maria-developers list. I'll move the discussions there. |