[MDEV-9861] Network outage can break replication Created: 2016-04-01 Updated: 2020-10-20 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.1.11 |
| Fix Version/s: | 10.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Igor Pashev | Assignee: | Andrei Elkin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Description |
|
We use parallel row-based replication for a few channels (20 ¯_(ツ)_/¯). Almost every time we reboot a master server (MySQL 5.6 @ RDS) or have a network outage, we get some HA_ERR_FOUND_DUPP_KEY errors right after replication resumes (or "a foreign key constraint fails" in statement-based replication). Skipping those errors produces more errors, and idempotent mode leads to other errors. Since a network outage is an expected kind of event, I'd consider this behaviour a bug. Our configuration:
|
| Comments |
| Comment by Elena Stepanova [ 2016-05-02 ] |

Could you please attach the slave error log which covers the time interval at least from the server startup till the replication failure (which means it will also naturally include the moment of the network outage or master restart)? By "a few channels" do you mean multi-master replication? Do you ensure that the replication channels are independent and do not conflict with each other? Thanks. |
| Comment by Igor Pashev [ 2016-05-02 ] |

> By "a few channels" do you mean multi-master replication?
> Do you ensure that the replication channels are independent and do not conflict with each other?
I haven't seen this issue for a while (with 10.1.13), probably because there were no reboots or network outages, or the problem is gone. |
| Comment by Elena Stepanova [ 2016-05-02 ] |

The white list might be tricky if you are getting errors specifically on foreign key constraints, since foreign key cascades are not really replicated, but executed separately on the slave. Anyway, even though you haven't had the problem for a while, you probably still have the error logs from previous occasions? If so, please attach them. |
| Comment by Igor Pashev [ 2016-05-02 ] |

No, the old logs are gone ¯_(ツ)_/¯
> foreign key cascades are not really replicated, but executed separately on the slave
Anyway, the problem has been gone since at least the 12th of April. |
| Comment by Igor Pashev [ 2016-06-10 ] |
|
It finally happened:
|
| Comment by Igor Pashev [ 2016-09-15 ] |

I wonder if this is caused by Amazon RDS switching between availability zones. |
| Comment by Igor Pashev [ 2016-09-21 ] |

With slave_exec_mode = IDEMPOTENT, the issue occurs on 10.1.17 but not on 10.0.25. |
| Comment by Igor Pashev [ 2016-09-24 ] |
|
I just saw it with MariaDB -> MariaDB replication (both 10.1.17). Rebooted master -> slaves failed with Error 'Duplicate entry '12345' for key 'PRIMARY''. |