[MDEV-16685] Replication data drift Created: 2018-07-04 Updated: 2022-10-08 Resolved: 2022-10-08 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0.26, 10.0.30, 10.0.33 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Gabriel Garrido (Inactive) | Assignee: | Andrei Elkin |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None |
| Environment: | CentOS 7.2-7.4 |
| Description |
|
We've configured 4 MariaDB 10.0.xx slaves. The master runs the same major and minor version on about the same hardware (32 cores, 128 GB RAM, 4.4 TB of data on SSD RAID). Everything was running fine until we noticed that some data was missing on the slaves (that's when we started reading about data drift in MySQL/MariaDB replication). After reading thousands of articles and trying to work out what happened, we noticed that the data is missing on the slaves but the schemas are being updated. For example, if I create a new database with some tables, it shows up on the slaves, but any data I insert into it does not. The same has been happening to the rest of the databases and tables (there are thousands). No errors are displayed by `SHOW SLAVE STATUS\G`, yet replication is only partially working. We analyzed the files on the slaves with mysqlbinlog and concluded that the data wasn't being sent to the slaves. Because there is so much data, it is a big effort to reconfigure, copy, and restart even one slave (knowing that it could hit the same issue again in a few weeks). Here are some config parameters of interest:
Master Config:
Slave config (only different values, skipping server-id, log-bin, relay-log)
Master status (Note that the master status was added several hours after the slave status):
Slave status:
We also investigated Percona tools like pt-table-checksum and pt-table-sync, but they are useless if the data is not getting inserted into the binlog. Could it be a bug? The config for the slaves is similar to what's displayed here, so my question is: why does MySQL keep replicating schemas but not data? Thanks. |
| Comments |
| Comment by Elena Stepanova [ 2018-07-04 ] |
|
You said you looked at the slaves using mysqlbinlog and discovered that the data wasn't sent to the slaves. |
| Comment by Gabriel Garrido (Inactive) [ 2018-07-04 ] |
|
This is from the master's binlog, and the insert is present. What makes this issue so odd is that the schemas are being updated on the slaves, but not the data. Below you can see the insert extracted from the master (it was a test database created while checking this issue):
|
| Comment by Gabriel Garrido (Inactive) [ 2018-07-05 ] |
|
Let me know if you need anything else from me. I should be able to provide any additional information if needed. |
| Comment by Elena Stepanova [ 2018-07-05 ] |
|
Thanks. |
| Comment by Gabriel Garrido (Inactive) [ 2018-07-05 ] |
|
I just tested this: I dumped the relay log, and the insert is present there. In my previous comments I was only talking about the binary log. This is from the relay log on the slave:
Thanks. |
| Comment by Gabriel Garrido (Inactive) [ 2018-07-10 ] |
|
Last night we stopped the master server and all 4 slaves. After starting the master again, we started 2 slaves, which threw this error:
After that, as a test, I added slave_skip_errors = 1032 on the other 2 slaves and started them. These 2 slaves began to sync correctly, and data started getting inserted into the tables again (only new data). Of course, these are totally corrupted slaves at this point, but it took a master restart and the slave_skip_errors = 1032 parameter to recover from the missing-data issue. The real problem here is the silent replication break: it would help if we could at least tell when it breaks without having to implement a custom script or check on a daily basis. Also, because we use binlog-format = ROW, we cannot use pt-table-checksum and pt-table-sync to resync in such a case. Thanks. |
| Comment by Gabriel Garrido (Inactive) [ 2018-12-28 ] |
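One way to notice the silent break described above without inspecting binlogs by hand is to compare per-table checksums from the master and each slave. The following is only a minimal sketch, not a tested monitoring tool: it assumes the checksums were already collected on each server (for example with `CHECKSUM TABLE`, while writes are paused or replication has caught up), and the function name `find_drift` and the sample table names and values are hypothetical.

```python
from typing import Dict, List, Optional


def find_drift(master: Dict[str, Optional[int]],
               slave: Dict[str, Optional[int]]) -> List[str]:
    """Return the sorted names of master tables whose checksum on the slave
    differs from the master's, or which are missing on the slave."""
    return sorted(
        table for table, checksum in master.items()
        if slave.get(table) != checksum
    )


# Hypothetical checksums, as if gathered with CHECKSUM TABLE on each server.
master_sums = {"test.t1": 1001, "test.t2": 2002, "test.t3": 3003}
slave_sums = {"test.t1": 1001, "test.t2": 9999}

drifted = find_drift(master_sums, slave_sums)
print(drifted)
```

A non-empty result only says that the two servers disagree at the moment the checksums were taken; replication lag alone can cause that, so a check like this is meaningful only when the slave has applied everything the master has written.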
|
Hello, it's been a while since this was reported, and we were wondering whether the root cause of the issue was found, or whether you were able to reproduce it. If there is any more information that I can provide, please just ask. We still live with this bug. Thank you and happy holidays. |
| Comment by Andrei Elkin [ 2021-03-23 ] |
|
gabriel.garrido: Perhaps this has long been irrelevant, but it looks like your issue is related to the fact that the slaves were configured to ignore duplicate-key errors (1062). It's a normal thing to do, but such a configuration can generally lead to data loss on the slave. |
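To illustrate the point about skipped errors, here is a toy model, not MariaDB's actual applier logic, of how a slave with error 1062 listed in slave_skip_errors can silently drop a row event. The function `apply_insert`, the dictionary standing in for a table, and the sample rows are all hypothetical.

```python
DUPLICATE_KEY = 1062  # ER_DUP_ENTRY


def apply_insert(table, key, value, skip_errors):
    """Apply one INSERT row event to a toy slave table (a dict keyed by primary key).

    Returns True if the row was written, False if the event was skipped.
    Raises RuntimeError if the error is not in skip_errors, mimicking a
    replication stop.
    """
    if key in table:
        if DUPLICATE_KEY in skip_errors:
            # The error is ignored: replication keeps running,
            # but the master's row is never applied.
            return False
        raise RuntimeError(f"replication stopped with error {DUPLICATE_KEY}")
    table[key] = value
    return True


# The slave already holds a different row under the same primary key.
slave = {1: "stale"}
applied = apply_insert(slave, 1, "fresh", skip_errors={DUPLICATE_KEY})
print(applied, slave)
```

In this model the slave reports no error and keeps its old row, which matches the kind of quiet divergence the comment warns about; without the skip setting, the same event would stop replication and make the problem visible.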