Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 10.0.20
- Environment: Ubuntu 14.04.2 LTS
- Can result in unexpected behaviour
Description
Last night we had an odd issue where we suddenly got "duplicate key" errors on a multi-source replication setup that has always worked perfectly...
Our multi-source slave runs on ZFS and a snapshot is created every minute. Before each snapshot, all slave connections are stopped and the server is locked with a global read lock to make the snapshot as clean as possible.
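For reference, the snapshot script does roughly the following (the ZFS dataset name and snapshot label below are placeholders, and the real script differs in details):

    STOP ALL SLAVES;                 -- stop every replication connection
    FLUSH TABLES WITH READ LOCK;     -- take the global read lock
    -- from the shell, while the lock is held:
    --   zfs snapshot tank/mysql@2015-06-08_02:00
    UNLOCK TABLES;                   -- release the global read lock
    START ALL SLAVES;                -- resume all replication connections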
First we tried to find the source of the duplicate key errors, but each offending query was only to be found once in the binary logs on the masters. We skipped a few queries, but had no luck getting the slave replication restarted, so we started to scratch our heads...
We temporarily disabled snapshotting, did a zfs rollback to before the problem occurred and started mysqld up again. It went right past the point where it had stopped initially, and we were happy and thought the problem was solved...
We ran our script to perform a snapshot and instantly got a duplicate key error! We looked at the relay logs and found that the connection was on relay log file number 999999 while it still worked and on 1000000 when it failed, so we had to look into this...
We ended up doing a "RESET SLAVE ALL" on the connection that failed and setting the slave up again (with the correct master log file and position) to get a clean relay log. After that everything just worked: we got past the point where it had stopped initially, and snapshotting has worked from then on and still does 10 hours later...
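The re-setup was essentially the usual multi-source sequence; the connection name, host, credentials and binlog coordinates below are placeholders, not our real values:

    STOP SLAVE 'master1';
    RESET SLAVE 'master1' ALL;          -- drop the connection and its relay logs
    CHANGE MASTER 'master1' TO
        MASTER_HOST='master1.example.com',
        MASTER_USER='repl',
        MASTER_PASSWORD='...',
        MASTER_LOG_FILE='mysql-bin.000123',  -- coordinates the slave had reached
        MASTER_LOG_POS=4;
    START SLAVE 'master1';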
The rollover from relay log file number 999999 to 1000000 is our only suspect, but the odd thing is that one other server has a file number above 1010500 and we didn't hit a problem when it rolled over. The rest of the servers are around 600000 (2 servers) and 700000 (2 servers), so they will reach 1000000 in the foreseeable future :/
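To keep an eye on the remaining servers we simply check which relay log file each connection is currently on, along the lines of:

    SHOW ALL SLAVES STATUS\G
    -- Relay_Log_File shows the current relay log file per connection;
    -- its numeric extension is the counter that rolled over to 1000000 here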