[MDEV-13448] Slave should reconnect less quickly when being disconnected due to duplicate server id Created: 2017-08-04  Updated: 2018-10-30

Status: Confirmed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2.7
Fix Version/s: 10.2

Type: Bug Priority: Minor
Reporter: Hartmut Holzgraefe Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-11177 mysqlbinlog exits silently without er... Closed

 Description   

With MDEV-11177 fixed, duplicate server IDs are now reported properly in the slaves' error logs.

Affected slaves still try to reconnect immediately, though, flooding their error logs with several lines per second.

A duplicate server ID requires a configuration fix, which cannot happen within a fraction of a second. Slaves should therefore throttle their reconnect attempts to at most one per second, maybe even fewer.
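The throttling requested above can be illustrated with a minimal sketch. It is not MariaDB code: the function name `reconnect_delays` and its parameters (`base`, `factor`, `cap`, `max_retries`) are hypothetical, and the exponential growth is one possible policy, not something this report prescribes. The only property it encodes from the request is that attempts are never closer together than one second.

```python
from typing import Iterator


def reconnect_delays(base: float = 1.0, factor: float = 2.0,
                     cap: float = 60.0, max_retries: int = 5) -> Iterator[float]:
    """Yield the wait in seconds before each reconnect attempt.

    All names and defaults are hypothetical. The first wait is `base`
    (one second here); each later wait is multiplied by `factor` and
    clamped to `cap`, so the error log gets at most one entry per wait.
    """
    delay = base
    for _ in range(max_retries):
        yield delay
        delay = min(delay * factor, cap)


if __name__ == "__main__":
    # Print the schedule a slave would follow after a duplicate-ID disconnect.
    for attempt, wait in enumerate(reconnect_delays(), start=1):
        print(f"attempt {attempt}: wait {wait:.0f}s before reconnecting")
```

A fixed one-second delay is the special case `factor=1.0`.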



 Comments   
Comment by Elena Stepanova [ 2017-08-04 ]

I would actually think that it makes no sense at all to try to reconnect automatically in this case; instead, let the administrator solve the problem and resume replication manually. Automatic reconnect makes both slaves non-functional anyway, because they keep going in a loop and only burn CPU cycles and disk: slave1 connects, then slave2 connects and kicks out slave1, slave1 reconnects and kicks out slave2, slave2 reconnects and kicks out slave1, etc.

But I'll leave it to Elkin to decide what's the best way to handle it.

Comment by Andrei Elkin [ 2017-08-05 ]

I agree with elenst that automated reconnecting must be limited. In fact,
--master-retry-count exists for that purpose; it just has a dubiously large
default of 86400. We should consider lowering it.

As for the fight between two identically numbered slaves, one way to prevent
it could be to use a unique identifier instead of the number. Such a
unique ID could be associated with the slave server at least for its
runtime, so that a reconnecting slave would kick out only its own
former thread handle.

Comment by Elena Stepanova [ 2017-08-05 ]

Just to clarify, I meant that automatic reconnect doesn't make sense in this particular situation, when identical slave IDs are detected, not in general. For example, automatic reconnect after a temporary loss of connection to the master (e.g. due to network issues, a master restart and such) makes perfect sense, although I have no strong opinion on the right default value for the number of retries.

Comment by Andrei Elkin [ 2017-08-07 ]

Let me also expand on my response. Identical slave IDs may appear on the master side even with a single slave server; recall the zombie dump thread.
A zombie gets exterminated by its successor. However, the successor has no full proof
that what it kills is indeed its ancestor, as kill_zombie_dump_threads() considers only the numeric server_id.
We could make the slave server, and then its IO thread, identify itself more uniquely, and refine the kill function to identify the actual ancestor to kill.
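The difference between matching on the numeric server_id and matching on a more unique identity can be sketched as follows. This is a toy model, not the server's implementation: the `DumpThread` record, the `server_uuid` field and both helper functions are hypothetical, and only the idea of selecting a zombie by server_id alone is taken from the discussion of kill_zombie_dump_threads() above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DumpThread:
    """Toy stand-in for a master-side dump thread (hypothetical fields)."""
    thread_id: int
    server_id: int    # numeric ID, may be duplicated by misconfiguration
    server_uuid: str  # assumed unique per slave server instance


def zombies_by_server_id(threads: List[DumpThread], new: DumpThread) -> List[DumpThread]:
    """Pick every other thread sharing the newcomer's numeric server_id."""
    return [t for t in threads
            if t.server_id == new.server_id and t.thread_id != new.thread_id]


def zombies_by_uuid(threads: List[DumpThread], new: DumpThread) -> List[DumpThread]:
    """Pick only threads left behind by the very same slave instance."""
    return [t for t in threads
            if t.server_uuid == new.server_uuid and t.thread_id != new.thread_id]


if __name__ == "__main__":
    # Slave A left a zombie (thread 10); slave B (thread 11) shares server_id 2.
    existing = [DumpThread(10, 2, "uuid-A"), DumpThread(11, 2, "uuid-B")]
    reconnecting = DumpThread(12, 2, "uuid-A")
    print([t.thread_id for t in zombies_by_server_id(existing, reconnecting)])
    print([t.thread_id for t in zombies_by_uuid(existing, reconnecting)])
```

In this model, the numeric match also selects the live thread of the other slave, which is what produces the mutual kick-out loop, while the identity match selects only the reconnecting slave's own ancestor.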
