[MDEV-13448] Slave should reconnect less quickly when being disconnected due to duplicate server id - Jira

Details

Type: Bug
Status: Confirmed
Priority: Minor
Resolution: Unresolved
Affects Version/s: 10.2.7
Fix Version/s: 10.2(EOL)
Component/s: Replication
Labels: None

Description

With ~~MDEV-11177~~ fixed, duplicate server IDs are now reported properly in the slaves' error logs.

Affected slaves still try to reconnect immediately, though, flooding their error logs with several lines per second.

A duplicate server ID clearly requires a configuration fix, which won't happen within a fraction of a second, so slaves should throttle their reconnect attempts to at most one per second, maybe even fewer.
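The throttling proposed above could look roughly like the following. This is only an illustrative sketch, not MariaDB code: the function name `reconnect_delays` and the parameters `base`, `factor`, and `cap` are hypothetical, and exponential backoff is just one possible policy (a fixed one-second delay would also satisfy the request).

```python
import itertools


def reconnect_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield the wait in seconds before each successive reconnect attempt.

    Hypothetical policy: start at `base` seconds (the one-per-second
    ceiling asked for in this report), multiply by `factor` after every
    attempt, and never wait longer than `cap` seconds.
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


# Waits before the first eight reconnect attempts.
first_eight = list(itertools.islice(reconnect_delays(), 8))
print(first_eight)
```

With these assumed defaults a slave that keeps getting kicked out would write at most one burst of log lines per minute once the cap is reached, instead of several per second.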

Issue Links

relates to

MDEV-11177 mysqlbinlog exits silently without error when another instance connects to server

Closed

Activity

Elena Stepanova added a comment - 2017-08-04 22:26

I would actually think that it makes no sense at all to try to reconnect automatically in this case; instead, the administrator should solve the problem and resume replication manually. Automatic reconnect makes both slaves non-functional anyway, because they keep going in a loop and only burn CPU cycles and disk: slave1 connects, then slave2 connects and kicks out slave1, slave1 reconnects and kicks out slave2, slave2 reconnects and kicks out slave1, etc.

But I'll leave it to Elkin to decide what's the best way to handle it.

Andrei Elkin added a comment - 2017-08-05 10:10 - edited

I agree with elenst that automated reconnecting must be limited. In fact, --master-retry-count exists for that purpose; it just has a dubiously large default of 86400. We should consider lowering it.

As for the fight between two identically numbered slaves, one way to prevent it could be to use a unique identifier instead of the number. Such a unique ID could be associated with the slave server at least for its runtime, so that a reconnecting slave would trigger kicking out only its own former thread handle.

Elena Stepanova added a comment - 2017-08-05 10:18

Just to clarify, I meant that automatic reconnect doesn't make sense in this particular situation, when identical slave IDs are detected, not in general. For example, automatic reconnect after a temporary loss of connection to the master, e.g. due to network issues, a master restart and such, makes perfect sense (although I have no strong opinion on the right default value for the number of retries).
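The distinction drawn here can be sketched as a small decision function. Everything in it is hypothetical and for illustration only: the `DisconnectReason` and `Action` enums, the function `next_action`, and its parameters do not correspond to actual server code, and `max_retries` merely stands in for a limit such as --master-retry-count.

```python
from enum import Enum, auto


class DisconnectReason(Enum):
    NETWORK_ERROR = auto()
    MASTER_RESTART = auto()
    DUPLICATE_SERVER_ID = auto()


class Action(Enum):
    RETRY = auto()
    STOP_AND_WAIT_FOR_ADMIN = auto()


def next_action(reason, attempts_so_far, max_retries):
    """Decide what the slave does after losing its master connection."""
    # A duplicate server ID needs a configuration fix, so retrying
    # automatically only restarts the kick-out loop.
    if reason is DisconnectReason.DUPLICATE_SERVER_ID:
        return Action.STOP_AND_WAIT_FOR_ADMIN
    # Transient failures are retried until the retry budget is used up.
    if attempts_so_far >= max_retries:
        return Action.STOP_AND_WAIT_FOR_ADMIN
    return Action.RETRY
```

For example, `next_action(DisconnectReason.NETWORK_ERROR, 3, 86400)` keeps retrying, while any call with `DisconnectReason.DUPLICATE_SERVER_ID` stops immediately regardless of the remaining budget.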

Andrei Elkin added a comment - 2017-08-07 09:50

Let me also expand on my idea. Identical slave IDs may appear on the master side even with a single slave server; recall the zombie dump thread. A zombie gets exterminated by its successor. However, the successor has no full proof that what it kills is indeed its ancestor, as kill_zombie_dump_threads() considers only the numeric server_id. We could make the slave server, and then its IO thread, identify itself more uniquely, and refine the kill function to identify the actual ancestor to kill.
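To make the difference concrete, here is a toy model of the two matching rules. It is a sketch under assumptions, not the server's implementation: the `DumpThread` record, the `slave_uuid` field, and both helper functions are hypothetical, and the real kill_zombie_dump_threads() is C++ code inside the server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DumpThread:
    thread_id: int
    server_id: int
    slave_uuid: str  # hypothetical per-runtime unique identifier


def zombies_by_server_id(threads, new):
    """Matching on the numeric server_id only, as described above."""
    return [t for t in threads if t.server_id == new.server_id]


def zombies_by_identity(threads, new):
    """Proposed refinement: also require the same unique identifier."""
    return [
        t for t in threads
        if t.server_id == new.server_id and t.slave_uuid == new.slave_uuid
    ]


# Two different slaves misconfigured with server_id 1, then slave "aaa" reconnects.
existing = [DumpThread(10, 1, "aaa"), DumpThread(11, 1, "bbb")]
reconnecting = DumpThread(12, 1, "aaa")

print([t.thread_id for t in zombies_by_server_id(existing, reconnecting)])
print([t.thread_id for t in zombies_by_identity(existing, reconnecting)])
```

In this model, matching on server_id alone selects both existing threads, including the one that belongs to the other slave, whereas matching on the additional identifier selects only the reconnecting slave's own former thread.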

People

Assignee: Andrei Elkin

Reporter: Hartmut Holzgraefe

Votes: 0

Watchers: 3

Dates

Created: 2017-08-04 10:20

Updated: 2024-07-08 00:49

MariaDB Server