MariaDB Server / MDEV-13448

Slave should reconnect less quickly when being disconnected due to duplicate server id

Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 10.2.7
    • Fix Version/s: 10.2(EOL)
    • Component/s: Replication
    • Labels: None

    Description

      With MDEV-11177 fixed, duplicate server IDs are now reported properly in the slave's error log.

      Affected slaves still try to reconnect immediately, though, flooding their error log with several lines per second.

      Since a duplicate server id requires a configuration fix, which cannot happen within a fraction of a second, slaves should throttle their reconnect attempts to at most one per second, maybe even fewer.
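The throttling requested above could look roughly like the following. This is only a sketch of a capped exponential backoff, not MariaDB code: the function name `reconnect_delays` and the `base`, `factor` and `cap` parameters are hypothetical and chosen for illustration. The 1-second starting value mirrors the "one per second" limit suggested in the description.

```python
from itertools import islice


def reconnect_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield the wait, in seconds, before each successive reconnect attempt.

    Hypothetical policy: start at `base` seconds, multiply by `factor`
    after every failed attempt, and never wait longer than `cap` seconds.
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


# Waits before the first eight reconnect attempts.
first_eight = list(islice(reconnect_delays(), 8))
print(first_eight)
```

With these assumed defaults the slave never retries more often than once per second, and the error log receives at most one burst of messages per minute once the cap is reached.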

          Activity

            elenst Elena Stepanova added a comment -

            I would actually think that it makes no sense at all to try to reconnect automatically in this case; instead, let the administrator solve the problem and resume replication manually. Automatic reconnect makes both slaves non-functional anyway, because they keep going in a loop and only burn CPU cycles and disk: slave1 connects, then slave2 connects and kicks out slave1, slave1 reconnects and kicks out slave2, slave2 reconnects and kicks out slave1, etc.

            But I'll leave it to Elkin to decide what's the best way to handle it.
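The ping-pong described in this comment can be modelled in a few lines. The sketch below is a toy simulation, not the server's logic: it assumes the master keeps at most one connection per numeric server id and evicts the previous holder on every new connection. The names `simulate`, `dump_threads` and the id value 7 are made up for illustration.

```python
def simulate(rounds, server_id=7):
    """Model two slaves that share one server id and reconnect immediately.

    Assumption: the master tracks one connection per server_id, so each
    new connection with that id evicts whoever held it before.
    """
    dump_threads = {}  # server_id -> name of the currently connected slave
    slaves = ["slave1", "slave2"]
    log = []
    for i in range(rounds):
        newcomer = slaves[i % 2]  # the evicted slave retries right away
        evicted = dump_threads.get(server_id)
        dump_threads[server_id] = newcomer
        if evicted is None:
            log.append(f"{newcomer} connects")
        else:
            log.append(f"{newcomer} connects, kicks out {evicted}")
    return log


for line in simulate(4):
    print(line)
```

In this model neither slave ever stays connected for longer than one round, which is the loop the comment argues against.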
            Elkin Andrei Elkin added a comment - edited

            I agree with elenst that automated reconnecting must be limited. In fact,
            --master-retry-count exists for exactly that; it just has a dubiously
            large default of 86400. We should consider lowering it.

            As to the fight between two identically numbered slaves, a way to prevent
            that could be to use a unique identifier instead of the number. Such a
            unique id could be associated with the slave server at least for its
            runtime, so that a reconnecting slave would trigger kicking out only its
            own former thread handle.


            elenst Elena Stepanova added a comment -

            Just to clarify, I meant that automatic reconnect doesn't make sense in this particular situation, when identical slave IDs are detected, not in general. For example, automatic reconnect after a temporary loss of connection to the master, e.g. due to network issues, a master restart and such, makes perfect sense (although I have no strong opinion on the right default value for retries).
            Elkin Andrei Elkin added a comment -

            Let me also expand on my idea. Identical slave IDs may appear on the master side even with a single slave server, not only with two: recall the zombie dump thread.
            A zombie gets exterminated by its successor. However, a successor has no full proof
            that what it kills is indeed its ancestor, because kill_zombie_dump_threads() considers only the numeric server_id.
            We could make the slave server, and then its IO thread, identify itself more uniquely, and refine the kill function to identify the actual ancestor to kill.
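The refinement proposed here can be illustrated with a small model. This is not the real kill_zombie_dump_threads(); it is a Python sketch that only contrasts the two matching rules. The `DumpThread` record, the `instance_id` field (a per-runtime unique id, as suggested earlier in the thread) and both helper functions are hypothetical.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class DumpThread:
    thread_id: int
    server_id: int    # numeric id from the slave's configuration
    instance_id: str  # hypothetical unique id generated at slave startup


def zombies_by_number(threads, server_id):
    """Current behaviour as described: match on the numeric server_id only."""
    return [t for t in threads if t.server_id == server_id]


def zombies_by_identity(threads, server_id, instance_id):
    """Proposed refinement: also require the same unique instance id."""
    return [t for t in threads
            if t.server_id == server_id and t.instance_id == instance_id]


# Two distinct slaves misconfigured with the same server_id.
slave_a = str(uuid.uuid4())
slave_b = str(uuid.uuid4())
threads = [DumpThread(1, 7, slave_a), DumpThread(2, 7, slave_b)]

# Slave A reconnects and asks the master to clean up its predecessor.
print([t.thread_id for t in zombies_by_number(threads, 7)])
print([t.thread_id for t in zombies_by_identity(threads, 7, slave_a)])
```

Under the number-only rule the reconnecting slave would also evict the other slave's thread; with the extra identity check it would remove only its own stale thread, so a genuine zombie is still cleaned up while a second, misconfigured slave is left alone.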


            People

              Elkin Andrei Elkin
              hholzgra Hartmut Holzgraefe