[MXS-1198] Interface retry bind interval (of a listener) increases by ten seconds every time it fails (10, 20, 30, ...); it should be a fixed interval (and maybe configurable)
Created: 2017-03-22  Updated: 2020-08-25  Resolved: 2017-03-24

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 2.0.4
Fix Version/s: 2.2.0

Type: Bug Priority: Major
Reporter: Claudio Nanni Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None


 Description   

In some high-availability setups, several MaxScale instances run at once and all of them, including the standby ones, are configured to bind to VIPs.
In such a scenario, the standby MaxScale instances cannot bind to the VIP while it is held by the active node, so the listener periodically retries the bind, which is good.
The anomaly is that the retry interval grows by 10 seconds after every failed attempt, which does not seem to have any rationale behind it.
As a consequence, after a long time the standby MaxScale instance is effectively unable to rebind quickly enough when it is needed.
The relevant code is presumably in server/core/service.c:

432 else if (service->retry_start)
433 {
434     /** Service failed to start any ports. Try again later. */
435     service->stats.n_failed_starts++;
436     char taskname[strlen(service->name) + strlen("_start_retry_") +
437                   (int) ceil(log10(INT_MAX)) + 1];
438     int retry_after = MIN(service->stats.n_failed_starts * 10, SERVICE_MAX_RETRY_INTERVAL);
439     snprintf(taskname, sizeof(taskname), "%s_start_retry_%d",
440              service->name, service->stats.n_failed_starts);
441     hktask_oneshot(taskname, service_internal_restart,
442                    (void*) service, retry_after);
443     MXS_NOTICE("Failed to start service %s, retrying in %d seconds.",
444                service->name, retry_after);
445
446     /** This will prevent MaxScale from shutting down if service start is retried later */
447     listeners = 1;
448 }
449 }
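To put numbers on the growth, here is a minimal standalone C sketch that reproduces the formula on line 438 and adds up the scheduled delays. It assumes a cap of 3600 seconds (the maximum stated in the comments on this issue) and that the first failed start happens at t = 0; `pending_interval` is a hypothetical helper, not part of MaxScale.

```c
/* Assumed cap: the comments on this issue state a maximum of 3600 seconds. */
#ifndef SERVICE_MAX_RETRY_INTERVAL
#define SERVICE_MAX_RETRY_INTERVAL 3600
#endif
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

/* Delay scheduled after the n-th failed start (n >= 1), as on line 438. */
static int retry_after(int n_failed_starts)
{
    return MIN(n_failed_starts * 10, SERVICE_MAX_RETRY_INTERVAL);
}

/*
 * Hypothetical helper: the retry interval that is pending once a standby
 * instance has been failing to bind for `elapsed` seconds. The n-th failure
 * happens at the sum of the first n-1 delays (the first one at t = 0).
 */
static int pending_interval(long elapsed)
{
    long t = 0; /* time of the most recent failure */
    int n = 1;  /* failure count at time t */

    /* Advance while the next retry (and therefore failure) has already happened. */
    while (t + retry_after(n) <= elapsed)
    {
        t += retry_after(n);
        n++;
    }
    return retry_after(n);
}
```

Under these assumptions, `pending_interval(3600)` returns 270 and `pending_interval(86400)` returns 1310: after one hour in standby, a failover can go unnoticed for up to roughly 270 seconds, and after one day for up to roughly 1310 seconds (about 22 minutes). The 3600-second cap is only reached at the 360th failure, 646200 seconds (about 7.5 days) after the first one.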

I think line 438 should just be:

438 int retry_after = MIN(10, SERVICE_MAX_RETRY_INTERVAL);

where 10 means 10 seconds; eventually this could be a configurable variable.
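A minimal sketch of this proposal, assuming a hypothetical per-service setting (called `retry_interval` here; it is not an existing MaxScale parameter) that defaults to 10 seconds. The `retry_config` struct and `on_failed_start` function are illustrative stand-ins for the service and the failure branch quoted above. The failure counter is still incremented, since it feeds the statistics and the unique task name, but it no longer influences the delay.

```c
#ifndef DEFAULT_RETRY_INTERVAL
#define DEFAULT_RETRY_INTERVAL 10 /* seconds; assumed default */
#endif
#ifndef SERVICE_MAX_RETRY_INTERVAL
#define SERVICE_MAX_RETRY_INTERVAL 3600 /* assumed cap, as in the comments */
#endif
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

/* Hypothetical service settings; not MaxScale's real SERVICE struct. */
typedef struct
{
    int retry_interval;  /* configured interval in seconds; 0 = use default */
    int n_failed_starts; /* kept for statistics and task naming */
} retry_config;

/* Delay before the next bind attempt: independent of the failure count. */
static int fixed_retry_after(const retry_config *cfg)
{
    int interval = cfg->retry_interval > 0 ? cfg->retry_interval
                                           : DEFAULT_RETRY_INTERVAL;
    return MIN(interval, SERVICE_MAX_RETRY_INTERVAL);
}

/* Hypothetical stand-in for the failure branch: count, then pick the delay. */
static int on_failed_start(retry_config *cfg)
{
    cfg->n_failed_starts++;
    return fixed_retry_after(cfg);
}
```

With this shape, the worst-case time for a standby instance to notice that the VIP has become free stays bounded by the configured interval, no matter how long the instance has already been waiting.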



 Comments   
Comment by markus makela [ 2017-03-22 ]

The maximum retry interval is 3600 seconds. This should be made configurable in a future release.

Comment by markus makela [ 2017-03-24 ]

The maximum interval is now configurable.

Comment by Claudio Nanni [ 2017-03-29 ]

Currently, the retry time increases by 10 seconds on each attempt, up to the limit set by max_retry_time.
I'm not sure about the logic behind increasing the retry time, and behind increasing it by 10 seconds each time; in HA failover scenarios it quickly leads to an interval that is too long to be useful.
Even with the new option to set the maximum retry time, that setting would probably need to be only a few seconds to be useful, so it would always be quite small anyway.
In other words, I think a single fixed retry time (one variable, one value) would probably be more useful than an increasing time plus a maximum.
What is your opinion?
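To illustrate the point about a small maximum: with a formula of the form `MIN(n * 10, max)`, any maximum of 10 seconds or less makes every retry use the maximum, and a maximum of a few tens of seconds turns the schedule into a fixed interval after only a few attempts. The sketch below assumes the configurable maximum simply takes the place of SERVICE_MAX_RETRY_INTERVAL in that formula; both function names are illustrative only.

```c
/* Capped linear backoff, with the cap passed in as the configurable maximum. */
static int capped_retry_after(int n_failed_starts, int max_retry)
{
    int linear = n_failed_starts * 10;
    return linear < max_retry ? linear : max_retry;
}

/* Number of failed starts after which the interval stops growing. */
static int failures_until_fixed(int max_retry)
{
    int n = 1;
    /* Terminates because the linear term keeps growing until it hits the cap. */
    while (capped_retry_after(n, max_retry) < max_retry)
    {
        n++;
    }
    return n;
}
```

For example, with a maximum of 30 seconds the intervals are 10, 20, 30, 30, ..., so `failures_until_fixed(30)` returns 3, while a 3600-second maximum is only reached at the 360th failure. In practice, a maximum small enough to be useful for failover behaves almost like the single fixed value proposed here.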

Generated at Thu Feb 08 04:04:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.