[MXS-3298] DNS server failure crashes Maxscale Created: 2020-11-13  Updated: 2024-01-29  Resolved: 2021-09-02

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 2.4.13
Fix Version/s: 2.5.16

Type: Bug Priority: Minor
Reporter: Kyle Joiner (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None


 Description   

During a primary DNS failure, MaxScale does not fail over to the secondary DNS server and crashes with a system timeout.

Configured:
GATEWAY=10.x.x.10
DNS1=10.x.x.190
DNS2=10.x.x.180

When DNS1 went offline:

2020-11-11 21:09:10 notice : Server 'server-2' version: 10.3.27-MariaDB-log
2020-11-12 18:39:18 error : (13633037) [mariadbbackend] getpeername()' failed on connection to 'server-2' when forming proxy protocol header. Error 107: 'Transport endpoint is not connected'
2020-11-12 18:39:18 error : (13633037) Write to Backend DCB ::ffff:X.X.X.X in state DCB_STATE_POLLING failed: 104, Connection reset by peer
2020-11-12 18:39:26 alert : Fatal: MaxScale 2.4.13 received fatal signal 6. Commit ID: faaf7f483eeb7afd75a5ca08fa258fae0d8c1456 System name: Linux Release string: NAME="CentOS Linux"
2020-11-12 18:39:26 alert : Statement currently being classified: none/unknown
2020-11-12 18:39:26 alert :
/lib64/libc.so.6(epoll_wait+0x33): :?
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0xd0): maxutils/maxbase/src/worker.cc:795
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53): maxutils/maxbase/src/worker.cc:559
/usr/bin/maxscale(main+0x295a): server/core/gateway.cc:2339
/lib64/libc.so.6(__libc_start_main+0xf5): ??:?
/usr/bin/maxscale(): ??:?



 Comments   
Comment by markus makela [ 2020-11-24 ]

This happens because getaddrinfo is a blocking system call. The proper way to do this would be to either do it via the asynchronous getaddrinfo_a or use the MaxScale threadpool for it.
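As a rough illustration of the thread-pool option (not the actual MaxScale change), the blocking lookup can be handed to a worker thread so the caller waits only for a bounded time. The sketch below is in Python for brevity; `resolve_with_deadline` and `_resolver_pool` are hypothetical names introduced here, and the pool size and deadline are arbitrary assumptions.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical helper for illustration only; this is not MaxScale code.
# A small dedicated pool keeps slow DNS lookups off the calling thread.
_resolver_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="resolver")


def resolve_with_deadline(host, port, deadline_s):
    """Resolve host on a worker thread.

    Returns a list of IP address strings, or None if the lookup did not
    finish within deadline_s seconds. Resolution errors (socket.gaierror)
    are re-raised to the caller.
    """
    future = _resolver_pool.submit(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    try:
        infos = future.result(timeout=deadline_s)
    except FutureTimeout:
        # The lookup is still running on the worker thread; only the
        # caller stops waiting for it.
        return None
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return [info[4][0] for info in infos]
```

Note that a timed-out lookup keeps occupying a pool thread until the resolver gives up, so this bounds the caller's wait rather than the lookup itself.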

Comment by markus makela [ 2021-08-30 ]

Looking at this StackOverflow issue, we see that this can actually be fixed by configuring a lower timeout for the address resolver.

If the default timeout of five seconds and two attempts is used, this should not cause problems. If the timeout is set to a higher value, the systemd watchdog timeout should probably be adjusted as well to avoid a watchdog timeout.
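To compare a resolver configuration against a watchdog interval, a small script can read the `timeout` and `attempts` options from resolv.conf-style text and compute a rough upper bound on how long one lookup might block. This is only a sketch: both function names are hypothetical, the addresses in the usage example are placeholders, and the bound assumes every attempt waits the full timeout on every configured nameserver, which simplifies what the resolver actually does.

```python
def parse_resolver_options(resolv_conf_text):
    """Return (nameservers, timeout_s, attempts) from resolv.conf-style text."""
    nameservers = []
    timeout_s, attempts = 5, 2  # defaults cited in the comment above
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                if name == "timeout" and value.isdigit():
                    timeout_s = int(value)
                elif name == "attempts" and value.isdigit():
                    attempts = int(value)
    return nameservers, timeout_s, attempts


def rough_worst_case_seconds(nameserver_count, timeout_s, attempts):
    """Rough bound: each attempt waits the full timeout on each server (assumption)."""
    return max(nameserver_count, 1) * timeout_s * attempts
```

For example, with placeholder addresses:

```python
servers, timeout_s, attempts = parse_resolver_options(
    "nameserver 192.0.2.1\nnameserver 192.0.2.2\noptions timeout:1 attempts:2\n"
)
bound = rough_worst_case_seconds(len(servers), timeout_s, attempts)
```

If the bound is close to or above the watchdog interval, either the resolver options or the watchdog setting is a candidate for adjustment.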

Comment by markus makela [ 2021-08-31 ]

Even if the timeout is configured to a high value, the watchdog timeout can be avoided with a few changes in code.

Generated at Thu Feb 08 04:20:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.