[MXS-4711] Maxscale crashes on network errors Created: 2023-08-10 Updated: 2023-09-26 Resolved: 2023-09-26 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Core |
| Affects Version/s: | 22.08.1, 23.02.3 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Minor |
| Reporter: | Presnickety | Assignee: | markus makela |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | crash |
| Environment: | RHEL v8.2 |
| Attachments: |
|
| Description |
|
Hi there, We're experiencing regular load balancer crashes that seem to occur while vCenter commences vMotion activities on the same physical hosts where the MariaDB/MaxScale load balancer VMs reside. We have pinned the MariaDB/MaxScale load balancer VMs, as well as the application VMs that connect to the cluster, to the physical hosts they reside on, but to no avail. We have a three-node cluster, and at least one node experiences this event daily. Memory consumption varies between 50-75% (256 GB total), CPU between 30-40% (16 vCores). Network connection errors immediately precede the crashes. Upstream Java applications interface with the cluster; DB connections vary between 20 and 80. The output below logs the core dump produced. Please advise where the dumps are located, as they don't appear in the system default location; would gdb need to be configured to enable this?

2023-04-05 23:06:12 error : (871) Network error in connection to server 'viexh-session-usage-mdb-01', session in state 'Stopping session' (DCB::State::POLLING): 104, Connection reset by peer
2023-04-05 23:08:52 alert : MaxScale 22.08.1 received fatal signal 6. Commit ID: 2a533b7bce81e767ef5b263b0b32ebb509dbfe4c
System name: Linux
Release string: Red Hat Enterprise Linux release 8.2 (Ootpa)

Thanks. |
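The question about the dump location can usually be answered from the kernel's core pattern. A minimal check, assuming standard RHEL 8 defaults (the `coredumpctl` path in the comment is the usual systemd-coredump setup, not something confirmed for this system):

```shell
# Where core dumps land is controlled by kernel.core_pattern; on RHEL 8 they
# are often piped to systemd-coredump rather than written to the working directory.
cat /proc/sys/kernel/core_pattern
# If the pattern begins with "|/usr/lib/systemd/systemd-coredump", the dumps
# can be listed with: coredumpctl list maxscale
```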
| Comments |
| Comment by markus makela [ 2023-08-10 ] |
|
Have you tried upgrading to the latest release? |
| Comment by Presnickety [ 2023-08-12 ] |
|
Hi, Yesterday we upgraded to v23.02.3. The issue re-occurred this morning; please refer to the attached log and server state changes graph. Thanks. |
| Comment by markus makela [ 2023-08-12 ] |
|
One possibility is that a reverse DNS lookup done during authentication is taking a long time. To rule this out, you can add skip_name_resolve=true under the [maxscale] section. At the same time, I think it'd be a good idea to enable GDB stacktraces by adding debug=gdb-stacktrace in the same [maxscale] section. You'll also need to install GDB on the MaxScale server; MaxScale will then use it to generate a very detailed stacktrace from all threads at the time the problem occurs. This way, if the problem does reoccur, we'll have all the information we need to fix it. |
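The two suggested settings would be added like this (a sketch; /etc/maxscale.cnf is the default config path and is assumed here):

```ini
# /etc/maxscale.cnf (default path, assumed)
[maxscale]
# Skip the reverse DNS lookup performed during authentication
skip_name_resolve=true
# Produce detailed GDB stacktraces on crashes (requires gdb to be installed)
debug=gdb-stacktrace
```

MaxScale needs a restart for these global settings to take effect.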
| Comment by Presnickety [ 2023-08-14 ] |
|
Hi there, We updated the config as per the recommendation; it has been in place for over 30 hours. A few "lost connection" events occurred this morning, along with a master DB failover, but no load balancer crash. We will continue to monitor. Thanks. |
| Comment by Presnickety [ 2023-08-16 ] |
|
Hi there, We had a crash on stand-by load balancer #2 (#1 has been the active node for the last few days). Please see the attached, and let us know if you need further info. Thanks. |
| Comment by markus makela [ 2023-08-17 ] |
|
Odd, it looks like all of the worker threads are idle and systemd should not kill the MaxScale process in that case. Can you verify that the systemd journal has a message for the maxscale service about the watchdog timeout being exceeded? |
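The check suggested here amounts to searching the maxscale unit's journal for watchdog messages. A sketch, using an illustrative journal excerpt (the message text and PID are assumptions, not taken from this system; the real check is just the `journalctl` pipeline in the comments):

```shell
# Real check on the affected host:  journalctl -u maxscale.service | grep -i 'watchdog'
# Illustrative excerpt of what a systemd watchdog kill looks like:
cat <<'EOF' > /tmp/maxscale-journal.txt
Aug 16 03:12:01 mdb-02 systemd[1]: maxscale.service: Watchdog timeout (limit 1min)!
Aug 16 03:12:01 mdb-02 systemd[1]: maxscale.service: Killing process 1234 (maxscale) with signal SIGABRT.
EOF
grep -i 'watchdog timeout' /tmp/maxscale-journal.txt
```

A watchdog kill via SIGABRT would also explain the "received fatal signal 6" line in the original report.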
| Comment by Presnickety [ 2023-08-21 ] |
|
Hi Markus, Yes, we see several such messages. Please refer to the attached. FYI, you will see a few keepalived messages; we have configured keepalived to provide a VIP on each node in the cluster. Thanks. MXS-4711_maxscale-logs-04.txt |
| Comment by markus makela [ 2023-08-21 ] |
|
If this is the standby MaxScale instance that's causing problems, it could be that the kafkacdc router is the source of the problems as it'll process data even if there's no client traffic. You can confirm this by removing the service and seeing if the watchdog timeouts stop on the standby MaxScale. Given the nature of the kafkacdc router, I wouldn't expect you to need two instances to be up at all times since they'll both push duplicate events to Kafka. Has the other MaxScale with skip_name_resolve=true had any problems so far? If this seems to make the problem go away, the explanation could be a slow reverse name lookup but this would not explain why the standby MaxScale is behaving the way it is. |
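For reference, a kafkacdc service definition in the MaxScale config typically looks like the fragment below; commenting it out (or removing it) on the standby is the test being proposed. All names, credentials, and addresses here are placeholders, not values from this deployment:

```ini
# Hypothetical kafkacdc service definition (all values are placeholders)
[Kafka-CDC-Service]
type=service
router=kafkacdc
servers=viexh-session-usage-mdb-01
user=maxuser
password=maxpwd
bootstrap_servers=kafka.example.com:9092
topic=maxscale-cdc
```

Because this router streams replication events to Kafka regardless of client traffic, it keeps the standby busy even when no application connections pass through it, which is why it is a plausible suspect for the standby's watchdog timeouts.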
| Comment by Presnickety [ 2023-08-21 ] |
|
Node viexh-session-usage-mdb-01 is the active node for the load balancer, and this latest issue is occurring on the active node. We've experienced issues across all three nodes. We are aware of the duplicate KafkaCDC records produced across the cluster; we can live with that for now. We configured skip_name_resolve=true across the cluster and have seen far fewer load balancer crashes since then. |
| Comment by markus makela [ 2023-08-21 ] |
|
There's an internal thread in MaxScale that monitors the state of all other threads to make sure they aren't stuck. The upcoming 22.08.8 release will contain improvements that log the name of the stuck thread when one is detected. Once the release is out, you could upgrade the MaxScale instances and we should see which thread is stuck. |
| Comment by Presnickety [ 2023-08-25 ] |
|
Hello Markus, We'll upgrade to that version when it's available. The issue we have is that both vSAN and vMotion traffic share the same allocated bandwidth through the physical host NICs, so whenever a vMotion occurs, the associated data transfer chokes everything else. We're currently running the two 10 Gb host NICs as active/standby; we will configure them as active/active, thereby doubling throughput to 20 Gb, and see if that helps. Please close the ticket if you need to. Thanks. |
| Comment by markus makela [ 2023-08-25 ] |
|
OK, I think that means it's most likely a DNS request that's blocking MaxScale, leaving it unable to respond to the systemd watchdog quickly enough to be considered alive. Do you notice a slowdown in the client applications whenever this happens? If you do, that would support the theory that DNS lookups are causing it. |
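The DNS theory can be probed directly from the MaxScale host by timing a reverse lookup. A minimal sketch; 127.0.0.1 is a placeholder, and a real client IP would need to be substituted to exercise the same lookup path MaxScale uses during authentication:

```shell
# Rough check of reverse-lookup latency from the MaxScale host.
# 127.0.0.1 is a placeholder; use an actual client IP for a meaningful test.
time getent hosts 127.0.0.1
```

If this command stalls for seconds during a vMotion window while returning instantly otherwise, that would corroborate slow reverse DNS as the cause of the watchdog timeouts.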
| Comment by Presnickety [ 2023-08-26 ] |
|
Hello Markus, It's more of a complete stop after these events. When the failover occurs, connections usually flip to the next available load balancer. We're unsure if this is related to a DNS issue, but what we observe after most failover events is that the Java apps are unable to reach the DB backends and report "Exhausted to Serve". At this point the connection count usually remains low, and even when it does remain high, the only way to resolve this is to restart the apps: {{2023-08-26 07:19:33,991 ERROR com.telstra.mds.extrahop.kafka.consumer.AccountingListener [mds-cpvnf-extrahop-10-C-1] All Retry-Attempts=201 Exhausted to Serve AcctRecord= {"username":"XYZ","acct_status_type":"Interim","acct_session_id":"13413309","event_timestamp":1692998370,"acct_input_octets":946925639,"acct_output_octets":1633297247,"acct_session_time":43200,"acct_delay_time":0,"frame_ip":"100.70.152.54","acct_input_gigawords":0,"nas_identifier":"XYZ","nas_port":3145,"nas_port_id":"ae1.demux0.3222004542:202-3145","nas_port_type":"Ethernet(15)","acct_output_gigawords":4,"nas_ip":"XYZ","last_hb":1692998371062.749}}} Thanks |
| Comment by markus makela [ 2023-09-26 ] |
|
I'll close this ticket now that most of the problems have been solved. There's an open issue ( |