[MXS-5300] Maxscale aborts when a server is put into maintenance

Details

Type: Bug
Status: Needs Feedback
Priority: Major
Resolution: Unresolved
Affects Version/s: 22.08.8
Fix Version/s: None
Component/s: None
Labels:
- triage
Environment:
- 2 Maxscale 22.08.8 servers, `dbproxy1` and `dbproxy2` sharing a VIP using `keepalived` with binlog router enabled for external replication
- 3 MariaDB 10.6.12-7 servers, `db1`, `db2`, and `db3` with replication managed by Maxscale

Sprint:
MXS-SPRINT-219

Description

After placing a third server in a cluster into maintenance mode, the Maxscale process is terminated with signal 6 (Abort). The abort is intermittent and tends to occur on systems with more connections.

This is a critical bug that the customer's management is very aware of, and it has been happening frequently during patching for some time. The customer would like a custom build of Maxscale with more debugging for abort signals.

Configuration and log files from the time of the incidents for Maxscale, MariaDB, and keepalived for all systems are attached in a hidden comment. The Maxscale crashes happen around 2024-09-24 22:48:20 UTC.

Timeline of issue on 2024-09-24 (UTC):

1. db2 is put into maintenance mode, wait for connections to drain - 21:42:36
2. db2 has its OS patched and is rebooted - 22:02:03
3. db2 taken out of maintenance mode - 22:19:54
4. dbproxy1 has its OS patched and is rebooted - 22:23:39
5. db3 is put into maintenance mode in preparation for patching - 22:30:20
6. db1 and db2 start reporting errors reading communication packets - 22:36:40
7. dbproxy1 aborts with these log entries, keepalived does not switch over - 22:48:20

2024-09-24 22:47:50   warning: Thread 'Worker-08' has not reported back in 30 seconds.

2024-09-24 22:48:20   warning: Thread 'Worker-10' has not reported back in 30 seconds.

2024-09-24 22:48:20   warning: Thread 'Worker-08' has not reported back in 30 seconds.

2024-09-24 22:48:20   warning: Thread 'Worker-26' has not reported back in 30 seconds.

2024-09-24 22:48:20   warning: Thread 'Worker-27' has not reported back in 30 seconds.

2024-09-24 22:48:20   warning: Thread 'Worker-28' has not reported back in 30 seconds.

8165099:alert  : MaxScale 22.08.8 received fatal signal 6. Commit ID: 2f16a515391ac530a7280334dff5334f489d884e System name: Linux Release string: Ubuntu 20.04.6 LTS

26	../sysdeps/unix/sysv/linux/read.c: No such file or directory.
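To see at a glance which worker threads stalled before the abort, the watchdog warnings above can be tallied per thread. The following is a minimal sketch, not part of MaxScale: the regular expression is written only against the warning lines quoted in this report, and the names `WATCHDOG_RE`, `stalled_workers`, and `SAMPLE` are made up for illustration. Other MaxScale versions or log formats may need a different pattern.

```python
import re
from collections import Counter

# Assumption: matches the warning format exactly as quoted in this report.
WATCHDOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+warning: "
    r"Thread '(?P<thread>[^']+)' has not reported back in (?P<secs>\d+) seconds\."
)


def stalled_workers(lines):
    """Count watchdog warnings per thread, in order of first appearance."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            counts[match.group("thread")] += 1
    return counts


# The six warning lines copied from the log excerpt in this report.
SAMPLE = """\
2024-09-24 22:47:50   warning: Thread 'Worker-08' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-10' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-08' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-26' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-27' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-28' has not reported back in 30 seconds.
"""

counts = stalled_workers(SAMPLE.splitlines())
for thread, n in counts.items():
    print(f"{thread}: {n} warning(s)")
```

On the excerpt above this reports five distinct workers, with Worker-08 the only one warned about twice. Running the same function over the full attached log would show whether the same workers stall in other incidents.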

Running `systemctl restart maxscale` brings Maxscale back up without issue.
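When comparing this incident with others, it may help to express the timeline as intervals before the abort rather than as wall-clock times. The sketch below uses only the timestamps listed in the timeline; the names `EVENTS`, `at`, and `elapsed` are hypothetical helpers. The last step assumes the watchdog message means the thread had been silent for at least the stated 30 seconds, which is a reading of the log text and not something confirmed in this report.

```python
from datetime import datetime, timedelta

DAY = "2024-09-24"  # all times in the timeline are UTC on this date

# Timestamps copied from the timeline in this report.
EVENTS = {
    "db2 put into maintenance": "21:42:36",
    "db2 patched and rebooted": "22:02:03",
    "db2 taken out of maintenance": "22:19:54",
    "dbproxy1 patched and rebooted": "22:23:39",
    "db3 put into maintenance": "22:30:20",
    "db1 and db2 report packet read errors": "22:36:40",
    "dbproxy1 aborts": "22:48:20",
}


def at(hms):
    """Parse an HH:MM:SS time on the incident day."""
    return datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")


def elapsed(start, end):
    """Return the time between two HH:MM:SS timestamps as a timedelta."""
    return at(end) - at(start)


abort = EVENTS["dbproxy1 aborts"]
for name, hms in EVENTS.items():
    if name != "dbproxy1 aborts":
        print(f"{name}: {elapsed(hms, abort)} before the abort")

# Assumption: a "has not reported back in 30 seconds" warning at 22:47:50
# means Worker-08 last reported at or before 22:47:20.
latest_report = at("22:47:50") - timedelta(seconds=30)
print(f"Worker-08 stalled no later than {latest_report.time()}")
```

By this arithmetic, dbproxy1 aborted 18 minutes after db3 entered maintenance and 11 minutes 40 seconds after db1 and db2 began reporting packet read errors.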

Issue Links

is blocked by: MXS-5363 GDB stacktraces may hang (Closed)

People

Assignee: markus makela

Reporter: Paul Rothrock

Votes: 1

Watchers: 6

Dates

Created: 2024-09-25 21:05

Updated: 3 days ago 13:26