[MXS-2057] Watchdog for MaxScale Created: 2018-09-17  Updated: 2020-08-25  Resolved: 2018-11-14

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: None
Fix Version/s: 2.3.1

Type: New Feature Priority: Major
Reporter: Dipti Joshi (Inactive) Assignee: Niclas Antti
Resolution: Fixed Votes: 0
Labels: None

Sub-Tasks:
Key | Summary | Type | Status | Assignee
MXS-2149 | Create REST-API watchdog | Sub-Task | Closed | Johan Wikman
Sprint: MXS-SPRINT-69, MXS-SPRINT-70

 Description   

Provide a watchdog utility for MaxScale that runs as a process on the MaxScale server and continuously monitors the MaxScale process:
1. It should detect that MaxScale is hung and crash it with signal 6 (SIGABRT) (using watchdog?)
2. It should have the option to generate a core dump upon the crash
3. When there are multiple MaxScale nodes in an HA configuration, the watchdogs on the other MaxScale nodes should detect when the node that is set as active (i.e. passive=0) has gone down, elect one of the other MaxScale nodes to be active, and set passive=0 on the elected node



 Comments   
Comment by Hartmut Holzgraefe [ 2018-09-17 ]

Core dumps actually work just fine already: when starting maxscale from a shell where "ulimit -c unlimited" is set, I get a core file in the /var/log/maxscale directory when doing "killall -6 maxscale".

Some modifications to the systemd service file may be needed to produce core dumps when maxscale is running under systemd control though.
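
For illustration, the effect of "ulimit -c unlimited" can also be applied from inside a launcher process. The sketch below is only an assumption about how such a wrapper might look (the function name enable_core_dumps is hypothetical and not part of MaxScale); it raises the soft core-size limit to whatever the hard limit allows, which child processes then inherit.

```python
import resource


def enable_core_dumps() -> tuple:
    """Raise the soft core-size limit to the hard limit for this process.

    Hypothetical helper: roughly what "ulimit -c unlimited" does in a shell,
    except that it never tries to exceed the existing hard limit.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    # Return the limits now in effect so the caller can log them.
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    print("core limit (soft, hard):", soft, hard)
```

Under systemd the limit is normally set in the unit itself rather than by a wrapper; the LimitCORE= directive is the usual place for that.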

Comment by markus makela [ 2018-09-18 ]

An idea for 3 would be to have the MaxScale instances communicate with each other via the REST API so that they would form a cluster. A manually assigned priority number would allow the user to define the order of promotion, and would also serve as the basis on which conflict resolution could be built.
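
A minimal sketch of the priority idea, assuming each instance already knows which peers are reachable (how that is learned over the REST API is left out). The Node type, the "lower number is promoted first" convention and the name-based tie-break are all assumptions made for illustration, not anything MaxScale defines.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    name: str
    priority: int  # hypothetical: manually assigned, lower is promoted first
    alive: bool    # hypothetical: result of the latest health check


def elect_active(nodes: Iterable[Node]) -> Optional[Node]:
    """Pick the node that should run with passive=0.

    Every instance that sees the same set of alive nodes picks the same
    winner, because ties on priority are broken by name.
    """
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (n.priority, n.name))


if __name__ == "__main__":
    cluster = [
        Node("maxscale-a", priority=1, alive=False),  # old active, now down
        Node("maxscale-b", priority=2, alive=True),
        Node("maxscale-c", priority=3, alive=True),
    ]
    winner = elect_active(cluster)
    print(winner.name if winner else "no node available")
```

The deterministic tie-break is what gives conflict resolution something to stand on: two instances with the same view cannot both conclude that they should be active. It does not address instances with different views of the cluster, for example during a network partition.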

Comment by Johan Wikman [ 2018-09-20 ]

There are two problems here:

  1. Detect when MaxScale is hung and kill it.
  2. When the MaxScale in the active role has gone down, a MaxScale not in the active role should be made active.

The first one can probably be handled with systemd's watchdog functionality.
The second can be handled with keepalived.
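
To illustrate the first point: with systemd's watchdog, the unit sets WatchdogSec= and the service periodically sends WATCHDOG=1 to the socket named in the NOTIFY_SOCKET environment variable. If the pings stop, systemd terminates the service, by default with SIGABRT (signal 6). The sketch below shows that protocol in Python purely for illustration; it is not how MaxScale implements it, and the helper names are made up.

```python
import os
import socket
from typing import Optional


def sd_notify(state: str) -> bool:
    """Send a notification string such as "WATCHDOG=1" to the service manager.

    Returns False when NOTIFY_SOCKET is unset, i.e. when the process was not
    started by systemd with notifications enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_ping_interval() -> Optional[float]:
    """Return half of the configured watchdog timeout in seconds, if any.

    systemd exports the timeout in microseconds as WATCHDOG_USEC; pinging at
    half that interval is the commonly recommended safety margin.
    """
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

The ping should come from the code path whose liveness matters (for example the main worker loop), so that a hung worker stops the pings and triggers the kill.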
