[MDBF-415] Monitor SSH status for libvirt workers - Jira

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: N/A
Fix Version/s: N/A
Component/s: Buildbot
Labels:
- monitoring

Description

When the libvirt master starts, it creates an ssh connection to the worker machine for each defined worker. If that ssh connection drops for any reason, builds fail. The master does not handle this at all, so a master restart is needed.

Ways to reproduce:
1. Look into the running processes on hz-bbm1 and there should be several entries like

buildma+ 3576703 3576701  0 14:45 ?        00:00:00 ssh -p 65001 -l buildbot -T -e none -- 100.64.100.20 sh -c 'which virt-ssh-helper 1>/dev/null 2>&1; if test $? = 0; then     virt-ssh-helper 'qemu:///system'; else    if 'nc' -q 2>&1 | grep "requires an argument" >/dev/null 2>&1; then ARG=-q0;else ARG=;fi;'nc' $ARG -U /var/run/libvirt/libvirt-sock; fi'

2. Restart libvirt on the worker machine (hz-bbw5)
3. Repeat step 1: there should no longer be any active ssh connection
4. Restart the master

Ideas for monitoring:
faust, would it be possible to monitor whether there is at least one ssh connection similar to the one above?
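
A minimal sketch of such a check, assuming the tunnels can be recognised by an ssh client command line that mentions libvirt-sock, as in the process entry above. The function name count_libvirt_ssh is hypothetical; it reads a process listing on stdin so it can also be tried on saved output.

```shell
#!/bin/sh
# Count lines of a process listing (read from stdin) that look like
# libvirt ssh tunnels.
# Assumption: a tunnel is an "ssh" client whose command line mentions
# libvirt-sock. "sshd" and unrelated commands are not counted.
count_libvirt_ssh() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c -E '(^|[[:space:]])ssh .*libvirt-sock' || true
}

# Illustrative usage against the live process table:
#   ps -eo args | count_libvirt_ssh
```

A result of 0 on hz-bbm1 would correspond to step 3 of the reproduction, where no tunnel is left.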

Issue Links

is part of: MDBF-41 Milestone 5: Desirable fixes (Closed)

Activity

Faustin Lammler added a comment - 2022-05-19 10:18

Based on:

http://0pointer.de/blog/projects/watchdog.html
https://www.medo64.com/2019/01/systemd-watchdog-for-any-service/

I have implemented a watchdog service, see:

/srv/buildbot/master/master-libvirt/watchdog_libvirt.sh
/etc/systemd/system/buildbot-master-libvirt.service

It will restart the libvirt master automatically if the number of open ssh connections differs from the number defined in the master.cfg file.

I have tested it by restarting the libvirtd daemon on hz-bbw5 and it seems to work fine.

Warning: since the watchdog service compares the number of occurrences of "qemu+ssh" in the configuration with the number of open ssh connections, we cannot have commented-out "qemu+ssh" lines in the master.cfg file. I could filter comments in the grep command, but I prefer to use grep -c rather than grep | wc -l.
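
The comparison described in this comment can be sketched as follows. This is not the content of watchdog_libvirt.sh, only an illustration under stated assumptions: the function names are hypothetical, and the grep pattern is one possible way to keep grep -c while skipping lines where a '#' appears before "qemu+ssh".

```shell
#!/bin/sh
# Count configuration lines that mention qemu+ssh and are not commented out.
# The pattern only matches when no '#' precedes qemu+ssh on the line, so
# grep -c can be used without a separate filtering step.
# Limitation: a '#' inside a string literal before qemu+ssh hides the line.
count_expected_workers() {
    grep -c '^[^#]*qemu+ssh' "$1" || true
}

# Return 0 when the expected and actual connection counts match,
# 1 otherwise (the case where the watchdog would restart the master).
connections_ok() {
    expected=$1
    actual=$2
    [ "$expected" -eq "$actual" ]
}

# Illustrative usage (the path to master.cfg is an assumption):
#   expected=$(count_expected_workers /path/to/master.cfg)
#   connections_ok "$expected" "$actual" || echo "mismatch"
```

In a basic regular expression the '+' in qemu+ssh is a literal character, so it needs no escaping here.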

Faustin Lammler added a comment - 2022-05-19 10:20

TODO:

monitor when services are restarted (this is also needed for the master-web service);
find a way to monitor if workers are not available.

Faustin Lammler added a comment - 2022-06-16 07:36 - edited

This https://github.com/MariaDB/buildbot/pull/2 will fix it.

TODO becomes:

find a way to be alerted if workers are not available

Faustin Lammler added a comment - 2022-09-19 14:08

The watchdog systemd unit script will stop the buildbot libvirt master process if one of the builders is still not accessible after 3 restart attempts.

We receive an alert if the libvirt master process is down.

This can probably be improved in the future by restarting the buildbot libvirt master automatically when this occurs (or by making the libvirt workers more reliable).
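
The "stop after 3 restart attempts" policy can be expressed as a small decision function. This is only a sketch: the name next_action is hypothetical, and the actual unit may rely on systemd's own start-limit settings rather than on script logic like this.

```shell
#!/bin/sh
# Decide what to do after a failed health check.
# $1: restart attempts already made
# $2: maximum attempts allowed (3 in the comment above)
# Prints "restart" while attempts remain, "stop" once the limit is reached.
next_action() {
    attempts=$1
    max_attempts=$2
    if [ "$attempts" -lt "$max_attempts" ]; then
        echo restart
    else
        echo stop
    fi
}

# Illustrative usage:
#   action=$(next_action "$attempts" 3)
```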

People

Assignee: Faustin Lammler

Reporter: Vlad Bogolin

Votes: 0

Watchers: 2

Dates

Created: 2022-05-18 15:29

Updated: 2025-01-16 08:57

Resolved: 2022-09-19 14:08

Time Tracking

Estimated: Not Specified

MariaDB Foundation Development