[MDBF-415] Monitor SSH status for libvirt workers Created: 2022-05-18 Updated: 2022-09-19 Resolved: 2022-09-19 |
|
| Status: | Closed |
| Project: | MariaDB Foundation Development |
| Component/s: | Buildbot |
| Affects Version/s: | N/A |
| Fix Version/s: | N/A |
| Type: | Task | Priority: | Major |
| Reporter: | Vlad Bogolin | Assignee: | Faustin Lammler |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | buildbot, monitoring | ||
| Remaining Estimate: | 0d | ||
| Time Spent: | 3h | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Description |
|
When the libvirt master starts, it creates an SSH connection to the worker machine for each defined worker. If that SSH connection drops for any reason, builds fail. The master does not handle this at all, and a master restart is needed.

Ways to reproduce:
2. Libvirt restart on the worker machine (hz-bbw5)

Ideas for monitoring: |
| Comments |
| Comment by Faustin Lammler [ 2022-05-19 ] |
|
Based on:
I have implemented a watchdog service, see:
It restarts the libvirt master automatically if the number of open SSH connections differs from the number defined in the master.cfg file. I tested it by restarting the libvirtd daemon on hz-bbw5 and it seems to work perfectly fine. Warning: since the watchdog service compares the number of occurrences of "qemu+ssh" in the configuration with the number of SSH connections actually open, we cannot have commented-out "qemu+ssh" lines in the master.cfg file. I could filter comments in the grep command, but I prefer to use grep -c rather than grep | wc -l. |
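The comparison described in this comment can be sketched roughly as follows. This is only an illustration, not the actual watchdog script: the function names, the `skip_comments` option, and the sample configuration lines are hypothetical, and the number of open SSH connections is passed in as a plain integer because the ticket does not say how it is measured.

```python
def count_qemu_ssh_lines(config_text: str, skip_comments: bool = False) -> int:
    """Count lines containing "qemu+ssh", similar to `grep -c qemu+ssh`."""
    count = 0
    for line in config_text.splitlines():
        # Optional filter (hypothetical): ignore lines that are commented out.
        if skip_comments and line.lstrip().startswith("#"):
            continue
        if "qemu+ssh" in line:
            count += 1
    return count


def needs_restart(config_text: str, open_connections: int) -> bool:
    """Return True when the expected and actual connection counts differ."""
    return count_qemu_ssh_lines(config_text) != open_connections


# Hypothetical configuration excerpt with one commented-out entry.
sample_cfg = """\
uri_a = "qemu+ssh://user@worker-host/system"
# uri_b = "qemu+ssh://user@worker-host/system"
uri_c = "qemu+ssh://user@worker-host/system"
"""

if __name__ == "__main__":
    # Two connections are really open, but the commented line is counted too,
    # so an unfiltered comparison would ask for a restart.
    print(needs_restart(sample_cfg, open_connections=2))
```

With the unfiltered count, the commented-out line is included in the expected total, which is why commented "qemu+ssh" lines would trigger a needless restart.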
| Comment by Faustin Lammler [ 2022-05-19 ] |
|
TODO:
|
| Comment by Faustin Lammler [ 2022-06-16 ] |
|
This https://github.com/MariaDB/buildbot/pull/2 will fix it. TODO becomes:
|
| Comment by Faustin Lammler [ 2022-09-19 ] |
|
The watchdog systemd unit script stops the buildbot libvirt master process if one of the builders is still not accessible after 3 restart attempts. We receive an alert if the libvirt master process is down. This can probably be improved in the future by automatically restarting the buildbot libvirt master when this occurs (or by making the libvirt workers more reliable). |
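The "give up after 3 restart attempts" behaviour can be pictured with a minimal sketch like the one below. It is an assumption about the control flow only, not the real systemd unit: `supervise`, `check_ok`, and `restart` are hypothetical names, and the health check and restart action are injected as callables so the logic can be tried without a real master.

```python
from typing import Callable


def supervise(
    check_ok: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
) -> str:
    """Restart up to max_attempts times, then report that the master should stop."""
    if check_ok():
        return "healthy"
    for _ in range(max_attempts):
        restart()
        if check_ok():
            return "recovered"
    # All attempts failed: the caller would stop the master here,
    # which is the state that raises the alert mentioned above.
    return "stopped"


if __name__ == "__main__":
    attempts = []
    # A check that never succeeds exhausts all restart attempts.
    result = supervise(lambda: False, lambda: attempts.append("restart"))
    print(result, len(attempts))
```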