Details
- Task
- Status: Closed
- Major
- Resolution: Fixed
- N/A
Description
When the libvirt master starts, it creates an ssh connection to the worker machine for each defined worker. If that ssh connection drops for any reason, builds will fail. The master does not handle this at all, and a master restart is needed.
Ways to reproduce:
1. Look at the running processes on hz-bbm1; there should be several entries like:
buildma+ 3576703 3576701 0 14:45 ? 00:00:00 ssh -p 65001 -l buildbot -T -e none -- 100.64.100.20 sh -c 'which virt-ssh-helper 1>/dev/null 2>&1; if test $? = 0; then virt-ssh-helper 'qemu:///system'; else if 'nc' -q 2>&1 | grep "requires an argument" >/dev/null 2>&1; then ARG=-q0;else ARG=;fi;'nc' $ARG -U /var/run/libvirt/libvirt-sock; fi'
2. Restart libvirt on the worker machine (hz-bbw5).
3. Go back to step 1; there should now be no active ssh connection.
4. Restart the master.
Ideas for monitoring:
faust, would it be possible to monitor whether there is at least one ssh connection similar to the one above?
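As a rough illustration of that check, the sketch below counts the libvirt ssh tunnels found in a process listing read from stdin. The function name and the match pattern are assumptions derived from the process entry quoted in step 1, not an existing monitoring script.

```shell
#!/bin/sh
# Sketch only: count libvirt ssh tunnels in a process listing read from stdin.
# count_libvirt_ssh is a hypothetical name; the pattern is an assumption based
# on the process entry quoted in step 1 (an ssh command ending in libvirt-sock).
count_libvirt_ssh() {
    # "[s]sh" keeps this grep from matching its own command line.
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c '[s]sh .*libvirt-sock' || true
}
```

On the master it could be fed with something like `ps -eo args | count_libvirt_ssh`; a result of 0 would indicate that no tunnel is open.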
Issue Links
- is part of MDBF-41 Milestone 5: Desirable fixes (Closed)
Based on:
I have implemented a watchdog service, see:
It will restart the libvirt master automatically if the number of ssh connections differs from the number defined in the master.cfg file.
I tested it by restarting the libvirtd daemon on hz-bbw5, and it seems to work fine.
Warning: since the watchdog service compares the number of occurrences of "qemu+ssh" in the configuration with the number of open ssh connections, we cannot have commented-out "qemu+ssh" lines in the master.cfg file. I could filter out comments in the grep command, but I prefer to use grep -c rather than grep | wc -l.
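One possible way to keep a single grep -c while still ignoring commented-out lines is to require that no "#" precedes the match. The sketch below is not the actual watchdog: the function names are hypothetical, and it assumes comments start with "#" and that "#" never appears inside a string before "qemu+ssh" on the same line.

```shell
#!/bin/sh
# Sketch only; count_expected and needs_restart are hypothetical names,
# not part of the real watchdog service.

# Count lines where "qemu+ssh" appears before any "#".
# In a basic regular expression "+" is a literal character.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_expected() {
    grep -c '^[^#]*qemu+ssh' "$1" || true
}

# Succeed (exit 0) when the expected and actual counts differ,
# i.e. when a master restart would be needed.
needs_restart() {
    [ "$1" -ne "$2" ]
}
```

Like plain grep -c, this counts matching lines rather than occurrences, so it assumes at most one "qemu+ssh" URI per line. A trailing comment after the URI does not affect the count.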