I have multiple DCE VMs running simultaneously on separate clusters. I have had them for five years or more without much trouble. Recently a cluster has been failing, and each time we have lost a DCE server for a period (obviously all the monitoring data for that period is lost or never collected too). The first time it happened I asked the monitoring team to implement a simple ping check to see whether the server was still operating. I was happy at this point.
But I had barely washed the sand out of my toes from holiday when I came back to work to learn that the cluster had failed again, and the ping check had not caught it: the VM was still responding to pings even though the service had crashed.
My question, then, is: how can I get an alert telling me that a DCE VM is no longer online or working? (There is something ironic about my monitoring service not being monitored itself.)
Here is the guff from the virtualisation team:
Many operating systems will still respond to a ping even if the service has crashed. This is a Linux feature where it makes the file system read-only but keeps the network stack up to allow you to connect to it.
This is where a service check is required: something that tests for a running service, or for a port that's only open when the service is running.
A port scan of the OS might be able to pick up what is open, and that could be requested as a check.
The other alternative is to ask the vendor whether there is something that can alert from within the OS to advise that there has been an issue.
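For illustration, the kind of service check the virtualisation team describes could be sketched in Python roughly as below. This is only a minimal sketch under assumptions: the function name `port_is_open` is made up for this example, and which port is "only open when the service is running" on a DCE server is not stated anywhere in this thread, so it would need to be confirmed (for example with the port scan suggested above) before relying on it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Unlike a ping, this only succeeds when something is actually
    listening on the port, so a crashed service shows up as False
    even if the OS itself still answers ICMP.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable hosts and DNS failures.
        return False
```

A monitoring job could call something like `port_is_open("dce01.example.internal", 443)` on a schedule and raise an alert when it returns `False`. Both the host name and the port in that call are placeholders, not values taken from DCE documentation.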
Thanks in advance
Hi @NF-London ,
You could try polling the server using the web API. Information on how that works is here:
This is merely another way to get data from DCE. Even if the server answers a polling request, I still cannot say with 100% certainty that all functions of the system are operational, because multiple processes must be running for the entire system to work. Still, it's one of the few ways we can verify that data is still being received and is available.
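As a rough sketch of what such a poll might look like from an external monitoring host, here is a generic HTTP check in Python. It is not the DCE web API itself: the function name `http_check` is hypothetical, the URL has to be whatever endpoint the API documentation specifies, and authentication is deliberately left out. As written, a server that demands credentials or presents a certificate the client does not trust would be reported as failing.

```python
import http.client
import urllib.error
import urllib.request


def http_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with an HTTP 2xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # URLError also covers HTTPError (4xx/5xx responses); OSError covers
        # timeouts and refused connections; ValueError covers malformed URLs.
        return False
```

In line with the caveat above, a `True` result only shows that the web-facing part of the server answered. It does not prove that every process behind it is healthy.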
We do have an enhancement request already in the system for what you're asking: basically, a health check of the system. I am adding your post to that request, as the more requests we get on any issue, the more likely it is that engineering and product management will look into adding or fixing features such as this.