Coming in to manage a virtual environment that’s already up and running, you assume it’s set up correctly for the most part. As time goes on, you may pick up a few things here and there to improve. But what caught me out recently was the business’s interpretation of VMware’s HA.
Know your environment, understand the options
During a switch failure that caused network isolation of a host, the business wanted to know why their VMs weren’t restarted on the remaining hosts. HA was working as expected according to VMware’s Host Isolation Response setting, but the business didn’t see it the same way.
The build of the cluster was outsourced several years ago, and the host isolation response setting was documented as being set to “Leave powered on”, but there was no further explanation for the business as to what that meant.
Read the documentation
The options are explained fully in the VMware vSphere documentation, and VMware KB 1030320 says:
Leave powered on – When a network isolation occurs on the host, the state of the virtual machines remain unchanged and the virtual machines on the isolated host continue to run even if the host can no longer communicate with other hosts in the cluster. This setting also reduces the chances of a false positive. A false positive in this case is an isolated heartbeat network, but a non-isolated virtual machine network and a non-isolated iSCSI/NFS network. Should the host become unresponsive or fail and can no longer access/run the virtual machines, the virtual machines will be registered and powered on by another running host in the cluster. By default, the isolated host leaves its virtual machines powered on.
Power off – When a network isolation occurs, all virtual machines are powered off. It is a hard stop. A power off response is initiated on the fourteenth second and a restart is initiated on the fifteenth second.
Shut down – When a network isolation occurs, all virtual machines running on that host are shut down via VMware Tools. If this is not successful within 5 minutes, a power off response type is executed.
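To make the difference between the three responses easier to explain to the business, here is a minimal Python sketch that models the outcome for a single VM on an isolated host. It is only an illustration of the KB text quoted above, not VMware code: the names IsolationResponse, vm_outcome and guest_shutdown_seconds are made up for this example, and the assumption that HA restarts the VM elsewhere after a “Shut down” response (as it does after “Power off”) is mine.

```python
from enum import Enum


class IsolationResponse(Enum):
    """The three host isolation responses described in the KB text."""
    LEAVE_POWERED_ON = "Leave powered on"
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"


# Per the KB text: a guest shutdown that is not successful within
# 5 minutes falls back to a power off.
SHUTDOWN_TIMEOUT_SECONDS = 5 * 60


def vm_outcome(response, guest_shutdown_seconds=None):
    """Return (what happens on the isolated host, restarted elsewhere?).

    guest_shutdown_seconds is a hypothetical input for this sketch: how
    long the guest takes to shut down via VMware Tools, or None if the
    shutdown never completes (for example, VMware Tools is not running).
    """
    if response is IsolationResponse.LEAVE_POWERED_ON:
        # VM state is unchanged, so there is nothing for HA to restart.
        return ("keeps running", False)
    if response is IsolationResponse.POWER_OFF:
        # Hard stop, followed by a restart on another host.
        return ("hard power off", True)
    if response is IsolationResponse.SHUT_DOWN:
        # Assumption: a restart elsewhere follows, as with a power off.
        if (guest_shutdown_seconds is not None
                and guest_shutdown_seconds < SHUTDOWN_TIMEOUT_SECONDS):
            return ("clean guest shutdown", True)
        return ("hard power off after timeout", True)
    raise ValueError(f"unknown isolation response: {response!r}")


if __name__ == "__main__":
    for response in IsolationResponse:
        action, restarted = vm_outcome(response, guest_shutdown_seconds=60)
        print(f"{response.value}: {action}, restarted elsewhere: {restarted}")
```

With “Leave powered on”, the second value is always False, which is exactly what the business saw during the switch failure.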
Duncan Epping has an easy-to-read matrix on the VMware vSphere Blog.
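In this environment of FC storage and a single top-of-rack switch for management and VM traffic, it would probably have been better to choose “Shut down”: when that switch fails, the VMs lose their network along with the host, while the FC storage stays reachable, so leaving them powered on just keeps unreachable VMs running. A week later the CR was approved, and 10 seconds after that it was set to “Shut down”.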
The network team now get found out when trying to sneak in a 5-10 second network outage for maintenance.