Wednesday, 4 June 2014

vSphere Client Storage Views tab not showing any information

The storage views tab in the vSphere client disappeared, and vCenter System Services displayed some of the following errors:
unable to retrieve health data from https://localhost:443/vsm/health.xml
unable to retrieve health data from https://localhost:443/eam/eamService-web/health.xml
unable to retrieve health data from https://localhost:443/SMS/health.xml

VMware KB article 2016177 (vCenter Server Health status reports the error: Error retrieving health from url (2016177)) had the fix. Both the issue and the KB article apply only to vCenter 5.0.x.

As soon as OfficeScan had finished uninstalling, the assortment of errors in System Services went away and the Storage Views tab worked. There was no need to restart any services or reboot.

This was only a test to see if the KB article would fix it, so I wasn't about to leave vCenter with no AV. I re-installed OfficeScan, and the errors didn't re-appear.

So the fix was to uninstall/re-install OfficeScan.

TrendMicro support suggested disabling the OfficeScan client services one at a time to identify which one was causing the conflict.

TrendMicro support:
Kindly configure the Privileges and Other Settings for the VCenter server in the OfficeScan web console page under Networked Computer>Client Management. Then under Other tab, please uncheck Protect client services.
Then stop the services for the OfficeScan client in the Vcenter server. Then start one service at a time to isolate this.
• OfficeScan NT Listener (TmListen.exe)
• OfficeScan NT RealTime Scan (NTRtScan.exe)
• OfficeScan NT Proxy Service (TmProxy.exe)
• OfficeScan NT Firewall (TmPfw.exe); if the firewall was enabled during installation
• Trend Micro Unauthorized Change Prevention Service (TMBMSRV.exe)
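
If the problem comes back, the stop-everything-then-start-one-at-a-time procedure could be scripted roughly like this. It's only a sketch I haven't run against a real OfficeScan client: it assumes the service display names match the list above, that "Protect client services" has already been unchecked, and that it runs from an elevated prompt on the vCenter server. The function names are my own.

```python
import subprocess

# Display names as given by TrendMicro support above (assumed to match
# what Windows shows in services.msc on the vCenter server).
OFFICESCAN_SERVICES = [
    "OfficeScan NT Listener",
    "OfficeScan NT RealTime Scan",
    "OfficeScan NT Proxy Service",
    "OfficeScan NT Firewall",  # only present if the firewall was installed
    "Trend Micro Unauthorized Change Prevention Service",
]


def isolation_plan(services):
    """Build the 'net' commands: stop every service, then start them one by one."""
    plan = [["net", "stop", name] for name in services]
    plan += [["net", "start", name] for name in services]
    return plan


def run_plan(plan, pause=input):
    """Run each command; after every start, wait so System Services can be checked."""
    for cmd in plan:
        # check=False: a service that is missing or already stopped is not fatal here.
        subprocess.run(cmd, check=False)
        if cmd[1] == "start":
            pause(f"Started '{cmd[2]}'. Check vCenter System Services, then press Enter...")
```

Calling run_plan(isolation_plan(OFFICESCAN_SERVICES)) walks through the list; the first service whose start brings the health errors back is the likely culprit.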

They got back to me pretty quickly, but I'd already uninstalled/re-installed OfficeScan and now I'm unable to replicate the problem. Still, it's something to try if it happens again.

Sunday, 1 June 2014

Is Host Isolation Response set right for you?

Coming in to manage a virtual environment that's already up and running, you guess it's set up correctly for the most part. As time goes on, you may pick up a few things here and there to improve it. But what got me recently was the business's interpretation of VMware's HA.

Know your environment, understand the options

During a switch failure that caused network isolation of a host, the business wanted to know why their VMs weren't restarted on the remaining hosts. HA was working exactly as expected for the configured Host Isolation Response setting, but the business didn't see it the same way.

The build of the cluster was outsourced several years ago, and the host isolation response setting was documented as being set to "Leave powered on", but there was no further explanation for the business as to what that meant.

Read the documentation

The options are explained fully in the VMware vSphere documentation, and VMware KB 1030320 says:
Leave powered on – When a network isolation occurs on the host, the state of the virtual machines remain unchanged and the virtual machines on the isolated host continue to run even if the host can no longer communicate with other hosts in the cluster. This setting also reduces the chances of a false positive. A false positive in this case is an isolated heartbeat network, but a non-isolated virtual machine network and a non-isolated iSCSI/NFS network. Should the host become unresponsive or fail and can no longer access/run the virtual machines, the virtual machines will be registered and powered on by another running host in the cluster. By default, the isolated host leaves its virtual machines powered on. 
Power off – When a network isolation occurs, all virtual machines are powered off. It is a hard stop. A power off response is initiated on the fourteenth second and a restart is initiated on the fifteenth second.
Shut down – When a network isolation occurs, all virtual machines running on that host are shut down via VMware Tools. If this is not successful within 5 minutes, a power off response type is executed.
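
To make the difference easier to explain to the business, the three options can be boiled down to a small lookup. This is just my simplified summary of the KB text above, not anything VMware ships; the enum and function names are made up for illustration.

```python
from enum import Enum


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "Leave powered on"
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"


def isolation_steps(response, guest_shutdown_ok=True):
    """Summarise what happens to VMs on a host that HA declares isolated."""
    if response is IsolationResponse.LEAVE_POWERED_ON:
        # Nothing is restarted elsewhere unless the host itself later fails.
        return ["VMs keep running on the isolated host"]
    if response is IsolationResponse.POWER_OFF:
        return ["hard power off", "HA restart on another host"]
    if response is IsolationResponse.SHUT_DOWN:
        steps = ["guest shutdown via VMware Tools"]
        if not guest_shutdown_ok:
            # Fallback described in the KB when the guest does not shut down in time.
            steps.append("hard power off after the 5-minute timeout")
        steps.append("HA restart on another host")
        return steps
    raise ValueError(f"unknown isolation response: {response!r}")
```

With "Leave powered on" the list has a single entry, which is exactly what the business saw during the switch failure: the VMs stayed put, unreachable, on the isolated host.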

Duncan Epping has an easy-to-read matrix on the VMware vSphere Blog.

In this environment, with FC storage and a single top-of-rack switch for management and VM traffic, it would probably have been better to choose "Shut down". A week later the CR was approved, and 10 seconds after that it was set to "Shut down".

The network team now get found out when trying to sneak in a 5-10 second network outage for maintenance.