Calculate host failure requirements
When working out host failure requirements for an ESXi cluster, you need to account for the combined resource utilisation of the virtual machines that will run in it, so that enough unused resource remains to support those virtual machines if a host fails. For example, in a cluster of 5 ESXi hosts with identical hardware specifications and an evenly spread load, the resource utilisation of each host should not exceed 75 – 80% if the cluster is to survive a single host failure (the four surviving hosts provide 80% of the original capacity). Admission Control is the cluster feature that helps you ensure there is enough spare resource in the cluster, should a host (or multiple hosts) fail.
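The arithmetic behind that figure can be sketched as follows. This is a minimal illustration, assuming identical hosts and an evenly spread load; the function name is my own, and it ignores virtualisation overhead, which is one reason to aim a little below the theoretical ceiling.

```python
def max_host_utilisation(total_hosts: int, host_failures: int = 1) -> float:
    """Return the highest average per-host utilisation (0.0 to 1.0) at which
    the surviving hosts can still carry the whole cluster load."""
    if total_hosts <= 0 or not 0 <= host_failures < total_hosts:
        raise ValueError("need at least one surviving host")
    # The load of all hosts must fit on the hosts that remain.
    return (total_hosts - host_failures) / total_hosts


print(max_host_utilisation(5, 1))  # 0.8
```

With 5 hosts and one tolerated failure the ceiling is 80%, which is where the 75 – 80% guideline comes from.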
Configure customized isolation response settings
It’s possible to override the cluster’s isolation response setting on a per virtual machine basis:
In the example above, the cluster setting is set to ‘Leave powered on’; however, I have set a number of virtual machines to power off.
Configure HA redundancy
Multiple physical NICs (preferably attached to different physical hardware).
Datastore heartbeating is used when the cluster master is no longer exchanging network heartbeats with a slave host. If the slave has also stopped sending datastore heartbeats it is deemed to have suffered a failure, and the virtual machines will be restarted on other hosts in the cluster.
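The decision described above can be sketched as a small truth table. This is a simplified model, not HA's actual implementation: the names are hypothetical, and the real agent performs additional checks (such as pinging the host's management address) before declaring a failure.

```python
from enum import Enum


class HostState(Enum):
    HEALTHY = "healthy"
    ISOLATED_OR_PARTITIONED = "isolated or partitioned"
    FAILED = "failed"


def classify_slave(network_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """Classify a slave host from the master's point of view."""
    if network_heartbeat:
        # Network heartbeats are arriving, so datastore heartbeats are not consulted.
        return HostState.HEALTHY
    if datastore_heartbeat:
        # The host is still alive, but unreachable over the management network.
        return HostState.ISOLATED_OR_PARTITIONED
    # Neither kind of heartbeat: the host is deemed failed and its VMs are restarted.
    return HostState.FAILED


print(classify_slave(False, False).value)  # failed
```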
Datastore heartbeating is configured in the HA cluster settings:
vCenter will select the datastores to be used for datastore heartbeating. As shown above, you can manually override this and select preferred datastores. In addition to the settings here, you can use the ‘das.heartbeatDsPerHost’ advanced setting, which lets you configure the number of datastores each host will use for heartbeating. The default is 2 datastores, though it can be increased to 5.
Datastore heartbeating uses the .vSphere-HA directory, located in the root of each datastore.
A network partition describes the situation where a subset of an HA cluster’s hosts cannot communicate with the other hosts in the cluster over the management network. A partitioned cluster is unable to protect virtual machines effectively: it can only protect virtual machines that are running on hosts in the same network partition as the cluster master. The master host must also be able to communicate with vCenter. Network partitions should be corrected as soon as possible.
Configure HA related alarms and monitor an HA cluster
You can monitor your cluster for HA related issues by setting up alarms. There are a number of default alarms, including these cluster alarms:
- Cannot find master – Default alarm to alert when vCenter Server has been unable to connect to a vSphere HA master agent for a prolonged period
- Insufficient failover resources – Default alarm to alert when there are insufficient cluster resources for vSphere HA to guarantee failover
- Failover in progress – Default alarm to alert when vSphere HA is in the process of failing over virtual machines
There are also these virtual machine alarms:
- VM monitoring error – Default alarm to alert when vSphere HA failed to reset a virtual machine
- VM monitoring action – Default alarm to alert when vSphere HA reset a virtual machine
- Failover failed – Default alarm to alert when vSphere HA failed to failover a virtual machine
And there is the following host related alarm:
- HA status – Default alarm to monitor health of a host as reported by vSphere HA
To configure these default alarms using the vSphere client, highlight the vCenter server object, then click on the Alarms tab:
Create a custom slot size configuration
Configuring das.slotCpuInMHz and das.slotMemInMB in the HA advanced settings allows you to set a custom upper limit for the CPU and memory slot sizes, while das.vmMemoryMinMB and das.vmCpuMinMHz allow you to set the minimum slot size:
When the advanced runtime info for the cluster is viewed, these new settings can be seen (after reconfiguring HA):
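To illustrate how these settings interact, here is a rough sketch of how a slot size for one resource could be derived. It is a simplification under stated assumptions: the function and its parameters are hypothetical, memory overhead is ignored, `minimum` stands in for das.vmMemoryMinMB or das.vmCpuMinMHz, and `upper_limit` stands in for das.slotMemInMB or das.slotCpuInMHz.

```python
from typing import Iterable, Optional


def slot_size(reservations: Iterable[int], minimum: int,
              upper_limit: Optional[int] = None) -> int:
    """Return a slot size: the largest reservation, floored at `minimum`
    and optionally capped at `upper_limit`."""
    # Every VM counts as at least `minimum`, even if it has no reservation.
    largest = max([minimum, *reservations])
    if upper_limit is not None:
        # The custom upper limit stops one large reservation inflating the slot.
        largest = min(largest, upper_limit)
    return largest


mem_reservations_mb = [0, 512, 4096]  # made-up example values
print(slot_size(mem_reservations_mb, minimum=0))                    # 4096
print(slot_size(mem_reservations_mb, minimum=0, upper_limit=1024))  # 1024
```

Without the cap, the single 4096 MB reservation dictates the slot size for every virtual machine. With the cap in place the slot shrinks to 1024 MB, and a virtual machine whose reservation exceeds the slot size consumes more than one slot.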
Understand interactions between DRS and HA
When both HA and DRS are enabled, the result will be a more evenly balanced cluster in the event of a failover than there would be with HA alone. vSphere HA’s job is to get virtual machines back up and running as soon as possible following a host failure. The result of this is that certain hosts may become more heavily loaded than others. HA will use the virtual machines’ CPU and memory reservations to determine whether a particular host will be able to power on a virtual machine, but beyond that it is not interested in attempting to evenly balance the load across the cluster.
Analyze vSphere environment to determine appropriate HA admission control policy
One of the issues to consider is resource fragmentation. A cluster suffers from resource fragmentation when it has enough aggregate resource to meet its failover requirements, but that resource is spread thinly across the hosts in the cluster. This can make it impossible to power on VMs with larger resource requirements, as no single host may have enough available resource to satisfy them. This can be a problem when ‘Percentage of Cluster Resources’ is used as the admission control policy, though DRS will help by moving virtual machines around to free up resource. Resource fragmentation isn’t an issue if you are using the ‘Host failures cluster tolerates’ admission control policy; the downside is that you may end up reserving too much resource on your hosts, resulting in poor consolidation ratios. The ‘percentage’ policy is more flexible and will result in better consolidation ratios when a range of virtual machine reservations are in use.
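Fragmentation is easy to see with numbers. The sketch below uses a hypothetical helper and made-up figures; it only checks memory and does not model DRS moving virtual machines to defragment the cluster.

```python
from typing import Sequence


def is_fragmented(vm_reservation_mb: int, free_mb_per_host: Sequence[int]) -> bool:
    """True when the cluster has enough free memory in total for the VM,
    but no single host can satisfy the reservation on its own."""
    enough_in_total = sum(free_mb_per_host) >= vm_reservation_mb
    fits_on_one_host = any(free >= vm_reservation_mb for free in free_mb_per_host)
    return enough_in_total and not fits_on_one_host


# 8 GB free in aggregate, but only 2 GB on any one host.
print(is_fragmented(4096, [2048, 2048, 2048, 2048]))  # True
# Same aggregate, but one host has room for the whole VM.
print(is_fragmented(4096, [1024, 1024, 1024, 5120]))  # False
```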
Another thing to be aware of when deciding upon which admission control policy to use is the hardware specifications of the hosts in the cluster. If the hosts have different amounts of CPU or physical memory resource then the ‘percentage’ policy will be the best choice.
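One way to choose a percentage for hosts of different sizes is to reserve enough to cover the loss of the largest host(s). The sketch below shows that approach as an illustration only; the function name and the capacities are my own assumptions, and in practice you would work this out separately for CPU and for memory.

```python
import math
from typing import Sequence


def failover_percentage(host_capacities: Sequence[int], host_failures: int = 1) -> int:
    """Percentage of cluster resource to reserve so that the cluster can
    lose its `host_failures` largest hosts, rounded up to a whole percent."""
    if not 0 < host_failures < len(host_capacities):
        raise ValueError("host_failures must leave at least one surviving host")
    largest_first = sorted(host_capacities, reverse=True)
    lost = sum(largest_first[:host_failures])  # worst case: the biggest hosts fail
    return math.ceil(100 * lost / sum(host_capacities))


# Memory in GB for a mixed cluster (made-up values).
print(failover_percentage([64, 64, 128, 128]))  # 34
```

Sizing for the largest host is the conservative choice: the loss of a smaller host is then covered as well.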
Analyze performance metrics to calculate host failure requirements
It’s important that the virtual machines in the cluster are sized correctly. Virtual machine reservations are an important concept to be aware of here. If reservations are used, it’s important that they are set to appropriate values. For example, if a virtual machine is over-allocated in terms of memory and a reservation is in place, then resource will be wasted twice: that memory will be used on the host where the virtual machine is running, and it will also be reserved in the cluster in case of failover. Be aware that if you are using the slot-based (‘Host failures cluster tolerates’) admission control policy, then large reservations will result in much lower virtual machine consolidation ratios, unless a custom slot size is applied.
With a good understanding of how much resource your virtual machines need, and how much they will consume, you can then determine the optimum cluster size to accommodate those workloads, whilst providing enough spare resource to handle any host(s) failure.
Performance data should take into account any busy periods (like month end) or backup windows etc.
Analyze Virtual Machine workload to determine optimum slot size
The same tools as above can be used to determine an optimum slot size, which would allow a custom slot size to be used. However, I don’t think this is best practice; HA should be allowed to set slot sizes dynamically.
Analyze HA cluster capacity to determine optimum cluster size
Trying to right-size an HA cluster can be challenging, especially in a fluid environment. Above all, it will come down to availability requirements.
- What VMs do you need available even when a failover occurs?
- What is their resource utilization?
- How many hosts are currently in your cluster?
- Does this meet your availability requirements?
- How do your availability requirements match up with scaling up within the cluster, given the number of hosts it contains? A better way of asking the question: how many more VMs can I run with my current cluster resources while still maintaining the required resource availability?
- What is your current cluster utilization and availability, and how does that match up against your availability requirements?
- What admission control policy are you using?
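The "how many more VMs" question from the list above can be roughed out as follows. This is a back-of-the-envelope sketch, assuming identical hosts, a single resource (memory) and a typical VM size. The function, its parameters and the figures are all hypothetical, and the calculation ignores slot sizes and fragmentation.

```python
def additional_vms(host_capacity_mb: int, hosts: int, host_failures: int,
                   used_mb: int, typical_vm_mb: int) -> int:
    """Estimate how many more typical VMs fit while keeping enough spare
    capacity to absorb `host_failures` host failures."""
    # Capacity that remains once the tolerated number of hosts has failed.
    usable_mb = host_capacity_mb * (hosts - host_failures)
    headroom_mb = usable_mb - used_mb
    return max(0, headroom_mb // typical_vm_mb)


# 5 hosts with 128 GB each, tolerate 1 failure, 384 GB in use, 8 GB per VM.
print(additional_vms(131072, 5, 1, 393216, 8192))  # 16
```

Feeding this with utilisation figures taken from busy periods rather than averages gives a more conservative answer.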