A four-node cluster with a single VM (with a single VMDK smaller than 255GB), deployed with a Storage Policy of “Failures to Tolerate = 1” and “Disks to Stripe = 1”, would look like this:
In this example you can see there are two mirror copies of the VMDK (these are identical, and both serve reads and writes at any given moment) and a witness. The witness will always be placed on a different host from the two RAID 1 components so that it can act as the tiebreaker for quorum in the event of a host isolation or split brain.
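The quorum role of the witness can be illustrated with a small sketch. It assumes a simple majority rule over the three components (two mirrors and a witness) and additionally requires that at least one data mirror is reachable. The host names and the `is_accessible` helper are hypothetical and are not part of any VSAN API.

```python
# Hypothetical layout for FTT=1: two mirror components and one witness,
# each placed on a different host.
LAYOUT = {
    "esxi-01": "mirror",
    "esxi-02": "mirror",
    "esxi-03": "witness",
}


def is_accessible(layout, reachable_hosts):
    """Return True if the object keeps quorum on the reachable hosts.

    Assumed rule: more than half of all components must be reachable,
    and at least one of them must be a data mirror.
    """
    reachable = [kind for host, kind in layout.items() if host in reachable_hosts]
    has_majority = len(reachable) > len(layout) / 2
    has_data = "mirror" in reachable
    return has_majority and has_data


# Host 1 is isolated: the other mirror plus the witness still form a majority.
print(is_accessible(LAYOUT, {"esxi-02", "esxi-03"}))  # True
# The isolated host sees one component out of three, so it has no quorum.
print(is_accessible(LAYOUT, {"esxi-01"}))  # False
```

Without the witness, each side of a split brain would see exactly one of two components and neither could claim a majority.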
We can also see that the VM is running on a different ESXi host from its underlying disks.
When a VM writes, the IO is mirrored by VSAN and will not be acknowledged back to the VM until all mirror components have completed the write. In the example above, this means the acknowledgements from both ESXi Host 1 and ESXi Host 2 must be received before the write is acknowledged to the VM.
All writes go to, and are acknowledged by, the SSD (the Write Buffer). VSAN will then de-stage the data to your magnetic disks without the Virtual Machine having any part in the de-stage process. This allows VSAN to combat the “IO-Blender” effect that other SAN devices suffer from when hosting virtual workloads.
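The acknowledgement rule can be sketched as follows. This is a simplified model, not VSAN code: `MirrorComponent` and `mirrored_write` are hypothetical names, and the write is modelled as a synchronous call that returns True on success.

```python
class MirrorComponent:
    """A hypothetical stand-in for one RAID 1 component on a host."""

    def __init__(self, host):
        self.host = host
        self.blocks = {}

    def write(self, offset, data):
        # Store the data and report success back to the caller.
        self.blocks[offset] = data
        return True


def mirrored_write(components, offset, data):
    """Send the write to every mirror; acknowledge only if all of them confirm."""
    acks = [component.write(offset, data) for component in components]
    return all(acks)


mirrors = [MirrorComponent("esxi-01"), MirrorComponent("esxi-02")]
acknowledged = mirrored_write(mirrors, offset=0, data=b"payload")
print(acknowledged)  # True, because both hosts confirmed the write
```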
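A minimal sketch of the write buffer idea is shown below. The `DiskGroup` class is hypothetical, and ordering the de-stage pass by offset is an assumption made here to show how random incoming writes could become an ordered pass to the magnetic disks; the text above does not describe how VSAN actually orders de-staging.

```python
class DiskGroup:
    """Hypothetical disk group: one SSD write buffer in front of magnetic disks."""

    def __init__(self):
        self.write_buffer = {}  # offset -> data held on the SSD
        self.magnetic = {}      # offset -> data on the magnetic disks

    def write(self, offset, data):
        # The write lands on the SSD and is acknowledged immediately.
        self.write_buffer[offset] = data
        return True

    def destage(self):
        # Background step, independent of the VM. Sorting by offset is an
        # assumption used here to turn random writes into an ordered pass.
        for offset in sorted(self.write_buffer):
            self.magnetic[offset] = self.write_buffer[offset]
        self.write_buffer.clear()


group = DiskGroup()
for offset in (900, 10, 450):  # random-looking write pattern from several VMs
    group.write(offset, b"data")
group.destage()
print(list(group.magnetic))  # [10, 450, 900]
```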
VSAN Disk Failure
There can be two types of disk failure (SSD and magnetic), and each has a slightly different impact on the VSAN cluster. If an SSD fails, the disk group it fronts becomes unusable and all components on all magnetic disks within the disk group are marked as “degraded”.
If a magnetic disk fails, the disk group will continue to function; however, all components on the failed magnetic disk are marked as “degraded”.
When VSAN detects a disk failure it will immediately create a new mirror copy or witness on a different ESXi host or different disk group (subject to there being sufficient resources to store this new copy). If there are insufficient resources to create that mirror copy, VSAN will wait until resources are added. Once you have added a new disk, or even a host, the recovery will begin. VSAN will not alert you to this issue, and the VMs will be at risk (Failures to Tolerate will effectively be 0 until the components are rebuilt).
It should be noted that if the component cannot be rebuilt, the VM would also be impacted from a performance point of view, as it will have fewer components to read from, i.e. only 1 disk component instead of the 2 it had before the failure.
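The difference in blast radius between the two failure types can be shown with a short sketch. The disk group layout, device names and the `degraded_components` helper are all hypothetical.

```python
# Hypothetical disk group: one SSD fronting two magnetic disks.
DISK_GROUP = {
    "ssd": "ssd-0",
    "magnetic": {
        "hdd-0": ["vm1-mirror", "vm2-witness"],
        "hdd-1": ["vm3-mirror"],
    },
}


def degraded_components(disk_group, failed_device):
    """Return the components marked degraded when failed_device fails."""
    if failed_device == disk_group["ssd"]:
        # SSD failure: the whole disk group is unusable.
        return sorted(
            component
            for components in disk_group["magnetic"].values()
            for component in components
        )
    # Magnetic disk failure: only that disk's components are affected.
    return sorted(disk_group["magnetic"].get(failed_device, []))


print(degraded_components(DISK_GROUP, "ssd-0"))  # ['vm1-mirror', 'vm2-witness', 'vm3-mirror']
print(degraded_components(DISK_GROUP, "hdd-1"))  # ['vm3-mirror']
```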
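The "subject to there being sufficient resources" condition can be sketched as a simple placement check. The selection rule here is an assumption for illustration: a candidate host must not already hold a component of the same object, must have enough free capacity, and the host with the most free space wins. The `pick_rebuild_host` function and the capacity figures are hypothetical.

```python
def pick_rebuild_host(free_gb_by_host, hosts_in_use, needed_gb):
    """Pick a host for the new component, or None if VSAN would have to wait.

    Assumptions: a candidate must not already hold a component of the same
    object and must have enough free capacity; the host with the most free
    space wins.
    """
    candidates = {
        host: free
        for host, free in free_gb_by_host.items()
        if host not in hosts_in_use and free >= needed_gb
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


free = {"esxi-01": 200, "esxi-02": 300, "esxi-03": 80, "esxi-04": 400}
# Surviving mirror on esxi-02, witness on esxi-03; a 100 GB mirror must be rebuilt.
print(pick_rebuild_host(free, {"esxi-02", "esxi-03"}, 100))  # esxi-04
# No other host has room, so the rebuild waits until resources are added.
print(pick_rebuild_host({"esxi-02": 300, "esxi-03": 80}, {"esxi-02", "esxi-03"}, 100))  # None
```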
If a disk is removed (i.e. not a SMART disk failure), VSAN will wait for 60 minutes before rebuilding the component on another node. The 60-minute wait is configured on each ESXi host (advanced settings > VSAN.ClomRepairDelay) and can be changed if required. The setting must be the same on each ESXi host in the VSAN cluster.
During the 60-minute delay the VM (if using an FTT=1 policy) would be at risk from a second failure. The 60-minute delay is in place to allow for host patching or human error (such as pulling the wrong disk in a maintenance operation).
When a disk is removed from a VSAN cluster (but did not report a failure), all components on that disk or disk group will show as “absent”.
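Because the setting has to match on every host, a quick consistency check is useful after changing it. The sketch below assumes the VSAN.ClomRepairDelay values have already been collected from each host by some other means; the `inconsistent_hosts` helper and the sample values are hypothetical, and it does not query ESXi itself.

```python
def inconsistent_hosts(repair_delay_by_host):
    """Return hosts whose VSAN.ClomRepairDelay differs from the most common value."""
    values = list(repair_delay_by_host.values())
    expected = max(set(values), key=values.count)
    return sorted(
        host for host, value in repair_delay_by_host.items() if value != expected
    )


# Hypothetical values (in minutes) collected from each host's advanced settings.
settings = {"esxi-01": 60, "esxi-02": 60, "esxi-03": 90, "esxi-04": 60}
print(inconsistent_hosts(settings))  # ['esxi-03']
```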
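The practical difference between the “degraded” and “absent” states is when the rebuild starts. A minimal sketch, assuming the default 60-minute delay and a hypothetical `rebuild_start` helper:

```python
from datetime import datetime, timedelta

REPAIR_DELAY = timedelta(minutes=60)  # default VSAN.ClomRepairDelay per the text


def rebuild_start(state, detected_at, repair_delay=REPAIR_DELAY):
    """Return the earliest time a rebuild would start for a component."""
    if state == "degraded":
        # Reported failure: the disk is not coming back, so rebuild at once.
        return detected_at
    if state == "absent":
        # Removed without a reported failure: wait for the repair delay first.
        return detected_at + repair_delay
    raise ValueError(f"unknown component state: {state}")


detected = datetime(2014, 6, 1, 12, 0)  # arbitrary example timestamp
print(rebuild_start("degraded", detected))  # 2014-06-01 12:00:00
print(rebuild_start("absent", detected))    # 2014-06-01 13:00:00
```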
VSAN Host Failure
This scenario is slightly different from a disk failure. In the case of a disk failure, VSAN would have been aware of the failure and would have known the disk was not coming back (or, for a removed disk, would have waited 60 minutes before rebuilding). With a host failure, VSAN cannot know whether the host will return.
As soon as VSAN realises the ESXi host is absent the VSAN.ClomRepairDelay timer will start. If the ESXi host comes back within those 60 minutes VSAN will synchronize the mirror copies and normal operations will resume.
If the component doesn’t come back then VSAN will create a new mirror copy (component).
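The host failure handling can be summarised as a small decision function. This is a sketch under the default 60-minute delay; the `host_failure_action` helper and its return strings are hypothetical labels, not VSAN states.

```python
def host_failure_action(minutes_absent, host_returned, repair_delay_minutes=60):
    """Return what happens to the components of an absent host."""
    if host_returned and minutes_absent < repair_delay_minutes:
        # Host came back in time: bring the existing mirror copies up to date.
        return "resync"
    if not host_returned and minutes_absent < repair_delay_minutes:
        # Timer still running: no new component is created yet.
        return "wait"
    # Timer expired: a new mirror copy is created elsewhere.
    return "rebuild"


print(host_failure_action(20, host_returned=True))   # resync
print(host_failure_action(20, host_returned=False))  # wait
print(host_failure_action(75, host_returned=False))  # rebuild
```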
Note that if, for whatever reason, the component returns after VSAN has started resyncing, VSAN will assess whether it makes more sense to bring the existing but outdated component back in sync or to continue creating the new component. On top of that, VSAN also has a “rebuild throttling / QoS” mechanism, which will throttle back replication traffic during a rebuild when it could impact virtual machine performance.
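How VSAN makes this assessment is not described here. Purely as an illustration, the sketch below assumes the choice comes down to which path has less data left to copy. The `cheaper_recovery` function, its parameters and the figures are all hypothetical.

```python
def cheaper_recovery(stale_gb_behind, new_component_gb, new_component_gb_copied):
    """Pick the recovery path with less data left to copy (assumed heuristic)."""
    remaining_new = new_component_gb - new_component_gb_copied
    if stale_gb_behind <= remaining_new:
        return "resync returned component"
    return "continue new component"


# Returned component is only 5 GB behind; the new copy still needs 80 GB.
print(cheaper_recovery(5, 100, 20))   # resync returned component
# Returned component is 50 GB behind; the new copy needs just 10 GB more.
print(cheaper_recovery(50, 100, 90))  # continue new component
```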