Deduplication and compression happen during destaging from the caching tier to the capacity tier. You enable “space efficiency” at the cluster level, and deduplication happens on a “per disk group” basis, so bigger disk groups will result in a higher deduplication ratio. After the blocks are deduplicated they are compressed. Compression alone is already a significant saving; combined with deduplication, the results can be up to 5x space reduction, depending on the workload and type of VMs.
Compression (LZ4) is performed during destaging from the caching tier to the capacity tier. The block size for deduplication is 4 KB. Each unique 4 KB block is compressed, and if the output is less than or equal to 2 KB, the compressed block is saved in place of the 4 KB block. If the output is greater than 2 KB, the block is written uncompressed and tracked as such. The reason is to avoid block alignment issues and to avoid the CPU hit of decompression, which is not worth paying for data with low compression ratios. All of this data reduction happens after the write acknowledgement.
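The 2 KB rule above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's standard library has no LZ4, so `zlib` stands in for it, and the function names `destage_block` and `read_block` are hypothetical, not VSAN internals.

```python
import zlib
from typing import Tuple

BLOCK_SIZE = 4096        # deduplication/compression block size (4 KB)
COMPRESSED_LIMIT = 2048  # keep the compressed form only if it fits in 2 KB


def destage_block(block: bytes) -> Tuple[bool, bytes]:
    """Return (is_compressed, payload) for one unique 4 KB block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected a 4 KB block")
    candidate = zlib.compress(block)  # assumption: zlib as a stand-in for LZ4
    if len(candidate) <= COMPRESSED_LIMIT:
        return True, candidate
    # Poorly compressible: store the original and track it as uncompressed,
    # so a later read never pays for decompression.
    return False, block


def read_block(is_compressed: bool, payload: bytes) -> bytes:
    """Decompress only the blocks that were tracked as compressed."""
    return zlib.decompress(payload) if is_compressed else payload
```

A block of zeros ends up stored compressed, while a block of random bytes is stored as-is and read back without any decompression work.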
Deduplication domains are within each disk group. This avoids the need for a global lookup table (which carries a significant resource overhead) and allows those resources to be put towards tracking a smaller and more meaningful block size. VMware purposefully avoids deduplicating “write hot” data in the cache and decompressing incompressible data, so significant CPU/memory resources are not wasted.
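To make the per-disk-group domain concrete, here is a minimal sketch of a lookup table that only knows about blocks in its own disk group. The `DiskGroup` class is hypothetical, and SHA-1 is used purely as a convenient fingerprint for the example; the text above does not say how VSAN fingerprints blocks.

```python
import hashlib
from typing import Dict


class DiskGroup:
    """Hypothetical model of one deduplication domain."""

    def __init__(self) -> None:
        self.blocks: Dict[bytes, bytes] = {}  # fingerprint -> stored 4 KB block
        self.refcount: Dict[bytes, int] = {}  # fingerprint -> logical references

    def write(self, block: bytes) -> bytes:
        """Store a block once per disk group and count every logical write."""
        fingerprint = hashlib.sha1(block).digest()  # assumption: SHA-1 for the sketch
        if fingerprint not in self.blocks:
            self.blocks[fingerprint] = block
        self.refcount[fingerprint] = self.refcount.get(fingerprint, 0) + 1
        return fingerprint

    def dedupe_ratio(self) -> float:
        """Logical blocks written divided by unique blocks stored."""
        if not self.blocks:
            return 1.0
        return sum(self.refcount.values()) / len(self.blocks)
```

Writing the same block twice to one `DiskGroup` stores it once, while writing it to two different `DiskGroup` instances stores it twice. That is the trade-off described above: no global table to maintain, at the cost of duplicates across disk groups.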
RAID 5 and RAID 6 over the network are sometimes also referred to as erasure coding. In this case RAID-5 requires a minimum of 4 hosts, as it uses a 3+1 logic. With 4 hosts, 1 can fail without data loss. This results in a significant reduction of required disk capacity. Normally a 20GB disk would require 40GB of disk capacity to achieve FTT=1, but with RAID-5 over the network the requirement is only ~27GB for the same level of protection.
Erasure codes offer “guaranteed” capacity reduction, unlike deduplication and compression. For customers who have no thin provisioning policies, have data that is already compressed and deduplicated, or have encrypted data, RAID 5 offers “known” capacity gains.
This can be applied on a granular basis (per VMDK) using the Storage Policy Based Management system.
Note: All Flash VSAN only.
Note: Not supported with stretched clusters.
Note: This does not require the cluster size to be a multiple of 4, just 4 or more.
With RAID 6 two host failures can be tolerated, similar to FTT=2. In the traditional scenario for a 20GB disk the required disk capacity would be 60GB, but with RAID-6 over the network this is just 30GB. Note that the parity is distributed across all hosts and there is no dedicated parity host or anything like that.
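The 20GB figures quoted for mirroring, RAID-5 and RAID-6 all fall out of simple arithmetic. The sketch below assumes a 3+1 layout for RAID-5 (as stated above) and a 4+2 layout for RAID-6, which is consistent with the 30GB figure but is not spelled out in the text; the function names are made up for the example.

```python
def mirror_capacity_gb(vmdk_gb: float, ftt: int) -> float:
    """Mirroring keeps FTT + 1 full copies of the data."""
    return vmdk_gb * (ftt + 1)


def erasure_capacity_gb(vmdk_gb: float, data: int, parity: int) -> float:
    """Erasure coding stores `parity` extra segments per `data` segments."""
    return vmdk_gb * (data + parity) / data


if __name__ == "__main__":
    print(mirror_capacity_gb(20, ftt=1))                 # mirroring, FTT=1
    print(erasure_capacity_gb(20, data=3, parity=1))     # RAID-5 (3+1)
    print(mirror_capacity_gb(20, ftt=2))                 # mirroring, FTT=2
    print(erasure_capacity_gb(20, data=4, parity=2))     # RAID-6 (assumed 4+2)
```

RAID-5 works out to roughly 26.7GB, which is where the ~27GB above comes from, and RAID-6 to exactly 30GB, against 40GB and 60GB for the mirrored equivalents.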
IOPS limits can be enabled per VMDK using SPBM. Service providers can use this to create differentiated service offerings on the same cluster/pool of storage, and it can also be used to solve a “noisy neighbor” scenario where one VM is hammering the underlying storage.
By default the limit is normalized to a 32 KB block size, so a 500 IOPS limit will allow only 250 IOPS of 64 KB blocks, while it will still allow the full 500 IOPS of 8/16/32 KB blocks.
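A quick way to reason about the normalization is to count each I/O in units of 32 KB. The sketch below is an assumption about how the accounting works, in particular that sizes between multiples of 32 KB round up; it reproduces the 500/250 example but is not taken from VSAN itself.

```python
import math

NORMALIZED_IO_KB = 32  # default normalization size


def normalized_cost(io_size_kb: int) -> int:
    """Number of normalized I/Os one real I/O counts as (assumed to round up)."""
    return max(1, math.ceil(io_size_kb / NORMALIZED_IO_KB))


def effective_iops(limit: int, io_size_kb: int) -> int:
    """Real I/Os per second a VMDK can issue at a given I/O size."""
    return limit // normalized_cost(io_size_kb)


if __name__ == "__main__":
    for size_kb in (8, 16, 32, 64):
        print(size_kb, "KB ->", effective_iops(500, size_kb), "IOPS")
```

With a limit of 500, the 8, 16 and 32 KB sizes each get 500 IOPS and 64 KB gets 250, so it pays to know the I/O size of a workload before picking a limit.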
Software checksum is a cluster-wide setting and is on by default; it can be disabled on a per-object basis using storage policies.
Software checksum enables us to detect corruption that could be caused by hardware/software components, including memory, drives, etc., during read or write operations. In the case of drives, there are two basic kinds of corruption. The first is “latent sector errors”, which are typically the result of a physical disk drive malfunction. The other type is silent corruption, which can happen without warning. Undetected or completely silent errors could lead to lost or inaccurate data and significant downtime. There is no effective means of detection without end-to-end integrity checking.
During read/write operations VSAN will check the validity of the data based on the checksum. If the data is not valid, it takes the necessary steps to either correct the data or report it to the user to take action. These actions could be:
- Fetch the data from another copy of the data for RAID1, RAID5/6, etc. (this is what we call recoverable data).
- If there is no valid copy of the data, the error will be returned (this is what we call a non-recoverable error).
- In case of errors, the issues will be reported in the UI and logs. This will include impacted blocks and their associated VMs.
- We will be able to see the list of the VMs/blocks that are hit by non-recoverable errors.
- We will be able to see the historical/trending errors on each drive.
CRC32 is the algorithm used (CPU offload support reduces overhead).
There will be two levels of scrubbing:
- Component level scrubbing: every block of each component is checked. If a checksum mismatch is found, the scrubber tries to repair the block by reading other components.
- Object level scrubbing: for every block of the object, the data of each mirror (or the parity blocks in RAID-5/6) is read and checked. If the data is inconsistent, all data in this stripe is marked as bad.
Repair can happen during normal I/O at the DOM Owner or by the scrubber.
The repair paths for mirror and RAID-5/6 are different. When checksum verification fails, the scrubber or DOM Owner will read the other copy of the data (or the other data in the same stripe in the case of RAID-5/6), rebuild the correct data and write it out to the bad location.
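The mirror repair path described above can be sketched as follows. This is a simplified illustration, not VSAN code: `Replica` and `read_with_repair` are hypothetical names, `zlib.crc32` stands in for whatever CRC32 variant and offload VSAN actually uses, and the RAID-5/6 path (rebuilding from the rest of the stripe) is not modeled.

```python
import zlib
from typing import List, Optional


class Replica:
    """Hypothetical in-memory stand-in for one mirror copy of a block."""

    def __init__(self, data: bytes) -> None:
        self.data = data


def read_with_repair(replicas: List[Replica], expected_crc: int) -> bytes:
    """Return verified data, overwriting any bad copy with a good one."""
    good: Optional[bytes] = None
    for replica in replicas:
        if zlib.crc32(replica.data) == expected_crc:
            good = replica.data
            break
    if good is None:
        # No valid copy anywhere: this is the non-recoverable case.
        raise IOError("non-recoverable checksum error")
    for replica in replicas:
        if zlib.crc32(replica.data) != expected_crc:
            replica.data = good  # write the correct data to the bad location
    return good
```

If one of two mirrors is corrupt, the read still succeeds and the bad copy is rewritten; if both are corrupt, the error is returned to the caller.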
This will replace the 1MB cache lines used for read ahead with a larger cache (0.4% of host memory, up to 1GB). Preliminary testing with VDI shows some impressive numbers, and this will complement CBRC. Data locality will be used for the memory cache (as we do with CBRC), as this is a read-only cache (so no need for network ACK).
Sparse swap will be an advanced host-level option (swap is not managed by SPBM but by the kernel). This will enable the reclaiming of space dedicated to memory. On a cluster with 256GB per host, this would yield TBs of capacity savings at scale. This should benefit linked clone VDI storage utilization.
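The cache sizing rule is simple enough to write down. The helper below is just the arithmetic from the sentence above, with a made-up function name and memory expressed in GB.

```python
def client_cache_gb(host_memory_gb: float) -> float:
    """0.4% of host memory, capped at 1 GB."""
    return min(host_memory_gb * 0.004, 1.0)


if __name__ == "__main__":
    for memory_gb in (64, 128, 256, 512):
        print(memory_gb, "GB host ->", client_cache_gb(memory_gb), "GB cache")
```

In other words, hosts with 250GB of memory or more all hit the 1GB cap, and smaller hosts get proportionally less.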
The Performance Monitoring Service makes it possible to monitor existing workloads from vCenter. The performance monitor includes macro-level views (cluster latency, throughput, IOPS) as well as granular views (per disk, cache hit ratios, per disk group stats) without needing to leave vCenter.
The performance monitor allows aggregation of stats across the cluster into a “quick view” to see what load and latency look like, and it can share that information externally with 3rd party monitoring solutions through an API.
The performance monitoring service runs on a distributed database that is stored on VSAN and NOT vCenter (it will use up to ~255GB).
Health and performance monitoring are not plugins; they are native features now. The key thing to expect is regular iteration of this outside of normal releases.
- Event based alarm triggers, not just periodic checks
- Lots more health checks
- Detailed space reporting
All management capabilities delivered as fully integrated features
The second biggest gap so far has been performance monitoring. We actually have two ways of addressing this. For VROps customers, the upcoming VROps management pack (in Beta right now) will deliver performance monitoring, trending and alerting for VSAN. We believe this will be a great way for customers to operate their VSAN environments. However, we have also heard loud and clear that VROps is not used by all customers, and not used by all admins within the customers. Reasons for this can be plenty, ranging from familiarity with and learning curve of VROps to operational complexities due to deploying and managing a separate product and appliance, to basic licensing concerns.