With the rise of Azure Stack HCI and now Azure Local, we see that yet again a hypervisor battle is going on. Many are considering moving to Azure Local because of all the licensing changes and the features that you now get on top of it (some for free), but I'll get back to that later. One question I often get is: how does Azure Local actually handle storage on its underlying storage fabric?
Without going into the details of interleave and columns, which are handled at the Storage Spaces Direct level: if we have a virtual machine with its own virtual hard drives (each stored as a VHDX file), that VHDX file lives on a ReFS-formatted Cluster Shared Volume (CSV) accessible via the SMB protocol, for example under a path such as C:\ClusterStorage\Volume1\VM01\VM01.vhdx that is visible from every node in the cluster.
Then, depending on the resiliency level you have configured on the cluster (parity, or two-way / three-way mirroring), all write operations are spread across the nodes in the cluster. With a three-way mirror, every piece of data is written three times within the cluster. To make sure you are not slowed down by the network when handling write operations, you should have an RDMA-based network. Azure Local / Azure Stack HCI supports RDMA with either the Internet Wide Area RDMA Protocol (iWARP) or RDMA over Converged Ethernet (RoCE) implementations.
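To make that write path a bit more concrete, here is a minimal, purely conceptual Python sketch of the fan-out under a three-way mirror: the same data chunk is sent to three different nodes before the write is considered complete. The node names and functions below are made up for illustration; real Storage Spaces Direct placement works on slabs, columns and fault domains and is far more involved.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical 4-node cluster
COPIES = 3                                     # three-way mirror: 3 copies of every chunk

def pick_mirror_targets(chunk_id: bytes, nodes=NODES, copies=COPIES):
    """Pick which nodes hold the copies of a chunk (illustration only)."""
    start = int.from_bytes(hashlib.sha256(chunk_id).digest()[:4], "big")
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]

def send_over_network(node: str, chunk_id: bytes, data: bytes) -> None:
    # Placeholder for the actual SMB/RDMA transfer to the target node.
    print(f"replicating {len(data)} bytes of chunk {chunk_id.hex()} to {node}")

def write_chunk(chunk_id: bytes, data: bytes) -> list:
    targets = pick_mirror_targets(chunk_id)
    # The write is only complete once every copy has crossed the network,
    # which is why a low-latency RDMA fabric matters so much here.
    for node in targets:
        send_over_network(node, chunk_id, data)
    return targets

print(write_chunk(b"\x00\x01", b"x" * 4096))
```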
The network design for these nodes can either use a ToR (top-of-rack) leaf-spine topology, or a switchless, fully meshed design, which is also supported for smaller deployments.
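As a rough illustration of why switchless is only practical for small clusters: a full mesh needs a direct connection between every pair of nodes, so the cable count grows quadratically with the node count. A quick sketch, assuming one link per node pair (real deployments often use two or more for redundancy):

```python
def full_mesh_links(nodes: int, links_per_pair: int = 1) -> int:
    """Direct connections needed for a switchless full mesh of `nodes` hosts."""
    return links_per_pair * nodes * (nodes - 1) // 2

for n in (2, 3, 4, 8, 16):
    print(f"{n:2d} nodes -> {full_mesh_links(n):3d} direct connections")
```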
The performance difference between regular TCP and RDMA is significant. RDMA allows more direct communication, bypassing the TCP/IP stack and offloading work from the CPU to the network adapter, which means lower latency and higher throughput. That is exactly what is needed for the way Azure Local handles storage writes across nodes.
While this means that data needs to be written across other nodes, there are also different ways that data gets cached on the drives, depending on what kinds of disks you have. In Azure Stack HCI / Azure Local, the cache is configured automatically based on the disk layout. We also have caching features outside of S2D, such as the CSV cache, which uses memory on the physical host to cache read operations. The CSV cache is 1 GB by default and works together with the S2D caching mechanisms. It is important to note that the CSV cache is only for read operations and is the first caching layer on the hosts.
By default, the cache behavior is optimized based on the drive configuration (see the sketch after this list):
- All-flash systems: Only writes are cached
- Hybrid systems: Both reads and writes are cached
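As a hedged sketch of that selection logic in Python (the drive-type names and decision rule below are simplified; the actual behavior is chosen automatically by Storage Spaces Direct based on the full drive layout):

```python
def default_cache_behavior(cache_drives: str, capacity_drives: str) -> set:
    """Which I/O the S2D cache drives absorb by default (simplified model).

    All-flash (e.g. NVMe in front of SSD capacity) caches writes only,
    since reads from flash capacity are already fast; hybrid (flash in
    front of HDD capacity) caches both reads and writes."""
    all_flash = capacity_drives.upper() in {"SSD", "NVME"}
    return {"write"} if all_flash else {"read", "write"}

print(default_cache_behavior("NVMe", "SSD"))  # all-flash -> {'write'}
print(default_cache_behavior("SSD", "HDD"))   # hybrid    -> {'read', 'write'}
```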
For a read operation, a chunk of data can either be fetched from the CSV cache on the host where the virtual machine resides, or from the Storage Spaces Direct cache. For a write operation, a data chunk must be written to the other nodes in the cluster to honor the configured redundancy.
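A minimal sketch of that layered read path, assuming a hypothetical lookup order of CSV cache (host RAM) first, then the S2D cache drives, then the capacity drives; the tier names and dictionaries are stand-ins for illustration only:

```python
# Hypothetical in-memory stand-ins for the three tiers a read can hit.
csv_cache = {}                               # host RAM, read-only cache (default 1 GB)
s2d_cache = {}                               # cache drives on the storage nodes
capacity  = {"chunk-42": b"cold data"}       # capacity drives (slowest tier)

def read_chunk(chunk_id: str) -> bytes:
    """Serve a read from the nearest layer that has the data."""
    if chunk_id in csv_cache:                # 1. CSV block cache on the local host
        return csv_cache[chunk_id]
    if chunk_id in s2d_cache:                # 2. Storage Spaces Direct cache tier
        data = s2d_cache[chunk_id]
    else:                                    # 3. capacity tier
        data = capacity[chunk_id]
    csv_cache[chunk_id] = data               # populate the read cache for next time
    return data

print(read_chunk("chunk-42"))  # first read comes from the capacity tier
print(read_chunk("chunk-42"))  # second read is served from the CSV cache
```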