With the rise of hyperconverged technology, it is interesting to see how the different vendors in the market have embraced different architectures in their implementations. I’ve previously blogged about how VMware, Microsoft and Nutanix have done this –> http://msandbu.org/storage-warshci-edition/
I therefore decided to dig a bit deeper into the way Microsoft has implemented its hyperconverged infrastructure solution, called Storage Spaces Direct.
For those who are not aware of this new feature, it is a further improvement of Storage Spaces, which came in 2012 and allowed us to create a virtual disk based on one or more physical disks using different “RAID”-like features such as mirroring, parity or striping. In 2012 R2 they came with improved performance and new features such as tiering.
Fast forward to 2016, and the market has changed: VMware and Nutanix have been in the hyperconverged space for a while, and Microsoft will soon be taking the step into this market. Unlike VMware and Nutanix, Microsoft recommends that you have an RDMA-based backend network if you use their hyperconverged solution.
Why? The issue is the way that Microsoft stores data in the storage pools: the storage for a virtual machine can be placed anywhere in the cluster. So what’s the issue with that?
Imagine a virtual machine running on top of an S2D (Storage Spaces Direct) cluster. For a single virtual machine, all the blocks it writes down to the CSV storage can be placed on any node within the cluster (of course, there are some rules for where data will be placed, depending on the fault tolerance settings). But if we look at the traffic generated here, all writes must be written twice (depending on the resiliency defined), which means they go to one or two hosts that are remote from where the virtual machine resides before the VM is allowed to continue operating. A virtual disk consists of multiple extents, with a default size of 1 GB per extent, which are then placed on the different hosts in the cluster.
The issue here is the latency added in the network layer. So let’s think of a traditional Ethernet TCP/IP network.
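To make the placement idea a bit more concrete, here is a minimal Python sketch. It is not how Storage Spaces Direct actually decides placement; the round-robin rule, the node names and the function names are all my own assumptions for illustration. It only shows that with 1 GB extents and two copies per extent, most copies end up on a host other than the one running the VM.

```python
import math

EXTENT_SIZE_GB = 1  # default extent size mentioned in the text


def place_extents(disk_size_gb, nodes, copies=2):
    """Return a list of (extent_index, [nodes holding a copy]).

    Hypothetical round-robin placement, NOT the real S2D algorithm.
    """
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    extent_count = math.ceil(disk_size_gb / EXTENT_SIZE_GB)
    placement = []
    for i in range(extent_count):
        # Rotate the starting node so extents spread across the cluster;
        # each copy of an extent lands on a different node.
        holders = [nodes[(i + c) % len(nodes)] for c in range(copies)]
        placement.append((i, holders))
    return placement


def remote_copies(placement, vm_host):
    """Count extent copies stored on a node other than the VM's host."""
    return sum(1 for _, holders in placement for n in holders if n != vm_host)


nodes = ["node1", "node2", "node3", "node4"]  # made-up node names
layout = place_extents(8, nodes, copies=2)
for index, holders in layout:
    print(index, holders)
print("remote copies:", remote_copies(layout, "node1"))
```

In this toy layout an 8 GB disk has 16 extent copies in total, and 12 of them sit on a node other than node1, so writes from a VM on node1 mostly have to cross the network.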
Quote from the VMware VSAN network design:
“The majority of customers with production Virtual SAN deployments (and for that matter any hyper-converged storage product) are using 10Gigabit Ethernet (10GbE). 10GbE networks have observed latencies in the range of 5 – 50 microseconds. (Ref: Qlogic’s Introduction to Ethernet Latency – http://www.qlogic.com/Resources/Documents/TechnologyBriefs/Adapters/Tech_Brief_Introduction_to_Ethernet_Latency.pdf)”
So data would need to travel from the virtual machine to the VMswitch, to the filesystem, to the virtual disk, ClusterPort, block over SMB, be processed by TCP/IP, then reach the storage controller and eventually the disks, and this has to happen twice before the VM can continue its operations (this is to ensure availability of data).
Now with storage devices becoming faster and faster, you might argue that the backend network will become the bottleneck, since it operates with 5 – 50 microseconds of latency. While NVMe or SSD flash devices can operate within the sub-10-microsecond range, you would not be able to leverage that speed properly because of the higher latency on the network.
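A rough back-of-the-envelope calculation shows why the network starts to dominate. The sketch below is only an approximation under my own assumptions: the 5 – 50 microsecond figures from the quote are treated as one-way latency, the remote copies are written in parallel, the device takes 10 microseconds, and the software stack is ignored entirely.

```python
def write_latency_us(device_us, network_one_way_us, remote_copies):
    """Rough latency of one acknowledged write, in microseconds.

    Assumes remote copies are written in parallel, so the slowest path
    (one network round trip plus one device write) sets the total.
    """
    if remote_copies == 0:
        return device_us
    return 2 * network_one_way_us + device_us


def network_share(device_us, network_one_way_us):
    """Fraction of the write latency that is spent on the network."""
    total = write_latency_us(device_us, network_one_way_us, remote_copies=1)
    return (total - device_us) / total


DEVICE_US = 10  # assumed flash device latency, taken from the text
for one_way_us in (5, 50):  # range quoted for 10GbE
    total = write_latency_us(DEVICE_US, one_way_us, remote_copies=1)
    share = network_share(DEVICE_US, one_way_us)
    print(f"one-way {one_way_us} us -> total {total} us, network share {share:.0%}")
```

Even at the low end of the range, half of the time is spent on the wire in this model, and at the high end the device itself is almost a rounding error.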
Microsoft has a read-based host cache implemented using the CSV cache, which offers some form of data locality for virtual machines: some reads will be served from memory on the host that they reside on, and memory delivers very low latency and high throughput. But this helps only reads, not write operations.
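The effect of a read cache like this can be expressed as a simple weighted average. The sketch below is generic cache arithmetic, not a model of the CSV cache itself, and the hit ratios and latency numbers are placeholders I picked for illustration, not measurements.

```python
def effective_read_latency_us(hit_ratio, memory_us, remote_us):
    """Average read latency when a fraction of reads is served from local memory."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * memory_us + (1.0 - hit_ratio) * remote_us


MEMORY_US = 0.1    # placeholder for a read served from host memory
REMOTE_US = 100.0  # placeholder for a read that has to cross the network
for hit_ratio in (0.0, 0.5, 0.9):
    avg = effective_read_latency_us(hit_ratio, MEMORY_US, REMOTE_US)
    print(f"hit ratio {hit_ratio:.0%}: average read latency {avg:.2f} us")
```

Writes do not appear in this formula at all, which is the point: no matter how good the hit ratio is, every write still has to reach the remote hosts.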
This is where RDMA comes in!
For those who don’t know what RDMA is, it is a technology that allows direct memory access from one computer to another, bypassing the TCP layer, CPU, OS layer and driver layer, which allows for low-latency and high-throughput connections. This is done with hardware transport offloads on network adapters that support RDMA.
Microsoft has been working with RDMA since Server 2003, and with 2016 there are multiple improvements such as SET (Switch Embedded Teaming), where NIC teaming and the Hyper-V switch are a single entity that can now be used in conjunction with RDMA NICs, whereas in 2012 you needed to have separate NIC teams for RDMA and the Hyper-V switch.
Configuring SET with RDMA: https://technet.microsoft.com/en-us/library/mt403349.aspx
The interesting thing about this technology is that it makes remote NVMe and SSD devices behave like local devices to the physical host in terms of latency. As an example, Mellanox tested Storage Spaces Direct with and without RDMA to display the latency and throughput differences.
Another test that Mellanox did was of RoCE (RDMA over Converged Ethernet), which was using the pre-standard NVMf.
NOTE: 1000 us = 1 ms
This shows that it gives a tremendous improvement in throughput while reducing CPU overhead, which is crucial in a hyperconverged setup where you have storage and compute merged together.
So to summarize:
Is data locality important? Yes, it still is, but only to a certain degree. It is useful for READS to have the data stored as close to the machine as possible, and to make sure that you have tiering features so that the hot data is stored on the fastest tier close to the virtual machine. For WRITES you cannot escape the fact that you need to write the data twice for resiliency, and that it has to be stored on two different hosts. For these WRITES, having a backend networking solution like RoCE / iWARP will drastically improve the performance of Storage Spaces Direct because of the architecture. It should be noted that this comes with a cost, and for many that might mean they need to reinvest in new network equipment. And with that I present the latest Storage Spaces Direct benchmark from the Storage Spaces Direct PM: https://blogs.technet.microsoft.com/filecab/2016/07/26/storage-iops-update-with-storage-spaces-direct/
There are some requirements if you want to implement RDMA on Windows Server 2016:
For iWARP-capable NICs, the switch requirements are the same as for non-RDMA-capable NICs.
For RoCE-capable NICs, the network switches must provide Enhanced Transmission Selection (802.1Qaz) and Priority Based Flow Control (802.1p/Q and 802.1Qbb).
If RDMA-capable NICs are used, the physical switch must meet the associated RDMA requirements.
Mappings of TC class markings between L2 domains must be configured between switches that carry RDMA traffic.
You need to have DCB installed.
RDMA/RoCE Considerations for Windows 2016 on ConnectX-3 Adapters: