After working with Microsoft Azure for over 10 years now, I think this must be the question I have received the most: "why is my VM running so slow in Azure?". Companies that have been using Azure as a pure IaaS platform and have moved or built new virtual machines on the cloud platform will certainly notice the difference compared to a traditional virtual infrastructure.
The big picture of things that can get in the way
However, I wanted to use this blog post to elaborate on why certain things might be a bit slower, which other factors can impact performance, which features you can use to enhance performance, and lastly how to troubleshoot if something is not working as intended.
NOTE: Much of the content in this blog post is also applicable to other cloud providers; just replace the Azure names with the equivalent features or services from the other vendors.
1: Moving from on-premises to a cloud-based platform
The jump from having virtual machines running on a private cloud or an existing data center with some form of virtualization (or even modern HCI), where you have high-end storage capacity, to a cloud-based platform is going to be a painful one. With most modern HCI vendors, a VM runs with much of the required data either cached on the local physical host it is running on or stored nearby on an SSD/NVMe drive. If we go even further back, traditional SANs also had high-end FC/iSCSI connections from the virtualization layer to a backend SAN with some caching mechanism or SSD capabilities. In most environments you do not have any QoS on storage traffic, so when a VM requires higher bandwidth and/or I/O, it gets access to it when needed. This might affect other VMs, but we don't really care since it is rarely noticeable.

Now try to carry both that architecture and that kind of access to I/O over to a cloud-based platform; it won't work. First, when Microsoft (and others) started building cloud platforms, they needed to create a storage layer that can scale in both performance and capacity independently of compute and without SAN limitations. Secondly, if you want to provide a cloud-based model with standardized services and ensure the same experience and performance for all customers, you need QoS in place to guarantee consistent performance. That way they can calculate how much storage traffic they need to handle at any given time. So when you provision a VM with storage in Azure, you get a predefined limit on I/O and bandwidth, unlike what you usually get on-premises (yes, you do have the option to define storage QoS on-premises too).
So much of the design is about providing services in a standardized manner that can scale across multiple data centers and provide high levels of redundancy. Just as a comparison: I can set up an Azure VM today with a standard SSD that only provides me with 500 IOPS, while the SSD drive in my laptop can deliver anywhere from 1,500 to 5,000 IOPS. It is important to understand that while some disk SKUs only provide 500 IOPS, there are a lot of other SKUs that provide higher levels of performance.
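To get a rough feel for this difference yourself, here is a minimal Python sketch (Linux-oriented, single-threaded) that samples random 4 KiB reads against a pre-created test file. The file path, file size, and duration are assumptions for illustration; a real benchmark tool such as fio or diskspd will give far more reliable numbers.

```python
import os
import random
import time

# Rough random-read IOPS probe (a sketch, not a proper benchmark like fio/diskspd).
# TEST_FILE is an assumption: point it at a pre-created file on the disk you want
# to measure, ideally several GiB in size to limit page-cache effects.
TEST_FILE = "testfile.bin"
BLOCK_SIZE = 4096   # 4 KiB random reads
DURATION = 10       # seconds to sample

def measure_random_read_iops(path: str) -> float:
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        ops = 0
        deadline = time.time() + DURATION
        while time.time() < deadline:
            offset = random.randrange(0, size - BLOCK_SIZE)
            os.pread(fd, BLOCK_SIZE, offset)  # one read = one IO operation
            ops += 1
        return ops / DURATION
    finally:
        os.close(fd)

if __name__ == "__main__":
    iops = measure_random_read_iops(TEST_FILE)
    print(f"~{iops:.0f} random-read IOPS (single-threaded, cached reads possible)")
```

Run it on your laptop and on a standard SSD-backed Azure VM and you will quickly see the QoS cap kick in on the latter.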
2: Understanding the building blocks
Many tend to compare Azure and other cloud platforms to any other virtualization platform, but there are a lot of differences, especially when it comes to automation. Therefore, I wanted to showcase some of the building blocks in Azure.
When you use any CLI/SDK/API/web portal to interact with Azure, you are interacting with a global front end (previously known as the Red Dog Front End, now called Azure Resource Manager). If I go into the Azure Portal and provision a virtual machine, there are a lot of steps that happen in the background.
The front end will send API calls to the underlying fabric controller, which will do a couple of things:
1: Allocate virtual machine capacity on a specific fabric host, based on the region/availability zone and the VM SKU requested. If the VM is part of an availability zone or proximity placement group, it will also honor those settings. This task is delegated via the Microsoft.Compute resource provider.
2: Delegate a task to the Microsoft.Storage resource provider to provision a managed disk with the size and performance tier needed. This also means provisioning replicas to provide the redundancy that was defined.
3: Delegate a task to the Microsoft.Network resource provider to provision a new virtual network (a VNI) for this subscription, provision a virtual NIC and attach it to the VNI, and then attach the managed disk to the VM using the fabric agent.
4: The fabric controller requires a successful response from each resource provider before the deployment is complete. Only then can it report back to the front end that the deployment has finished.
Also, as part of this process, the different resource providers will enforce QoS for each building block to ensure consistent performance for everyone else deploying and running resources.
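Just to illustrate how a single deployment fans out to these resource providers, here is a hedged sketch using the Azure SDK for Python (azure-identity and azure-mgmt-compute). The subscription, resource group, VM name, image, size, and the pre-created NIC ID are placeholders, not values from this post.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# All names below are placeholders for illustration.
SUBSCRIPTION_ID = "<subscription-id>"
RG, VM_NAME, LOCATION = "demo-rg", "demo-vm", "westeurope"
NIC_ID = "<resource id of a pre-created NIC>"  # provisioned via Microsoft.Network

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One call to ARM; behind it, Microsoft.Compute allocates capacity on a fabric host,
# Microsoft.Storage provisions the managed OS disk, and the fabric wires it together.
poller = compute.virtual_machines.begin_create_or_update(
    RG,
    VM_NAME,
    {
        "location": LOCATION,
        "hardware_profile": {"vm_size": "Standard_D4s_v5"},
        "storage_profile": {
            "image_reference": {
                "publisher": "Canonical",
                "offer": "0001-com-ubuntu-server-jammy",
                "sku": "22_04-lts-gen2",
                "version": "latest",
            },
            "os_disk": {
                "create_option": "FromImage",
                "managed_disk": {"storage_account_type": "Premium_LRS"},
            },
        },
        "os_profile": {
            "computer_name": VM_NAME,
            "admin_username": "azureuser",
            "admin_password": "<password-or-use-ssh-keys>",
        },
        "network_profile": {"network_interfaces": [{"id": NIC_ID}]},
    },
)

vm = poller.result()  # completes only when all resource providers report success
print(vm.provisioning_state)
```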
3: Provisioning resources
Unlike provisioning a VM in a virtualized environment like VMware or even Hyper-V, there are a lot of steps involved in provisioning a VM in Microsoft Azure. Provisioning through vCenter, for instance, is a much simpler two-way exchange between vCenter and ESXi, while in Azure there are multiple components involved to ensure high availability and scalability. Now you might ask: why is Google Cloud a lot faster than Azure at provisioning resources? One reason might be that Google Cloud internally uses gRPC while Azure uses REST.
The result, however, is that the VM gets placed on a hypervisor and will only be shown as available and ready once the Azure VM agent reports back to the fabric controller; storage performance will also impact how quickly the VM boots.
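If you want to see that "ready" signal yourself, a small sketch along these lines (again with placeholder names, assuming azure-mgmt-compute) reads the instance view, which includes both the power/provisioning state and the VM agent status.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholder subscription, resource group and VM name.
compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
view = compute.virtual_machines.instance_view("demo-rg", "demo-vm")

# Power and provisioning state as reported by the platform.
for status in view.statuses:
    print(status.code, "-", status.display_status)

# The in-guest VM agent; until this reports back, the VM isn't considered ready.
if view.vm_agent:
    print("VM agent version:", view.vm_agent.vm_agent_version)
    for status in view.vm_agent.statuses or []:
        print(" ", status.display_status, "-", status.message)
```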
4: Network always matters!
I've been talking a lot about the provisioning of resources and other components, but once the VM is in place on the hypervisor, why is it still slow? There are multiple factors that can impact network performance in Azure. (I also suggest reading this blog post) –> Troubleshoot Networking in Microsoft Azure | Marius Sandbu (msandbu.org)
In most cases, a VM's network performance in Azure behaves much like that of any other VM elsewhere. There are, however, a couple of distinct differences:
1: The VM runs within an encapsulated network (GENEVE), which adds a bit of latency when traffic is encapsulated and decapsulated.
2: All traffic within Azure runs with a 1,400-byte MTU, which adds some overhead to egress traffic since larger packets have to be split into more, smaller packets and reassembled on the other side.
3: By default, SR-IOV (Accelerated Networking) is not enabled for virtual machines, which means that traffic has to be processed by the virtual switch on the host instead of being handed straight to the NIC hardware on the host.
4: Distance! Everything is limited by the speed of light. Many do not take into consideration that moving virtual infrastructure to a new platform that might be many miles away will affect how quickly the VM responds to requests from end users (a simple latency probe is sketched after this list). In addition, depending on which direction the traffic goes, there might be obstacles along the way, starting with your local ISP. This is also highly dependent on which protocol is used to communicate with the VM.
5: If you have a lot of services in Azure but only one way out (a single public IP), you can also suffer from SNAT port exhaustion if you do not plan properly.
6: What I said about QoS for storage also applies to networking. Depending on which VM size you are using, this impacts two things: 1: how many network flows the VM can have, and 2: how much bandwidth the VM gets (allocated across its NICs). For instance, the flow limit for a VM counts the sum of incoming and outgoing flows, while the bandwidth limit applies to the VM's outbound (egress) traffic.
If your internal services use Azure Firewall for outbound connectivity as well, it might also be that Azure Firewall is scaling out, which can impact traffic for 10 – 15 minutes (each firewall instance can handle between 1.5 and 3 Gbps). You also have to consider which direction the traffic flows: do you have forced tunneling enabled to an on-premises gateway, or does traffic go directly out via the public IP of the VM?
Network latency also affects internal communication. Azure regions consist of multiple data centers, and many are again split into availability zones, which are independent data centers with dedicated cooling/power/network. The West Europe region alone is probably the equivalent of 20 soccer fields. If you have services that depend on low-latency connections between a front end and a backend, you can't have those two services scattered across various parts of those twenty fields. You do have the option to place such VMs close together within the same data center; this feature is called proximity placement groups.
7: Since Microsoft builds services on a global scale and handles a lot of traffic daily, they can optimize traffic flow and have multiple points of presence (PoPs) around the world. One newer capability that more services in Azure are using (and that Office 365 has used for some time) is routing preference. This feature ensures that traffic destined for somewhere in Azure is routed to the closest Azure PoP and then carried on the Azure backbone, instead of being routed via ISPs through a bunch of non-optimized paths.
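As a quick way to put numbers on the distance/latency point in item 4, here is a minimal TCP connect probe. The hostname and port are placeholders; ping or dedicated network tooling will give you a cleaner picture, but this works even where ICMP is blocked.

```python
import socket
import statistics
import time

# Placeholders: probe TCP connect latency towards an endpoint, for example your VM's
# public DNS name, or a private IP reached over VPN/ExpressRoute.
HOST, PORT, SAMPLES = "demo-vm.westeurope.cloudapp.azure.com", 443, 10

def tcp_connect_ms(host: str, port: int) -> float:
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass  # we only care about the time to establish the connection
    return (time.perf_counter() - start) * 1000

samples = [tcp_connect_ms(HOST, PORT) for _ in range(SAMPLES)]
print(f"min {min(samples):.1f} ms / median {statistics.median(samples):.1f} ms / "
      f"max {max(samples):.1f} ms over {SAMPLES} connects")
```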
5: Things that can get in the way of performance
While the network can be a big issue, one of the things that has always been troublesome is storage. As mentioned earlier, the storage layer in Azure is separate from compute, and hence the QoS mechanisms come into play. Secondly, there are other mechanisms that can make it difficult to understand why storage sometimes behaves differently. Temporary disks always live locally on the physical server on SSD capacity, while OS and data disks live on the remote storage layer (which uses replication to mirror the data); OS and data disks can also use a host caching mechanism that caches read/write operations. This cache is also wiped each time the VM is rebooted.
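As an illustration of that host caching setting, here is a hedged sketch (assuming the azure-mgmt-compute SDK, with placeholder names) that reads the caching mode of a VM's data disks and switches it to ReadOnly, which typically suits read-heavy workloads:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholder subscription, resource group and VM name.
compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
vm = compute.virtual_machines.get("demo-rg", "demo-vm")

for disk in vm.storage_profile.data_disks:
    print(disk.name, "current caching:", disk.caching)
    disk.caching = "ReadOnly"  # host cache options are None, ReadOnly, ReadWrite

# Applying the change is another ARM long-running operation.
compute.virtual_machines.begin_create_or_update("demo-rg", "demo-vm", vm).result()
```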
In addition, some disk tiers can now provide bursting. With bursting, an instance can, for example, start up faster. The default OS disk for premium-enabled VMs is the P4 disk, which has a provisioned performance of up to 120 IOPS and 25 MB/s. With bursting, the P4 can go up to 3,500 IOPS and 170 MB/s, allowing startup to accelerate by up to 6x. There are two different models here (a small credit-bucket simulation follows below the list):
- An on-demand bursting model (preview), where the disk bursts whenever its needs exceed its current capacity. This model incurs additional charges anytime the disk bursts. Noncredit bursting is only available on disks greater than 512 GiB in size.
- A credit-based model, where the disk will burst only if it has burst credits accumulated in its credit bucket. This model does not incur additional charges when the disk bursts. Credit-based bursting is only available on disks 512 GiB and smaller.
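To make the credit-based model more concrete, here is a small token-bucket style simulation. The P4 numbers come from above, while the bucket size (roughly 30 minutes of full-speed bursting) is an assumption for illustration only.

```python
# Token-bucket style simulation of credit-based disk bursting (illustrative only).
# Assumed numbers for a P4 disk: 120 provisioned IOPS, 3,500 burst IOPS, and a
# credit bucket sized for roughly 30 minutes of full-speed bursting.
PROVISIONED_IOPS = 120
BURST_IOPS = 3_500
BUCKET_CAPACITY = BURST_IOPS * 30 * 60  # IO credits

def simulate(demand_iops_per_second, credits=BUCKET_CAPACITY):
    """Yield the IOPS actually served each second for a given demand pattern."""
    for demand in demand_iops_per_second:
        if demand <= PROVISIONED_IOPS:
            # Unused provisioned IOPS accrue as credits, up to the bucket cap.
            credits = min(BUCKET_CAPACITY, credits + (PROVISIONED_IOPS - demand))
            yield demand
        else:
            # Burst above the baseline is paid for with credits.
            wanted = min(demand, BURST_IOPS) - PROVISIONED_IOPS
            burst = min(wanted, credits)
            credits -= burst
            yield PROVISIONED_IOPS + burst

# A VM boot: 5 minutes of heavy IO, then mostly idle while credits refill.
pattern = [3_000] * 300 + [50] * 300
served = list(simulate(pattern))
print("served during boot:", served[0], "IOPS; when idle afterwards:", served[-1], "IOPS")
```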
If you are using shared storage services such as Azure Files, you also need to understand that performance is linked to the size of the share. So if you have applications or other services that depend on a shared folder, and you haven't sized the share properly, you will get poor performance.
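If you are unsure how a share has been sized, a quick hedged sketch like this shows what has been provisioned. The connection string and share name are placeholders, and the provisioned_iops attribute is an assumption about what the azure-storage-file-share SDK exposes for premium shares, hence the fallback.

```python
from azure.storage.fileshare import ShareClient

# Placeholders: point this at the storage account and file share you actually use.
share = ShareClient.from_connection_string(
    "<storage-connection-string>", share_name="profiles"
)
props = share.get_share_properties()

# Quota is the provisioned size in GiB; on premium shares the service also reports
# provisioned performance (attribute name assumed here, hence the getattr fallback).
print("Provisioned size (GiB):", props.quota)
print("Provisioned IOPS:", getattr(props, "provisioned_iops", "n/a on this tier/SDK version"))
```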
Another issue that can impact the performance of services in Azure is when the platform services themselves are degraded, for instance because Microsoft has rolled out a new configuration change. While issues will happen from time to time, the focus when services are in a degraded state is always on availability, and performance can suffer in the meantime.
6: Size does matter! (On the VM..)
Now, if you have a proper network configuration in place and a storage configuration large enough to provide the performance you need, why are things still going slow? There might still be configuration issues related to the VM itself that impact performance.
One is which VM SKU you use. Within Azure you have different VM SKUs that use different CPU architectures (some with GPUs), and they have different amounts of horsepower as well.
Here, for instance, is a comparison of different VM SKUs (all with 16 vCPUs):
Full view –> https://i.imgur.com/oFR4L2M.png
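If you prefer pulling these numbers yourself rather than reading them off a screenshot, a hedged sketch against the Resource SKUs API could look like the following. The subscription is a placeholder, and capability names such as UncachedDiskIOPS are assumed to match what the API returns today.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Compare what different 16-vCPU SKUs actually give you in a region.
compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

for sku in compute.resource_skus.list(filter="location eq 'westeurope'"):
    if sku.resource_type != "virtualMachines":
        continue
    caps = {c.name: c.value for c in (sku.capabilities or [])}
    if caps.get("vCPUs") == "16":
        print(sku.name,
              "| mem GiB:", caps.get("MemoryGB"),
              "| uncached disk IOPS:", caps.get("UncachedDiskIOPS"),
              "| accelerated networking:", caps.get("AcceleratedNetworkingEnabled"))
```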
There are also many components that can impact the performance of a VM in Azure at the OS level, considering how many agents and services typically run inside the guest:
- Backup (VSS snapshot)
- Security Integrations that do real-time processing of changes
- File Tracking
- DNS and lookup time
- Active Directory and Sites defined for the VM (if applicable)
- Also, the VM SKU affects how much network bandwidth the VM gets and which types of storage disks are supported.
7: Troubleshoot all the things!
So how do we troubleshoot when something is sluggish in Azure? First, we can consult the all-knowing flow chart, which can be used as an indication depending on whether it is a VM-to-VM, VM-to-service, or VM-to-user connection flow.
For instance, if you have a VM in Azure that you connect to over RDP, and it runs as a single service in a simple network configuration, chances are it will behave like any other VM, with latency being the only noticeable performance difference.
Now for the second part: to monitor what is going on and to detect "slowness" or other issues, there are a lot of different logs and metrics we need to investigate. Logs are also useful for monitoring the availability of services.
1: Has anything changed recently? If something suddenly becomes slow, verify whether anything has changed in Azure. This can be done using Application Change Analysis.
Secondly, check Service health under Azure Monitor.
2: Then we should check the VM. Within Azure, there are different mechanisms for collecting metrics from the host layer, such as CPU, disk, memory, and data disk metrics. These are also useful for seeing how close you are to the platform QoS limits.
As one example, you can compare the target IOPS against the actual IOPS using Azure Monitor metrics.
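As a sketch of how you could pull such host metrics programmatically (placeholder resource ID and subscription, assuming azure-mgmt-monitor; the metric names are the ones shown in the portal at the time of writing):

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

# Placeholder resource ID: the VM whose host metrics you want to inspect.
VM_ID = ("/subscriptions/<subscription-id>/resourceGroups/demo-rg/"
         "providers/Microsoft.Compute/virtualMachines/demo-vm")

monitor = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

metrics = monitor.metrics.list(
    VM_ID,
    timespan=f"{start.isoformat()}/{end.isoformat()}",
    interval="PT5M",
    # Metric names as shown in the portal; "consumed percentage" metrics let you
    # compare actual IO against what the VM/disk SKU is allowed to deliver.
    metricnames="Percentage CPU,VM Uncached IOPS Consumed Percentage",
    aggregation="Average",
)

for metric in metrics.value:
    print(metric.name.value)
    for series in metric.timeseries:
        for point in series.data:
            print(" ", point.time_stamp, point.average)
```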
Now, if some service inside the VM is hogging resources or even network bandwidth, you have no way of seeing this using the host VM metrics. Then you would need to use VM Insights (Service Map and more), which installs an agent that collects metrics from inside the OS and also tracks traffic flows and service communication.
This also creates a topology map showing services and communication flow.
There is also a connection overview that can show the overall latency and bandwidth usage of each process.
You can also monitor connectivity using a newer service called Connection Monitor, which is part of Network Watcher. It requires that Network Watcher is enabled and that the Network Watcher extension is installed on the VM.
8: So, what is the difference?
To summarize, not much is that different when running a VM in Azure compared to a locally installed VM. Sure, Microsoft needs to make sure that all "residents" are treated equally, which means we need to understand the throughput and limitations of the VM, network, storage, and other services we use. There are unfortunately a lot of if/then/what scenarios, such as read/write caching, bursting, and other hidden magic, which can be difficult to take into account when troubleshooting. Hopefully this post gave you some tips you can use to understand the mechanisms in Azure and how to troubleshoot why things are going too slow in Azure (in most cases they are not).