This blog post is essentially a long recap of my session at the Nordic Infrastructure Conference in November (you can find the presentation here: https://github.com/msandbu/nic2024).
In this session I talked about:
* The core networking stack and some numbers in Azure
* Features and the software-defined networking stack and how to troubleshoot
* How many services in Azure are built
* Tools and logs to monitor network traffic
* Different use cases showing how to troubleshoot different scenarios
After working with Azure for many years, it is important to note that when stuff stops working in the network stack, it usually boils down to these usual suspects:
- Firewall
- DNS
- Routing (often misconfiguration)
- Non-supported features, often not documented
- Someone who screwed up
All of which I have now written down in this blog post. So, if you want to learn more about how Azure networking works, then I hope I’ve got it covered here.
Some core facts about the Azure network (NOTE: might already be outdated):
- Consists of 165,000 miles (2/3 of the distance to the moon) of private fiber spanning more than 60 regions and more than 170 network points of presence (POPs)
- <2 ms inter-AZ latency and <95 μs VM-to-VM TCP latency with Accelerated Networking (80% reduction in latency, 40% performance increase)
- Microsoft uses custom Azure SmartNICs with FPGAs, allowing for more programmable network cards.
- Offloading to the NIC in terms of routing, firewall, switching & encryption, essentially virtual network functions
- Uses software-defined networking, which bypasses the 4096 VLAN limit of a traditional network and can in theory support 16 million virtual networks per region (using VXLAN)
- Uses SONiC, a Debian-based OS made for their network appliances, which supports hardware from many different vendors. Also uses DASH, which provides a consistent set of APIs on top of SONiC hosts (https://github.com/sonic-net/DASH)
- Within an Azure Virtual Network, there are limitations on which protocols you can use: Layer 2 features like GARP/RARP are not supported, multicast is not supported, and out-of-order (OOO) packets are dropped by the VNET.
- Azure is using a specific OS version of Windows Server called OneCore, which is a very minimalistic version of Windows Server.
- All resource creates, updates and deletes are handled by the resource provider Microsoft.Network, which is responsible for all network-related changes within a region (most likely running on a Service Fabric cluster)
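To put the VLAN versus VXLAN numbers above into perspective, here is a small sketch of the arithmetic. It only assumes the standard field widths (a 12-bit VLAN ID and a 24-bit VXLAN network identifier); it says nothing about how Azure actually allocates identifiers internally.

```python
# Field widths from the 802.1Q and VXLAN specifications.
VLAN_ID_BITS = 12    # VLAN ID field in an 802.1Q tag
VXLAN_VNI_BITS = 24  # network identifier field in a VXLAN header


def id_space(bits: int) -> int:
    """Number of distinct identifiers a field of the given width can carry."""
    return 2 ** bits


vlan_ids = id_space(VLAN_ID_BITS)
vni_ids = id_space(VXLAN_VNI_BITS)

print(f"VLAN IDs:   {vlan_ids:,}")
print(f"VXLAN VNIs: {vni_ids:,}")
print(f"VXLAN has {vni_ids // vlan_ids:,}x more identifiers")
```

That is where the "4096" and "16 million" figures come from: 2^12 and 2^24.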
When you create a virtual network in Azure, it gets a VNI (Virtual Network Identifier), and if we for some reason want to create a VNET Peering with another network, those networks exchange routes so that they can automatically communicate with each other. While this seems simple to do in the portal, there are a lot of things happening under the hood that we don’t see. It’s also important to understand the traffic flow, since for instance outbound NSG rules are processed before routing rules.
So when it comes to actually monitoring networking in Azure what kind of tools / services do we have?
- For Virtual Machines
  - VM Insights (agent-collected data about processes and inbound/outbound connectivity) – data stored in Log Analytics
  - Defender for Endpoint / Servers (EDR agent collects data about processes and inbound/outbound URLs) – data stored in Defender
  - Connection Monitor (agent that can be used to monitor TCP-based endpoints or public URLs) – data stored in Log Analytics and Network Watcher
  - Effective Routes on a NIC are useful to see which routes have been advertised to a virtual machine. This will display Private Endpoints, peered VNETs, mesh networks and service endpoints.
- PaaS services
  - Most services store data in Diagnostic Logs, such as the Azure Firewall, but not all services have good network logs. Also, if you have Private Endpoints, they are essentially black holes, so you have very limited insight into what kind of traffic is going on there, except if the service behind them has some traffic logs. The interesting part is that at Microsoft Ignite, Microsoft announced a new service called Network Security Perimeter, which adds an additional firewall layer on top of PaaS services. While the service is still in preview and few PaaS services are supported, it also provides additional logs for private endpoints being accessed via the perimeter.
We also have VNET Flow logs, which is the new version of the previous NSG Flow logs. Combined with Traffic Analytics, they can provide some great insight into the traffic traversing your virtual networks. VNET Flow logs do not collect all network data as you would with RSPAN; they only contain the metadata of the traffic.
NOTE: VNET Flow logs do not support Private Endpoints or VWAN deployments…
However, compared to NSG Flow logs, which will be deprecated, VNET Flow logs can also be used against API Management and VPN Gateway. You can then use Kusto queries to see how much traffic is going back and forth between your VNETs and subnets, including protocol and ports.
It should be noted that this feature should be used for
1: Troubleshooting purposes, or
2: Compliance reasons
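In practice you would do this kind of aggregation with Kusto in Traffic Analytics, but to show what "just the metadata" means, here is a rough Python sketch that sums bytes per subnet pair, protocol and port from raw flow tuples. The field order is my understanding of the VNET flow log tuple format and should be checked against the current documentation; the subnets and sample tuples are made up for illustration.

```python
import ipaddress
from collections import defaultdict

# Hypothetical subnets to attribute traffic to (not from a real environment).
SUBNETS = {
    "app": ipaddress.ip_network("10.1.0.0/24"),
    "db": ipaddress.ip_network("10.2.0.0/24"),
}

# Made-up flow tuples. Assumed field order:
# timestamp,src_ip,dst_ip,src_port,dst_port,protocol,direction,state,
# encryption,packets_sent,bytes_sent,packets_received,bytes_received
SAMPLE_TUPLES = [
    "1700000000000,10.1.0.4,10.2.0.5,50001,1433,6,O,E,NX,10,1200,8,4000",
    "1700000001000,10.1.0.7,10.2.0.5,50002,1433,6,O,E,NX,5,600,4,2000",
    "1700000002000,10.2.0.5,10.1.0.4,53000,443,6,O,E,NX,3,300,3,900",
]


def subnet_of(ip: str) -> str:
    """Return the name of the known subnet containing ip, or 'other'."""
    addr = ipaddress.ip_address(ip)
    for name, network in SUBNETS.items():
        if addr in network:
            return name
    return "other"


def summarize(tuples):
    """Sum bytes (sent + received) per (src subnet, dst subnet, protocol, dst port)."""
    totals = defaultdict(int)
    for line in tuples:
        fields = line.split(",")
        key = (subnet_of(fields[1]), subnet_of(fields[2]), int(fields[5]), int(fields[4]))
        totals[key] += int(fields[10]) + int(fields[12])
    return dict(totals)


summary = summarize(SAMPLE_TUPLES)
for (src, dst, proto, port), total_bytes in sorted(summary.items()):
    print(f"{src} -> {dst} proto={proto} port={port}: {total_bytes} bytes")
```

Note that there is no payload anywhere in the input, only addresses, ports, protocol and counters, which is why this is useful for troubleshooting and compliance but is not a packet capture.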
Then we have this last fellow, the Azure Firewall, which is sometimes misunderstood if you don’t know how it works. When you provision an Azure Firewall, it uses a combination of other features, which it wraps into a nice view.
When you open the Firewall resource in Azure, you get these different rules and also a private IP. Underneath, provisioning an Azure Firewall creates a couple of things: two load balancers (one for east-west traffic and one for north-south traffic), which are connected to a virtual machine scale set that handles the actual rules.
It is also important to note that when you define rules in Azure Firewall which are NOT FQDN-based (that is, neither network rules with FQDNs nor application rules), the firewall does not do SNAT, which has some implications a bit later.
Also, when you try to “update” rules in Azure Firewall you can get frustrated from time to time about why it takes so long… This is because it is updating the VM scale set to use the new rules. Since the Azure Firewall consists of two instances, it needs to create a new instance, then remove the old one and continue on.
You should also note that Azure Firewall has a weird rule processing engine: when you use application rules it uses SNAT by default, but not when you use regular IP-based network rules. SNAT means that the firewall sets up a new TCP session where the source is the firewall, which will make sense a bit later.
Azure Firewall now also has support for Flow Trace logs, which allow us to see the TCP three-way handshake, since the regular firewall logs only give us “Allow/Deny” results. These logs require that we first enable the feature on the Network resource provider using Azure PowerShell:
Register-AzProviderFeature -FeatureName AFWEnableTcpConnectionLogging -ProviderNamespace Microsoft.Network
Then we need to enable the log in the diagnostics settings on the firewall, and then we can see logs like this
Another important feature to understand is Private Endpoints. Private Endpoints are read-only NICs which are deployed inside a subnet. By default they inject a /32 route, which you can override using Network Policies (https://msandbu.org/network-policies-for-private-endpoints-with-udr-and-nsg/)
They cannot be “monitored” directly using Azure Monitor, Flow logs or even diagnostics logs. In some cases, you have logs for the service it is connected to, or you can use the new capabilities from Network Security Perimeter if the PaaS service is supported there.
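The reason that /32 route matters is longest-prefix matching: the most specific route wins, so a broader UDR pointing at a firewall does not catch private endpoint traffic by default. Here is a minimal sketch of that selection logic. The route table, addresses and next-hop labels are hypothetical, and this is a simplified model, not how the Azure platform is implemented.

```python
import ipaddress

# Hypothetical effective routes on a NIC: (prefix, next hop).
ROUTES = [
    ("0.0.0.0/0", "firewall (UDR)"),
    ("10.20.0.0/16", "firewall (UDR)"),
    ("10.20.1.5/32", "private endpoint"),  # injected by the private endpoint
]


def next_hop(destination: str, routes=ROUTES) -> str:
    """Pick the next hop using longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    # The route with the longest prefix (most specific) wins.
    best_network, best_hop = max(matches, key=lambda match: match[0].prefixlen)
    return best_hop


print(next_hop("10.20.1.5"))  # the private endpoint address
print(next_hop("10.20.1.6"))  # a neighbouring address in the same range
```

With these sample routes, traffic to the private endpoint address bypasses the firewall while its neighbour does not, which is the behaviour the network policies setting is there to change.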
So now that we have covered some of the basics of the networking part, let us go into some scenarios where stuff broke, what the issues were and how we solved them.
Case 1:
This first case was a typical hub and spoke deployment where DNS lookups from a Kubernetes cluster to external APIs stopped working. When looking at the Azure Firewall DNS log table, I noticed that the local DNS server on the Azure Firewall was generating this error message.
Whenever a client uses UDP port 65330 for a DNS lookup it will always fail, since this is a reserved port on the Azure virtual network, so some errors are expected. However, this error was becoming more and more frequent.
Eventually we saw that the DNS timeouts were happening because the Azure Firewall was configured as a DNS proxy and needed to communicate with the backend domain controllers for DNS. However, the NSG rules configured on the domain controllers were a bit too restrictive and did not allow traffic from the Azure Firewall instances (VMSS IPs). The firewall was therefore not allowed to forward DNS requests, which also killed application rules on the Azure Firewall.
Case 2:
The issue here was that web applications deployed with Private Endpoints were not working, while Storage Accounts were working with the same configuration, in a traditional hub and spoke topology. However, when a virtual machine was deployed in the same virtual network as the private endpoint, stuff worked. Therefore we understood that it was either routing or the firewall.
After some digging we saw that the traffic was allowed using traditional network rules on the Azure Firewall, which do not use SNAT. Using network rules meant that the private endpoint was trying to communicate directly back to the requesting IP instead of the firewall, creating an asymmetric traffic flow. Converting the rules to application rules on the Azure Firewall solved this issue.
The reason it worked for Storage Accounts is that some PaaS services (while not documented) solve SNAT on their own. Secondly, Microsoft has now added a feature flag called disableSnatOnPL, which solves this issue for third-party NVAs but not for Azure Firewall.
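To make the asymmetric flow a bit more concrete, here is a deliberately tiny model of it. It assumes, as described above, that the private endpoint simply replies to whatever source address it saw; the IP addresses are hypothetical and nothing here reflects the real data path implementation.

```python
CLIENT_IP = "10.1.0.4"    # hypothetical VM in a spoke
FIREWALL_IP = "10.0.0.4"  # hypothetical Azure Firewall private IP in the hub


def source_seen_by_endpoint(snat: bool) -> str:
    """With SNAT the firewall opens its own session, so it becomes the source."""
    return FIREWALL_IP if snat else CLIENT_IP


def reply_is_symmetric(snat: bool) -> bool:
    """The request always passes the firewall; the reply only does so
    if the endpoint believes the firewall is the one it is talking to."""
    return source_seen_by_endpoint(snat) == FIREWALL_IP


for rule_type, snat in (("network rule (no SNAT)", False), ("application rule (SNAT)", True)):
    path = "symmetric" if reply_is_symmetric(snat) else "asymmetric"
    print(f"{rule_type}: reply goes to {source_seen_by_endpoint(snat)} -> {path}")
```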
Case 3:
This was a tricky one, although it happened about 7 years ago. We were in the process of setting up a file service in Azure to be used by our Kubernetes environment, which seemed to work fine, but during the upgrade process the NFS share suddenly stopped working.
We eventually saw that there was a limit of 1,000 virtual IPs in the connected VNET, which broke the NFS share from NetApp Files. This limitation has since been lifted and is now documented, but it was NOT documented when we first saw it. Fortunately we solved this with another solution, since the Standard network feature was not available when we tried to fix it.
Case 4:
This is a case where we had company mergers and systems needed to communicate with each other. The problem is that customers with existing Azure environments tend to use 10.0.0.0/16 as a default. Overlapping address spaces do not allow us to use VNET Peering, so we needed to solve this using another service.
Fortunately this was just when Virtual Network Manager came with support for connecting VNETs in a mesh topology, which works even if IP addresses overlap. It still requires that services are on ACTUALLY different subnets to allow communication, but this solved our need. If I were to solve this now I would rather use Subnet Peering (see “Azure Subnet Peering” on Cloudtrooper) or use NAT features on the VPN Gateway to hide the entire VNET behind a NAT address. Also, since Mesh is not the same as regular VNET peering, we needed to deploy an Azure Route Server to handle routing.
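If you want to check for this kind of overlap up front, Python's standard `ipaddress` module is enough. This is just a small sketch; the subnet ranges are hypothetical examples of "actually different subnets" inside two VNETs that share the same address space.

```python
import ipaddress

# Both companies used the same default address space.
vnet_a = ipaddress.ip_network("10.0.0.0/16")
vnet_b = ipaddress.ip_network("10.0.0.0/16")

# Hypothetical subnets that are actually in use on each side.
subnet_a = ipaddress.ip_network("10.0.1.0/24")
subnet_b = ipaddress.ip_network("10.0.2.0/24")


def overlaps(first, second) -> bool:
    """True if the two networks share at least one address."""
    return first.overlaps(second)


print("VNET address spaces overlap:", overlaps(vnet_a, vnet_b))
print("Subnets in use overlap:", overlaps(subnet_a, subnet_b))
```

The first check is what blocks regular VNET Peering; the second is the condition that still has to hold for the workloads to be able to talk to each other.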
Case 5:
This was something that just came out of the blue, where suddenly Office stopped working in an Azure Virtual Desktop environment and we had no idea why. We discovered that a new workload was recently deployed in AKS, but other than that no changes.
What we then noticed was that more users had started to use AVD and Office applications, and since Office uses A LOT of ports, it eventually led to SNAT port exhaustion on the Azure Firewall.
SNAT port exhaustion on the Azure Firewall can happen since it only provides 2,496 SNAT ports per public IP address per backend instance (SNAT port utilization is a metric you can view on the Azure Firewall). You can add more PIPs to the service, but each one just adds another 2,496 SNAT ports per instance. However, another supported topology is to use a NAT Gateway in front of the Azure Firewall, which gave us up to 64,512 SNAT ports per public IP. Problem solved.
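A quick back-of-the-envelope sketch of the difference. The per-IP figures are the ones I know from Microsoft's documentation (2,496 SNAT ports per public IP per firewall instance, 64,512 per public IP on NAT Gateway), so verify them against current limits; the instance count of two is the minimum deployment described earlier.

```python
# Figures as documented by Microsoft at the time of writing; verify before relying on them.
FIREWALL_PORTS_PER_PIP_PER_INSTANCE = 2496
NAT_GATEWAY_PORTS_PER_PIP = 64512


def firewall_snat_ports(public_ips: int, instances: int = 2) -> int:
    """SNAT ports available on Azure Firewall for a given number of public IPs."""
    return public_ips * instances * FIREWALL_PORTS_PER_PIP_PER_INSTANCE


def nat_gateway_snat_ports(public_ips: int) -> int:
    """SNAT ports available on a NAT Gateway for a given number of public IPs."""
    return public_ips * NAT_GATEWAY_PORTS_PER_PIP


for pips in (1, 2, 4):
    print(
        f"{pips} PIP(s): firewall={firewall_snat_ports(pips):,} "
        f"nat_gateway={nat_gateway_snat_ports(pips):,}"
    )
```

Even with several public IPs on the firewall, a single IP on the NAT Gateway gives far more headroom, which is why this topology fixed the issue for us.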
Case 6:
Another case was a customer deploying private endpoints at a large scale, with the traffic being forced through an NVA. When we were doing a FinOps assessment we noticed that the cost of the private endpoints was becoming quite high, up to 5000 USD a month for this customer. Most of these were storage account endpoints, and the NVA did not provide many capabilities that we could not cover using Service Endpoint Policies.
We also did a test to verify the performance difference between the two options, which turned out to be pretty much the same.
The customer decided to move to Service Endpoints with Policies. Services are still only available from certain Azure virtual networks, and it reduced the overhead of deploying everything with DNS configuration.
Case 7:
This final case in this blog post was a customer that wanted to have a service in Azure that could handle
- Load balancing
- WAF features
- Security features on L4 and L7
- Preserving the original source IP
- PCI-DSS compliance (which is mostly just common security features)
None of the Azure services offer all of this, so we needed to deploy NVAs, and in this case we used Palo Alto.
We ended up using Azure Gateway Load Balancer, chained to the NVAs running Palo Alto, where traffic is sent encapsulated with a VXLAN header. The VXLAN traffic hits the NVA, which inspects the traffic and forwards it to the backend virtual machines, which receive the data VXLAN-decapsulated.
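For those who have not looked at VXLAN on the wire, here is a minimal sketch of the 8-byte VXLAN header as defined in RFC 7348, with encapsulation and decapsulation of an inner frame. It leaves out the outer UDP/IP headers, and the VNI value and payload are just example values, not taken from this customer's setup.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid
VXLAN_HEADER_LEN = 8


def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend a VXLAN header: flags(1) + reserved(3) + VNI(3) + reserved(1)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decapsulate(packet: bytes):
    """Return (vni, inner_frame) from a VXLAN payload."""
    if len(packet) < VXLAN_HEADER_LEN:
        raise ValueError("packet too short for a VXLAN header")
    flags_word, vni_word = struct.unpack("!II", packet[:VXLAN_HEADER_LEN])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8, packet[VXLAN_HEADER_LEN:]


inner = b"example inner ethernet frame"  # placeholder payload
packet = vxlan_encapsulate(800, inner)   # 800 is only an example VNI
vni, recovered = vxlan_decapsulate(packet)
print(packet[:VXLAN_HEADER_LEN].hex(), vni, recovered == inner)
```

The NVA has to do the equivalent of `vxlan_decapsulate` before it can inspect the original packet, and re-encapsulate before handing the traffic back.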
This was a “quick” summary of some of the network experiences I’ve had with Azure over the last 13 years of working with it. I have some more as well which I will share in the second part, but I wanted to publish this first since it was part of my presentation. Hope you enjoyed it!