Everything can be found with Wireshark!
I was recently tasked with troubleshooting why some features were not working in Azure, related to the traffic flow between a service in Azure and a virtual machine on-premises. For those who are used to “Cisco” troubleshooting or network troubleshooting in general, moving that knowledge to Azure is not that straightforward. So I wanted to talk a bit about how networking in Azure works at a deeper level and how to troubleshoot network traffic flow in different scenarios.
Before we start, it is important to understand how traffic flows within Azure between the different services. Here is a simple overview of the networking flow for the main components within Azure.
Full Azure Networking picture here –> https://i.imgur.com/UkfXaG1.png
Azure Networking Overview
The first important aspect is the Virtual Network, which is contained within a single Azure region. A Virtual Network in Azure is a software-defined network built on a form of network virtualization called VXLAN, which means that every virtual network created in Azure is unique and not directly connected to any other. VXLAN wraps packets in UDP, so there is a limit on how large the MTU can be for the traffic flow. Internally in Azure the MTU for a Virtual Network is 1400 bytes (unlike the original Ethernet MTU, which is 1500 bytes). This MTU also affects VPN Gateways, so if you want to integrate VPN with on-premises or other locations you need to apply MSS clamping to avoid IP fragmentation.
Also, the Virtual Network stack in Azure is set up to drop “out of order fragments,” that is, fragmented packets that don’t arrive in their original fragmented order.
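To make the MTU numbers above a bit more concrete, here is a minimal sketch of the arithmetic. It assumes plain IPv4 and TCP headers of 20 bytes each with no options, and it ignores the extra IPsec overhead on a VPN tunnel (which depends on the cipher and pushes the usable MSS even lower). The function names are my own and only for illustration.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - TCP_HEADER


def fragment_count(packet_size: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry a packet across a given MTU."""
    if packet_size <= mtu:
        return 1
    payload = packet_size - IP_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(tcp_mss(1500))               # 1460, classic Ethernet
print(tcp_mss(1400))               # 1360, with the 1400-byte MTU from the text
print(fragment_count(1500, 1400))  # 2, a full-size Ethernet packet gets split
```

So a sender that still assumes a 1500-byte MTU produces two fragments per full-size packet, and if those arrive out of order they are dropped. Clamping the MSS keeps each TCP segment small enough to avoid the split in the first place.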
When you set up a virtual network you need at least one subnet if you want to deploy virtual machines or other resources. When you configure one or more subnets within a virtual network, all subnets are directly connected and routable by default. Within a subnet the first IP address is reserved for the Azure NFV gateway, and the second and third are used for the NFV Azure DNS service. In addition, when you set up a virtual network Azure automatically assigns a public IP address to the VNET. It is not visible to you, but it means that virtual machines on the virtual network can communicate with the Internet using SNAT, without having any public IP addresses or a NAT Gateway assigned.
Since this hidden public IP (PIP) is a single IP, you might run into SNAT port exhaustion. By default an Azure PIP has 64,000 SNAT ports, so if you distribute these over many virtual machines you might hit the limit.
By default outbound traffic is taken care of by the NFV router in Azure with the predefined PIP attached to the virtual network. However, you also have the ability to use other means of outbound network connectivity.
The following features change the outbound traffic flow of a virtual machine:
- Public IP address assigned to the VM
- Virtual Machine part of a Standard Load Balancer
- NAT Gateway assigned to a subnet in the virtual network (supersedes Load Balancer)
- NVA or Azure Firewall as next-hop using a User Defined Route
The NAT Gateway supports up to 16 public IP addresses x 64,000 ports to extend the number of supported SNAT translations. The Azure Load Balancer is not intended as a replacement for NAT; it load balances traffic coming from external connections into a pool of backend servers. Azure Load Balancer also has a NAT feature for doing PNAT to a dedicated VM and port. It also supports HA ports, which allow load balancing across the entire port range and are often used to provide high availability for network virtual appliances within Azure. It is important to know that Azure LB supports the TCP and UDP protocols but not ICMP, and it does not accept IP fragments.
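A quick back-of-the-envelope sketch of the SNAT numbers mentioned in this section. It assumes the ports are split evenly across the virtual machines, which is a simplification of how the platform actually hands out ports, so treat the results as rough orders of magnitude only. The function name is hypothetical.

```python
SNAT_PORTS_PER_PIP = 64_000  # figure quoted in the text
MAX_NAT_GATEWAY_PIPS = 16    # figure quoted in the text


def snat_ports_per_vm(vm_count: int, pip_count: int = 1) -> int:
    """Even split of SNAT ports across VMs (simplifying assumption)."""
    if vm_count < 1 or pip_count < 1:
        raise ValueError("vm_count and pip_count must be at least 1")
    return (SNAT_PORTS_PER_PIP * pip_count) // vm_count


# 50 VMs behind the single hidden PIP
print(snat_ports_per_vm(50))                        # 1280
# 50 VMs behind a NAT Gateway with the maximum of 16 PIPs
print(snat_ports_per_vm(50, MAX_NAT_GATEWAY_PIPS))  # 20480
```

With a single PIP a chatty workload can burn through its share fairly quickly, which is why the NAT Gateway with several PIPs is the safer choice for larger deployments.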
The other option is to use Azure Firewall, which is a managed PaaS service in Azure. Azure Firewall requires its own subnet called AzureFirewallSubnet and can also use multiple PIPs for outbound traffic. Azure Firewall also provides more granular firewall capabilities: in addition to five-tuple rules (Network Rules) you can also define Application Rules (FQDNs). Azure Firewall can be managed standalone, or centrally using Azure Firewall Manager if you have multiple Azure Firewall instances. For your virtual machines to use Azure Firewall for outbound traffic you need to define a User-Defined Route (UDR). User-Defined Routes are assigned at the subnet level and, for the same address prefix, take precedence over BGP routes (learned through VPN/ER) and the built-in system routes.
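Here is a minimal sketch of how I think about route selection, under the assumption that the most specific prefix wins first and that the user-defined, BGP, system order is only used to break ties between routes with the same prefix. The `Route` class, the next-hop labels and the example prefixes are made up for illustration and are not an Azure API.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional

# Lower number wins when two routes have the same prefix length.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str    # e.g. "10.1.0.0/16"
    source: str    # "user", "bgp" or "system"
    next_hop: str  # free-text label, hypothetical


def select_route(routes: Iterable[Route], destination: str) -> Optional[Route]:
    """Pick the route for a destination: longest prefix first, then source."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r.prefix).prefixlen,
            SOURCE_PRIORITY[r.source],
        ),
    )


routes = [
    Route("0.0.0.0/0", "system", "Internet"),
    Route("0.0.0.0/0", "user", "AzureFirewall"),
    Route("10.1.0.0/16", "bgp", "VirtualNetworkGateway"),
]
print(select_route(routes, "8.8.8.8").next_hop)   # AzureFirewall
print(select_route(routes, "10.1.2.3").next_hop)  # VirtualNetworkGateway
```

Note the second lookup: a UDR for 0.0.0.0/0 does not capture traffic to 10.1.0.0/16, because the BGP route is more specific. This is a common reason why traffic does not go through the firewall when you expected it to.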
The features above allow you to move away from the single PIP attached to the virtual network, so that outbound traffic comes from known and controlled PIP addresses.
From a security perspective you can also assign network security groups (NSGs) at either the vNIC or the subnet level. An NSG is used to define five-tuple rules that apply to either outbound or inbound traffic flow. NSGs at the subnet level are handled by the NFV feature on the vSwitch.
Virtual networks are by default not able to communicate with each other. If we use the VXLAN example, each virtual network has its own ID, known as a VNI. There is a feature called VNET Peering which allows us to connect two virtual networks together. Routes are then created between the two virtual networks and route tables are exchanged, essentially allowing the two VNIs to communicate with each other. This allows one virtual network to communicate with the other VNET using the default route mechanism within Azure.
When you deploy a VM to a subnet you have two options when it comes to DNS lookup. By default virtual machines use an Azure-managed DNS service which is available on the IP 168.63.129.16. This is a virtual loopback IP address which is available to all virtual machines in Azure. You can also change the DNS servers to custom servers (which is defined at the virtual network level).
When you deploy a virtual machine on a subnet, there are some quotas that you need to be aware of. First off, there is a bandwidth limit on the vNICs that are assigned to the VM. This bandwidth is the total amount of throughput, shared across all the vNICs that a virtual machine has. So for instance a D2_v3 supports 2 vNICs, 1000 Mbps of bandwidth and a flow limit of up to 500K. How close you are to the limit can be viewed within Azure Monitor for the VM.
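To illustrate what a five-tuple rule check looks like, here is a small sketch. It assumes rules are evaluated in priority order with the lowest number first and that the first match decides, and it uses a plain "Deny" as a stand-in when nothing matches (a real NSG has its own built-in default rules instead). The `Rule` class and the example addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    priority: int            # lower number is evaluated first
    action: str              # "Allow" or "Deny"
    protocol: str            # "TCP", "UDP" or "*" for any
    src_prefix: str          # CIDR, e.g. "10.0.0.0/24"
    src_port: Optional[int]  # None means any port
    dst_prefix: str          # CIDR
    dst_port: Optional[int]  # None means any port


def matches(rule: Rule, protocol: str, src_ip: str, src_port: int,
            dst_ip: str, dst_port: int) -> bool:
    """True if all five fields of the flow fit the rule."""
    return (
        rule.protocol in ("*", protocol)
        and ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.src_prefix)
        and rule.src_port in (None, src_port)
        and ipaddress.ip_address(dst_ip) in ipaddress.ip_network(rule.dst_prefix)
        and rule.dst_port in (None, dst_port)
    )


def evaluate(rules: Iterable[Rule], protocol: str, src_ip: str, src_port: int,
             dst_ip: str, dst_port: int) -> str:
    """Return the action of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, protocol, src_ip, src_port, dst_ip, dst_port):
            return rule.action
    return "Deny"  # stand-in for the platform's default rules (assumption)


rules = [
    Rule(200, "Deny", "*", "0.0.0.0/0", None, "0.0.0.0/0", None),
    Rule(100, "Allow", "TCP", "10.0.0.0/24", None, "192.168.1.0/24", 443),
]
print(evaluate(rules, "TCP", "10.0.0.4", 50000, "192.168.1.10", 443))   # Allow
print(evaluate(rules, "TCP", "10.0.0.4", 50000, "192.168.1.10", 3389))  # Deny
```

The order in the list does not matter, only the priority number does, which is worth remembering when a broad deny rule seems to be "ignored".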
A Long way to go before I meet the limit….
This limit is for outbound traffic and not inbound. A virtual machine can have multiple vNICs (depending on SKU), each of which can have multiple IP configurations attached to it.
Virtual machines within Azure by default use the regular SDN stack with VXLAN that Microsoft has, which also uses a hypervisor-based vSwitch. The encapsulation and overhead of the vSwitch can add latency to the network flow. Within Azure you also have the ability to deploy virtual machines using Accelerated Networking (supported for Windows Server and for common Linux distributions). With Accelerated Networking, network traffic arrives at the VM’s network interface (NIC) and is then forwarded directly to the VM, bypassing the hypervisor vSwitch. All network policies that the virtual switch applies are offloaded and applied in hardware, using the SR-IOV feature. If you need low-latency connections between multiple servers in Azure, deploy them with Accelerated Networking in combination with proximity placement groups, which are logical groupings that make sure Azure compute resources are physically located close to each other. Note that placing services across different availability zones will affect latency (not by much, but sensitive applications will be affected).
Source: https://blog.thousandeyes.com/microsoft-azure-releases-performance-dashboard-thousandeyes/
To connect an Azure virtual network to on-premises, there are some options to choose from. You can use a standard VPN Gateway, Virtual WAN or ExpressRoute, or a combination of these services. (I’m not going into detail on ExpressRoute here but will come back to it later.)
Azure VPN Gateway supports S2S and P2S VPN (limits in terms of connections and bandwidth depend on the SKU chosen). As an example, a VPN Gateway VPNGW2 has a theoretical bandwidth of 1.25 Gbps, but that also depends on the encryption scheme that is used. Different encryption algorithms will give different results:
- GCMAES256 – 1.25 Gbps
- AES256 & SHA256 – 550 Mbps
- DES3 & SHA256 – 120 Mbps
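To get a feel for what these numbers mean in practice, here is a small sketch that estimates how long a transfer takes at each of the quoted rates. It assumes the tunnel is the only bottleneck, that the full theoretical rate is available, and decimal units (1 GB = 8,000 megabits), so real transfers will be slower. The function name is my own.

```python
# Theoretical throughput per algorithm, in Mbps, as quoted in the list above.
THROUGHPUT_MBPS = {
    "GCMAES256": 1250,
    "AES256 & SHA256": 550,
    "DES3 & SHA256": 120,
}


def transfer_seconds(size_gb: float, algorithm: str) -> float:
    """Best-case time to move size_gb gigabytes through the tunnel."""
    megabits = size_gb * 8 * 1000  # decimal GB to megabits
    return megabits / THROUGHPUT_MBPS[algorithm]


for algorithm in THROUGHPUT_MBPS:
    minutes = transfer_seconds(100, algorithm) / 60
    print(f"{algorithm}: {minutes:.1f} min for 100 GB")
```

The same 100 GB goes from roughly eleven minutes to close to two hours depending on the algorithm, so it is worth checking what the on-premises device actually negotiates.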
Azure VPN Gateway is used for Internet-based access: a software-based connection going from the local datacenter, via the ISP, to the Microsoft PoP closest to the service.
Azure Virtual WAN is a bit different. It supports the same features in terms of S2S and P2S, but uses anycast, which allows Microsoft to advertise the PIP of the Azure VWAN Gateway from the closest Microsoft PoP, thereby reducing the latency from the on-premises datacenter to the Azure global network.
By default, Azure VPN Gateway is deployed within a dedicated subnet called GatewaySubnet. The service also consists of an active/passive pair of nodes by default, and some ports are open by default for management purposes (such as port 20000, which is used for connection information replication, and 8081, which can be used for health monitoring). Also remember that the bandwidth that is available for a VPN Gateway is shared between the S2S and the P2S connections.
From a security perspective, we also have Azure DDoS Protection, which is aimed at protecting public IP addresses in Azure from Layer 3 and 4 network attacks. This feature monitors all PIPs for traffic patterns, and you can define custom monitoring and alerts based on its metrics.
Monitoring & Troubleshooting
Great, so now we have a more fundamental understanding of the core networking features in Azure. So what about monitoring and troubleshooting?
Full picture here –> https://i.imgur.com/JJ3Yhjm.png
All the services that I described above have a diagnostics/logging feature which collects events, alerts and metrics into a service called Log Analytics. Log Analytics is a log aggregation service, essentially a database that collects data from the different resources.
Log Analytics also comes with different modules which can enhance the data it collects or even collect new data. One example is Azure Sentinel, which is a SIEM module that provides analytics and automated response on the data that is collected into Log Analytics. Another example is Network Performance Monitor, which uses agents on Windows machines to detect latency and packet drops between two endpoints. The data it collects is gathered into Log Analytics.
Here is an example of setting up diagnostics into a Log Analytics workspace
https://docs.microsoft.com/en-us/azure/azure-monitor/platform/diagnostic-settings
In addition, a lot of the different services within Azure collect data into Log Analytics, such as VPN Gateway, Azure Firewall, Application Gateway, Network Security Groups and so on (as long as it is enabled). Below are some examples of what is collected from the different services.
NOTE: You can view the different resource logs that are collected here –> https://docs.microsoft.com/en-us/azure/azure-monitor/platform/resource-logs-categories
Azure Firewall
AzureFirewallApplicationRule – Azure Firewall Application Rule
AzureFirewallNetworkRule – Azure Firewall Network Rule
Application Gateway
ApplicationGatewayAccessLog – Application Gateway Access Log
ApplicationGatewayFirewallLog – Application Gateway Firewall Log
ApplicationGatewayPerformanceLog – Application Gateway Performance Log
Azure VPN Gateway
GatewayDiagnosticLog – Gateway Diagnostic Logs
IKEDiagnosticLog – IKE Diagnostic Logs
P2SDiagnosticLog – P2S Diagnostic Logs
RouteDiagnosticLog – Route Diagnostic Logs
TunnelDiagnosticLog – Tunnel Diagnostic Logs
Public IP Addresses
DDoSMitigationFlowLogs – Flow logs of DDoS mitigation decisions
DDoSMitigationReports – Reports of DDoS mitigations
DDoSProtectionNotifications – DDoS protection notifications
Network Security Groups
NetworkSecurityGroupEvent – Network Security Group Event
NetworkSecurityGroupFlowEvent – Network Security Group Rule Flow Event
NetworkSecurityGroupRuleCounter – Network Security Group Rule Counter
In addition to this you can also collect Flow Logs, which contain metadata about the traffic going through a network security group. Flows are logged regardless of whether they are allowed or denied. Flow logs are the source of truth for all network activity in your cloud environment.
Some more info about Flow Logs:
- Flow logs operate at Layer 4 and record all IP flows going in and out of an NSG
- Logs are collected at 1-min interval through the Azure platform and do not affect customer resources or network performance in any way.
- Logs are written in the JSON format and show outbound and inbound flows on a per NSG rule basis.
- Each log record contains the network interface (NIC) the flow applies to, 5-tuple information, the traffic decision and (Version 2 only) throughput information. See the Log Format section of the Azure documentation for full details.
- Flow Logs have a retention feature that allows automatically deleting the logs up to a year after their creation.
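As a rough illustration of what is inside a flow log record, here is a sketch that parses a single comma-separated flow tuple. The field order is my understanding of the Version 2 format (timestamp, source IP, destination IP, source port, destination port, protocol, direction, decision, flow state, then packet and byte counters in each direction), so double-check it against the official log format before relying on it. The sample tuple is made up and the `FlowTuple` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

PROTOCOLS = {"T": "TCP", "U": "UDP"}
DIRECTIONS = {"I": "Inbound", "O": "Outbound"}
DECISIONS = {"A": "Allow", "D": "Deny"}


@dataclass(frozen=True)
class FlowTuple:
    timestamp: int  # Unix epoch seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    direction: str
    decision: str
    state: Optional[str] = None             # Version 2 only
    bytes_src_to_dst: Optional[int] = None  # Version 2 only
    bytes_dst_to_src: Optional[int] = None  # Version 2 only


def parse_flow_tuple(raw: str) -> FlowTuple:
    """Parse one flow tuple string; the Version 2 fields are optional."""
    fields = raw.split(",")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")

    def optional_int(index: int) -> Optional[int]:
        # Counters can be missing (Version 1) or empty (new flows).
        return int(fields[index]) if len(fields) > index and fields[index] else None

    return FlowTuple(
        timestamp=int(fields[0]),
        src_ip=fields[1],
        dst_ip=fields[2],
        src_port=int(fields[3]),
        dst_port=int(fields[4]),
        protocol=PROTOCOLS.get(fields[5], fields[5]),
        direction=DIRECTIONS.get(fields[6], fields[6]),
        decision=DECISIONS.get(fields[7], fields[7]),
        state=fields[8] if len(fields) > 8 and fields[8] else None,
        bytes_src_to_dst=optional_int(10),
        bytes_dst_to_src=optional_int(12),
    )


# Made-up sample tuple, not taken from a real log.
sample = "1542110437,10.0.0.4,203.0.113.10,44931,443,T,O,A,E,25,4409,15,4633"
flow = parse_flow_tuple(sample)
print(flow.direction, flow.decision, flow.dst_ip, flow.dst_port)
```

Filtering these parsed tuples on `decision == "Deny"` is a quick way to spot which flows an NSG is blocking when you only have the raw JSON in a storage account.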
It should be noted that Flow Logs are currently not supported for all services.
1: Issue with Application Gateway V2 Subnet NSG: Flow logging on the application gateway V2 subnet NSG is not supported currently.
2: Due to current platform limitations, a small set of Azure services are not supported by NSG Flow Logs which are currently AKS and Logic Apps.
In addition, we can also collect information from the Windows virtual machines, using the agent to collect Windows Firewall logs as well. For that you need to use the custom log collection: https://docs.microsoft.com/en-us/azure/azure-monitor/platform/data-sources-windows-events
So where do we start? In my example I had a faulty service which was not connecting to the on-premises environment. We start with the basics.
0: Are the Azure services working? Has this happened before? Yes. Use the Azure Monitor Networking view as a good indication.
1: Is the VPN tunnel active? If not, check the status to find out why. If for some reason the VPN tunnel is not active, you can use the built-in VPN diagnostics feature to check why. This drops the IKE connection information into a storage account. You should also make sure that you are running the correct IPsec configuration.
You can also use the built-in diagnostic logs to check for issues. Also, do not have an NSG applied to the gateway subnet.
2: If you have an advanced virtual network, check that routing and VNET peering are in place. If you have multiple virtual networks you need to have VNET peering enabled to make sure that traffic is flowing from one VNET to another. If you want a resource in one VNET to use the gateway that is within another VNET you also need to enable “Use Remote Gateway” on the VNET peering. If you are using an NVA or Azure Firewall to centralize network control, make sure that you have route tables (UDRs) that control the network flow. A good way to check the route table is to go into the NIC resource in the Azure Portal and show the “Effective Routes”.
3: Make sure that you do not have any NSG or firewall rules in place that can block the traffic. Here you can use the built-in Network Watcher component called IP Flow Verify, which checks against the NSG rules that have been configured. For Azure Firewall you need to check the diagnostic logs using Log Analytics and Kusto queries.
Example to check for deny rules
AzureDiagnostics
| where TimeGenerated > ago(24h)
| where OperationName == "AzureFirewallNetworkRuleLog"
| parse msg_s with * "Action:" Action
| parse msg_s with * "from" from
| where Action contains "Deny"
| project Action, from
There might be a combination of Azure Firewall rules and NSG rules that apply. If you use Flow Logs and Traffic Analytics you can also use that to get insight, as shown here –> https://msandbu.org/24-hours-of-network-traffic-analysis-from-microsoft-azure/
4: Make sure you understand the outbound connection flow. If you have a set of virtual machines that have a NAT gateway applied or are part of a public load balancer, they will by default use either of those for outbound connectivity, unless you have a UDR which specifies where traffic should flow, which will overrule the outbound flow of the other services. You can still run into issues with asymmetric routing.
5: Is the protocol supported? If you are using services that rely on some older protocol, leverage GARP/RARP for failover, or use anycast or other fancy protocols (PXE), they are not supported.
6: Is traffic slow? It might be a case of IP fragmentation, or of the IP flow limit being reached on the virtual machine. Check the IP flow count in Azure Monitor and also check the available bandwidth. The virtual NICs in Azure can show 50 Gbps of throughput inside the OS, but the limit is enforced outside of the OS and can therefore be a cap on the virtual NIC or on the packet flow. This can be shown using Wireshark or other built-in PCAP collection tools. Here is an example.
Not all applications handle that well. Secondly, you might have more latency from the application itself to its backend services. Look at services such as proximity placement groups here to get as low latency as possible.
So in my example, the connection was not flowing from my resource in Azure to on-premises. I looked at the traffic flow, making sure that routes were in place and that traffic was going through the NSGs and the Azure Firewall. Since the service was using an FQDN, I checked that DNS lookup was working as it should. So what was the culprit?
The on-premises Firewall! Case closed.