A couple of weeks ago I was contacted to help figure out why DNS lookups were failing "sometimes" for services configured with private endpoints. The intermittent nature of the problem made it difficult to troubleshoot, so it took some time to figure out what was going on.
The issue appeared when logging on to Azure OpenAI and Azure Machine Learning services that had private endpoints enabled. When a private endpoint is enabled, the portal will always try to connect through it. When they tried to submit something in the OpenAI chat they got this error message.
"Error, Completions call failed. Please try again"
But first, let me give some background on DNS resolution in Microsoft Azure.
- Virtual networks in Azure that do NOT have a DNS server setting configured will use the built-in DNS resolver in Azure. Hence DNS works by default without us setting any DNS servers.
- A private endpoint attached to a PaaS service (which is public by default) gets its own virtual network interface, and with it a private DNS zone that holds the record pointing to the private IP.
- A private DNS zone will ONLY be resolvable from the networks to which it is LINKED. Meaning that if you have a virtual machine and a PaaS service within the same virtual network, you need the private DNS zone linked to that virtual network for the VM to be able to resolve the private endpoint. Linking a private DNS zone to a virtual network is like adding a custom DNS zone to a domain controller: it overrides the lookup for that specific domain.
- In most cases, customers have a centralized DNS server which can be the private DNS resolver service in Azure, other DNS servers, or domain controllers.
- In most cases customers also have Azure Firewall which can act as a DNS proxy, essentially forwarding DNS requests to a centralized DNS service.
- In the standard hub-and-spoke architecture, services in a spoke are configured to point to the Azure Firewall, which in turn points to the centralized DNS service.
- Then in turn all the private DNS zones are linked to the virtual network where the central DNS service is located.
- If you link a private DNS zone to a virtual network that has a DNS server setting configured, resources will NOT be able to resolve the private DNS zone since all DNS requests will then be sent to the DNS server configured in the virtual network.
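To see which virtual networks a given private DNS zone is actually linked to, the Azure CLI can be used. This is a sketch; the resource group name below is a placeholder, and the zone name is just an example of a privatelink zone:

```shell
# List the virtual network links for a private DNS zone.
# "dns-rg" and the zone name are hypothetical values for your environment.
az network private-dns link vnet list \
  --resource-group dns-rg \
  --zone-name privatelink.openai.azure.com \
  --output table
```

In the centralized-DNS setup described above, you would expect to see exactly one link here: the hub virtual network where the central DNS service lives.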
Below is how the DNS lookup was configured.
It should be noted that when you configure DNS servers on a virtual network, the Azure agent on servers placed within that VNET will automatically apply the DNS servers in that order as well. If the primary DNS server is unresponsive or busy, the DNS traffic will go to the second one.
Now back to the case. We had a similar setup using a hub-and-spoke architecture: all spokes were configured to send DNS requests to the Azure Firewall DNS proxy, which in turn points to domain controllers one and two, as seen in the diagram below.
The Azure Firewall has a built-in log for DNS proxy requests (if it is configured), which we can then use for troubleshooting.
Depending on whether you use the old firewall logs or the new ones, there are different Kusto queries you can use, since the logs are stored in different tables.
- The old firewall log settings, as seen here, log EVERYTHING into the AzureDiagnostics table, and most of the metadata is stored in the msg_s field as seen in the picture below. In this case the table is AzureDiagnostics and the category is called AzureFirewallDnsProxy; there are also separate categories for application and network rules.
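As a rough sketch, a query against the old-style logs could look like this. The table name and category are as described above; the filter on msg_s is just an example, since the exact contents of that field vary between firewall versions:

```kusto
// Old (diagnostic) firewall logs: everything lands in AzureDiagnostics,
// with the DNS request details packed into the msg_s string field.
AzureDiagnostics
| where Category == "AzureFirewallDnsProxy"
| where msg_s contains "privatelink"
| project TimeGenerated, msg_s
| order by TimeGenerated desc
```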
After reviewing the DNS proxy log in Azure Firewall, we saw that all requests were heading in the right direction: toward the two domain controllers in the hub subscription.
The next step was to verify the DNS zone links to the hub virtual network, and sure enough, the private DNS zones were linked to that virtual network, meaning the private DNS zones would be resolvable from the domain controllers.
We verified this using a simple nslookup from the domain controller, which would in turn be logged by the DNS log collection feature in Microsoft Sentinel. Here we saw that the client IP address was in fact an IP address from the firewall subnet, meaning that traffic was going in the right direction: services -> Azure Firewall DNS proxy -> domain controllers, which then returned the correct IPs based upon the linked DNS zones.
We saw this using the DNS solution from Microsoft Sentinel (the Kusto query can be seen in the screenshot below), where the incoming DNS requests were visible. The majority of the DNS traffic was received on DC01, which made sense since it was the primary, and it was indeed responding with the correct private IP address. However, when DNS traffic went to DC02, it responded with the public IP address.
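The Sentinel DNS solution stores its data in the DnsEvents table, so a query along these lines shows which domain controller answered and what IP it returned (the filter on "openai" is just an example for narrowing down to the affected service):

```kusto
// DNS Analytics (Sentinel DNS solution) lookup events per domain controller.
// Computer = the DC that handled the request, IPAddresses = the answer returned.
DnsEvents
| where SubType == "LookupQuery"
| where Name contains "openai"
| summarize Requests = count() by Computer, Name, IPAddresses
| order by Requests desc
```

A split like this, where the same Name maps to a private IP on one DC and a public IP on the other, is exactly the "works sometimes" pattern described at the start.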
Aha! Now we found the culprit! For some reason DC02 was replying with the public IP address for the Azure Machine Learning and OpenAI services instead of the private IP address configured for Private Link. The reason? This setting here (zone replication). Since the zones were originally set up on the main domain controller only, they did not have zone replication enabled by default, meaning that the zone data was stored only on DC01 and not on DC02.
Once we changed the zone replication scope to all DNS servers, name resolution worked properly from both domain controllers.
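On Windows Server DNS, that change can also be made with the DnsServer PowerShell module. A sketch, assuming the zone is Active Directory-integrated (the zone name is an example):

```powershell
# Change the replication scope so the zone is replicated to
# all DNS servers in the AD domain, not stored on a single DC.
Set-DnsServerPrimaryZone -Name "privatelink.openai.azure.com" -ReplicationScope "Domain"
```

After the change, an nslookup against each domain controller should return the same private IP address.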