To be perfectly honest, this blog post is based on my own research rather than personal experience. I have gone through and tested different solutions for securing an Azure Kubernetes Service (AKS) environment and its workloads.
A brief overview of the architecture within AKS: Microsoft manages the control plane components such as etcd, the scheduler, and the API server. By default a Kubernetes cluster in Azure is public, meaning that the API server is publicly reachable and exposed through an Azure Standard load balancer.
Access and Authorization
- Access to Kubernetes API and Kubectl
When you provision a Kubernetes cluster in Azure, by default it uses regular Kubernetes-based RBAC and permissions. Users with the correct permissions in Azure can then, by default, interact with the Kubernetes cluster (even if they shouldn’t have access). This is possible through Azure AD permissions.
This means that if a user has access to Azure and has the right permission, they can download the admin credentials using the az aks get-credentials command and interact with the cluster, for example from Cloud Shell. This applies to everyone with Contributor or Owner access. Clusters that do not use Azure AD only use the cluster-admin role.
Secondly, when you provision an AKS cluster it is public by default. This means that you can connect to the API server using the public IP address that is assigned to it. If you then connect with the cluster-admin credentials, no RBAC restrictions apply.
- Authorized IP Range for Kubernetes API
One option we have to lock down access to the API server is to use authorized IP ranges for the Kubernetes API. This effectively adds a set of firewall rules that restrict which addresses can reach the API server. The configuration can be set when you create a cluster, or you can update an existing cluster:
az aks update --resource-group myResourceGroup --name myAKSCluster --api-server-authorized-ip-ranges 220.127.116.11/24
It is important to know that all services that need to interact with the API server also need to have their IP ranges included.
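Because every client and service that talks to the API server has to be covered, the list of ranges tends to grow. Here is a minimal sketch of assembling that list. The two ranges are hypothetical examples taken from the documentation-reserved address blocks, and the resource group and cluster names are the same placeholders used above. The az command is left commented out so the snippet can be run as a dry run first.

```shell
# Join all arguments into the comma-separated form that
# --api-server-authorized-ip-ranges accepts.
join_ranges() (
  IFS=,
  printf '%s\n' "$*"
)

# Hypothetical ranges: an office network and a single build agent.
AUTHORIZED_RANGES="$(join_ranges 203.0.113.0/24 198.51.100.17/32)"
echo "$AUTHORIZED_RANGES"

# Uncomment to apply the list to the cluster:
# az aks update --resource-group myResourceGroup --name myAKSCluster \
#   --api-server-authorized-ip-ranges "$AUTHORIZED_RANGES"
```

Keeping the ranges in one place like this makes it easier to review who is actually allowed to reach the API server.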
- Private Cluster
Still having a cluster that is publicly available might not be the best option. We also have the option to configure a private cluster where the control plane or API server has internal IP addresses. This feature uses Azure Private Link to create a virtual NIC within the Kubernetes virtual network to expose the API Interface there. Since it uses Private Link it also means that it will create a Private DNS Zone which will be assigned to the AKS virtual network. It should be noted that it has some limitations.
- It cannot be used in combination with authorized IP ranges
- It cannot be used together with Microsoft-hosted DevOps agents.
- If you want to use it together with Azure Container Registry you need to use it with VNET and VNET Peering
- In the case of maintenance on the control plane, your AKS IP might change. In this case, you must update the A record pointing to the API server private IP on your custom DNS server and restart any custom pods or deployments.
If you want to use a Private Cluster it needs to be done during provisioning.
az aks create -n <private-cluster-name> -g <private-cluster-resource-group> --load-balancer-sku standard --enable-private-cluster
- Azure AD RBAC to Kubernetes
The options above restrict network access to the API server, and native Kubernetes cluster-admin access controls who can use it, but neither of those options integrates with Azure AD. We also have the option to configure Kubernetes role-based access control (Kubernetes RBAC) based on a user’s identity or directory group membership in Azure Active Directory.
Using Kubernetes RBAC and Azure AD-integration, you can secure the API server and provide the minimum permissions required to a scoped resource set, like a single namespace. You can grant different Azure AD users or groups different Kubernetes roles.
To enable Azure AD Integration for authentication you can use the following command
az aks update -g resourcegroup -n nameofcluster --enable-aad --aad-admin-group-object-ids azureadgroupid
When you enable this and try to login you will get this redirect from the API server
NOTE: You can still get admin access using the cluster credentials with the command az aks get-credentials --resource-group myResourceGroup --name myManagedCluster --admin
You also have two options when it comes to RBAC: even when authentication is integrated with Azure AD, you can decide whether you want to control RBAC within Kubernetes or in Azure AD.
So, I’m going to focus on using Azure AD-based RBAC in Kubernetes. With the Azure RBAC integration, AKS will use a Kubernetes Authorization webhook server so you can manage Azure AD-integrated Kubernetes cluster resource permissions and assignments using Azure role definition and role assignments.
Enabling Azure RBAC can also be done for new and existing Kubernetes Clusters.
az aks update -g resourcegroup -n nameofcluster --enable-azure-rbac
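With Azure RBAC enabled, access is granted through ordinary Azure role assignments, and the scope can be narrowed to a single namespace. Here is a minimal sketch, assuming a hypothetical subscription ID and a hypothetical namespace called demo. The resource group, cluster name and group ID are the placeholders used above, and the az command is commented out so the snippet only prints the scope it would use.

```shell
# Hypothetical placeholders; replace with your own values.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP="resourcegroup"
CLUSTER_NAME="nameofcluster"
NAMESPACE="demo"

# Resource ID of the cluster, normally obtained with:
#   az aks show -g "$RESOURCE_GROUP" -n "$CLUSTER_NAME" --query id -o tsv
AKS_ID="/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.ContainerService/managedClusters/$CLUSTER_NAME"

# Appending /namespaces/<name> narrows the assignment to one namespace.
SCOPE="$AKS_ID/namespaces/$NAMESPACE"
echo "$SCOPE"

# Uncomment to grant read-only access in that namespace
# (azureadgroupid is the object ID of an Azure AD group):
# az role assignment create --role "Azure Kubernetes Service RBAC Reader" \
#   --assignee azureadgroupid --scope "$SCOPE"
```

Leaving off the /namespaces/ suffix would apply the role across the whole cluster, so the namespaced scope is the more conservative starting point.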
Once RBAC and Azure AD integration is in place you should be able to see this enabled in the portal under Cluster Configuration
Here are all the different permissions that can be assigned within the Kubernetes roles in Azure AD –> Azure resource provider operations | Microsoft Docs
When you have Azure AD integrated, you can also monitor activity against the Kubernetes environment. As one example, you can use Azure Monitor to track sign-in attempts to AKS:
SigninLogs | where AppDisplayName == "Azure Kubernetes Service AAD Client"
Hosts and Cluster
When you install AKS you get a predefined set of virtual machines running as a virtual machine scale set (VMSS), running either Ubuntu (for Linux worker nodes) or Windows Server. The Linux-based VMs use an Ubuntu image, with the OS configured to automatically check for updates every night. Some security updates, such as kernel updates, require a node reboot. A Linux node that requires a reboot creates a file named /var/run/reboot-required. The reboot doesn’t happen automatically, so it needs to be done manually, or you can use for instance Kured, a Kubernetes reboot daemon (weaveworks/kured: Kubernetes Reboot Daemon (github.com)).
This tool watches for the presence of a reboot sentinel file and then performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
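The sentinel check itself is simple. Here is a minimal sketch of the idea, not Kured's actual implementation: a hypothetical helper, needs_reboot, that succeeds when the file exists. It defaults to the path mentioned above and accepts another path as an optional argument.

```shell
# Return success (0) when the reboot sentinel file exists.
# The path defaults to the one created on Ubuntu nodes.
needs_reboot() {
  [ -f "${1:-/var/run/reboot-required}" ]
}

if needs_reboot; then
  echo "reboot required"
else
  echo "no reboot pending"
fi
```

Kured adds the important parts on top of this check, such as draining the node and making sure only one node reboots at a time.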
These Ubuntu-based nodes also run containerd as the container runtime.
AKS performs regular maintenance on the cluster automatically, and by default this work can happen at any time.
With a preview feature called Planned Maintenance you can schedule weekly maintenance windows in which the control plane and the kube-system pods on a VMSS instance are updated, which minimizes workload impact. As an example, you can set the maintenance window to 1 AM on Monday:
az aks maintenanceconfiguration add -g MyResourceGroup --cluster-name myAKSCluster --name default --weekday Monday --start-hour 1
When you need to upgrade from one Kubernetes version to another you can either do that manually or using the auto-upgrade channel mechanism. Both processes will follow
Monitoring within Kubernetes draws on multiple data sources to get the right insight into what is going on. In Azure there are different log sources that contain information about the different pieces.
Azure Kubernetes Diagnostics
Container Insight (Collects the following data from each container)
- ContainerLog (stdout, stderr, and/or environmental variables)
- Performance Metrics for Nodes
Enabling Container Insight can generate a lot of Log Events and flood your log analytics workspace with data. You can also use ConfigMaps to configure what kind of data should be collected by the OMS Agents that will be installed on the Kubernetes Cluster.
You also have Azure AD Audit Logs which, as mentioned earlier, are useful for getting logs from administrators that have access to the Kubernetes Cluster API (only visible when the cluster is integrated with Azure AD).
- Threat Detection
In terms of Threat detection for Azure Kubernetes, it can apply on two layers. First on the cluster level and secondly on the workload level. I’ve previously written about it here –> Getting started with Azure Defender and Azure Monitor for Kubernetes using Azure Arc | Marius Sandbu (msandbu.org) but it requires a log analytics agent running on the cluster to be able to provide threat detection for the workload. You can see a list of alerts and detection rules for AKS here –> Reference table for all security alerts in Azure Security Center | Microsoft Docs
- Network Policies
When running services in Kubernetes they are deployed into a pod that runs a service (not to be confused with the term Kubernetes service). However, by default, all pods can communicate with one another. Every Pod gets its own IP address. This means you do not need to explicitly create links between Pods and you almost never need to deal with mapping container ports to host ports.
To lock down access you can either use a Network Policy or use a service mesh. Network Policies are an application-centric construct that allows you to specify how a pod is allowed to communicate with various network entities.
The entities that a Pod can communicate with are identified through a combination of the following 3 identifiers:
- Other pods that are allowed (exception: a pod cannot block access to itself)
- Namespaces that are allowed
- IP blocks (exception: traffic to and from the node where a Pod is running is always allowed, regardless of the IP address of the Pod or the node)
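The three identifiers above correspond to the podSelector, namespaceSelector and ipBlock fields of a NetworkPolicy. Here is a minimal sketch that writes such a policy to a file. The namespace, labels, port and CIDR are all hypothetical examples, and the kubectl command is commented out so nothing is applied until you have reviewed the manifest.

```shell
# Write a NetworkPolicy manifest; all names, labels, the port and the CIDR
# below are hypothetical examples.
cat > allow-backend-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-ingress
  namespace: demo
spec:
  # The policy applies to pods labelled app=backend in the demo namespace.
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Pods in the same namespace labelled app=frontend
        - podSelector:
            matchLabels:
              app: frontend
        # Any pod in namespaces labelled purpose=monitoring
        - namespaceSelector:
            matchLabels:
              purpose: monitoring
        # Clients in this address range
        - ipBlock:
            cidr: 10.240.0.0/16
      ports:
        - protocol: TCP
          port: 8080
EOF

# Uncomment to apply it to the cluster:
# kubectl apply -f allow-backend-ingress.yaml
```

Once a pod is selected by an ingress policy like this one, any ingress traffic that is not explicitly allowed is dropped, so it is worth testing in a non-production namespace first.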
Within Azure there are two main approaches to Network Policies.
- Azure’s own implementation, called Azure Network Policies (supports Linux), which is supported by the Azure support team
- Calico Network Policies, an open-source network and network security solution founded by Tigera (supports Linux and Windows), which also works with the Kubenet CNI
NOTE: You can read the difference between Azure CNI and Kubenet here –> Azure Kubernetes Service Kubenet vs Azure CNI | Marius Sandbu (msandbu.org)
Both implementations use Linux IPTables to enforce the specified policies. Policies are translated into sets of allowed and disallowed IP pairs. These pairs are then programmed as IPTable filter rules.
NOTE: The network policy feature can only be enabled when the cluster is created. You can’t enable network policy on an existing AKS cluster.
It should also be noted that network policies are more of an allow/deny firewall mechanism. They do not provide granular access control or observability into the traffic layer. For that you should look into a service mesh.
A service mesh uses a different approach, where it deploys a sidecar container (a proxy) that handles traffic in and out of the service, much like a traditional reverse-proxy architecture. This means that we get visibility, more granular traffic steering, and load balancing mechanisms.
- Image Scanning
As part of the workload, there might be numerous different container images that are used as part of the deployment, and it can become difficult to monitor the different images/dependencies that are used for each service. There are many services that offer image scanning mechanisms that can scan container images for known vulnerabilities. Azure Container Registry has this feature (powered by Qualys) which is part of Azure Defender.
Azure Defender pulls the image from the registry and runs it in an isolated sandbox with the Qualys scanner. The scanner extracts a list of known vulnerabilities.
The scan can occur based upon three different triggers: on push, on recently pulled images, or on import.
As an example I uploaded the damn vulnerable docker image which can be found here –> vulnerables/web-dvwa (docker.com)
Already I could see security alerts from Security Center.
Other best practices
- Uptime SLA
This is not a security best practice but an operational one. Uptime SLA is an optional AKS feature that enables an SLA for a cluster. Uptime SLA guarantees 99.95% availability of the Kubernetes API server endpoint for clusters that use Availability Zones and 99.9% availability for clusters that don’t use Availability Zones. Clusters without Uptime SLA enabled get a service level objective (SLO) of 99.5%.
This can be enabled for a cluster using the following command:
az aks update --resource-group myResourceGroup --name myAKSCluster --uptime-sla
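To put those percentages in perspective, here is a small sketch that converts an availability figure into the downtime it permits. It assumes a 30-day month for simplicity, and downtime_minutes is a hypothetical helper name.

```shell
# Print the downtime (in minutes) that a given availability percentage
# allows over a 30-day month (30 * 24 * 60 = 43200 minutes).
downtime_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 30 * 24 * 60 }'
}

for availability in 99.95 99.9 99.5; do
  echo "$availability% -> $(downtime_minutes "$availability") minutes per month"
done
```

Under that assumption, 99.95% allows roughly 21.6 minutes of API server downtime per month, 99.9% roughly 43.2 minutes, and the 99.5% SLO roughly 216 minutes.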
- Azure Policies
Azure Policies can provide desired state configuration management which can apply both to Azure as a platform and in-guest. Microsoft also provides an Azure Policy Add-on for Kubernetes clusters. It does this by extending Gatekeeper v3, an admission controller webhook for Open Policy Agent (OPA).
To define policies for in-guest clusters, you first need to enable the addon which requires registration to a new resource provider
az provider register --namespace Microsoft.PolicyInsights
Then you need to install the AKS addon.
az aks enable-addons --addons azure-policy --name MyAKSCluster --resource-group MyResourceGroup
Once that is done you should have an additional Azure-policy pod installed in the kube-system namespace
kubectl get pods -n kube-system
The default behaviour of the add-on is to check in with Azure Policy service for changes in policy assignments every 15 minutes. During this refresh cycle, the add-on checks for changes.
Here is a list of built-in Policy definitions that can be used –> Built-in policy definitions for Azure Kubernetes Service – Azure Kubernetes Service | Microsoft Docs
For instance, I assigned a policy definition based upon a policy initiative that Microsoft has created
Which I could see has taken effect by checking
kubectl get constraints
- Architecture and Best practices
In terms of architecture and best practices, I like to refer to Microsoft’s own baseline architecture for AKS –> Baseline architecture for an Azure Kubernetes Service (AKS) cluster – Azure Architecture Center | Microsoft Docs
The design goes through much of the same topics covered here but also network design, CI/CD pipeline configuration, and more.
In the next post I will go a bit more in-depth on ServiceMesh and different options that you can use in Azure.