A few weeks back, a department at my workplace invited me to share my experience with managing and optimizing cloud costs for services in Microsoft Azure. After that workshop I decided to compile my tips and tricks in this blog post, outlining how you can regain control of public cloud costs. For context, I have been working with Azure for close to 12 years, and in all those years one issue keeps coming up again and again: how to manage cost.
What are the challenges that most organizations are struggling with related to cost?
Whether you are a new adopter of public cloud or have been using it for some time, managing costs can be a challenge. Based on my observations, these are the most common challenges organizations face:
- Understanding the pricing elements – It can be challenging to determine the precise cost of a service (What is service X going to cost me each month?). For instance, we ran into this when estimating the cost of using Azure Monitor to monitor specific aspects of a virtual machine. You can review the different cost elements here –> Pricing – Azure Monitor | Microsoft Azure. As an example, these are the elements we had to consider when working out what monitoring a single VM with Azure Monitor would cost per month:
- Standard Metrics (Free)
- Custom Metrics (OS specific metrics) $0.16 / 10 million samples ingested
One custom metric approximately 8 bytes = 125 000 entries =
One custom metric that is sampled every 15 seconds
- Log Collection ($2.99 per GB per month; a single VM ingests between 3.3 GB and 10 GB)
- Metric Alert ($0.10 per alert check per month)
- Secure Web Hook ($6/1,000,000 secure web hooks)
- Log Alert ($0.30 per month for an alert check every minute, $0.05 per month for an alert check every 15 minutes)
- This of course makes it rather difficult to estimate what “monitoring” is going to cost. While this may be one of the worst examples, every service is also priced differently, which makes it hard to produce consistent estimates.
- Understanding the difference in SKUs – Most Azure services come in different SKUs, like bronze, silver, and gold. In most cases you might understand what the service costs on one SKU but not on another, or not be aware of the impact that changing from one SKU to another has on the price. The screenshot of two Service Bus examples below highlights how big the difference between one SKU and another can be. Where one change
- Keeping up to date on all the changes – Being able to keep track of the changes in the platform is crucial and can have a significant impact on cost. Azure alone implements close to 2,000 changes annually. Although not all changes may affect you directly, it can be highly advantageous to remain aware of them. For example, Microsoft recently unveiled a new DDoS protection mechanism that costs 1/15th of the previous version, and while it does have some limitations, it is sufficient for most organizations. Another instance is the introduction of a new set of storage capabilities with data disk V2, which has improved throughput and lower costs compared to the previous version, as seen in the table below.
| Disk Size | Premium v1 Performance | Premium v2 Performance | Premium v1 Cost | Premium v2 Cost |
|---|---|---|---|---|
| 128 GB | 500 IOPS / 100 MB/s | 3000 IOPS / 125 MB/s | $19.70 | $10.28 |
| 256 GB | 1100 IOPS / 125 MB/s | 3000 IOPS / 125 MB/s | $38 | $20 |
| 512 GB | 2300 IOPS / 150 MB/s | 3000 IOPS / 150 MB/s | $73 | $40 |
| 1024 GB | 5000 IOPS / 200 MB/s | 5000 IOPS / 200 MB/s | $135 | $95 |
| 2048 GB | 7500 IOPS / 250 MB/s | 7500 IOPS / 250 MB/s | $259 | $192 |
| 2048 GB | 7500 IOPS / 250 MB/s | 15000 IOPS / 500 MB/s | $259 | $251 |
- Understanding which service to use – This is something I’ve seen more frequently lately. Microsoft has a wide range of services, and some of them overlap, meaning that they provide some of the same capabilities. For example, if we need a file service that supports the SMB protocol, there are multiple ways to build this in Azure. One is Azure Files, and another is Azure NetApp Files. If we compare the two purely from a cost perspective for 10 TB of storage, Azure Files is significantly cheaper.
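To make the Azure Monitor example above more concrete, here is a minimal sketch that adds up the cost elements listed earlier for a single VM. The prices are the figures quoted in this post, not current list prices. The number of custom metrics, the number of alerts, and the reading of the alert prices as "per alert rule, per month" are my own assumptions for illustration.

```python
# Illustrative estimate only. Prices are the figures quoted in this post;
# metric counts, alert counts and the 30-day month are assumptions.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assume a 30-day month


def custom_metric_cost(metric_count, interval_seconds, price_per_10m_samples=0.16):
    """Cost of custom metrics, billed per 10 million ingested samples."""
    samples = metric_count * SECONDS_PER_MONTH / interval_seconds
    return samples / 10_000_000 * price_per_10m_samples


def log_cost(gb_per_month, price_per_gb=2.99):
    """Cost of log ingestion, billed per GB."""
    return gb_per_month * price_per_gb


def estimate_vm_monitoring(metric_count, log_gb, metric_alerts, log_alerts_1min):
    """Break the monthly monitoring cost of one VM into its elements."""
    return {
        "custom_metrics": custom_metric_cost(metric_count, interval_seconds=15),
        "logs": log_cost(log_gb),
        "metric_alerts": metric_alerts * 0.10,   # assumed: per alert, per month
        "log_alerts": log_alerts_1min * 0.30,    # assumed: per alert, per month
    }


if __name__ == "__main__":
    # Hypothetical VM: 20 custom metrics, 5 metric alerts, 2 log alerts.
    for label, log_gb in (("low", 3.3), ("high", 10.0)):
        parts = estimate_vm_monitoring(20, log_gb, metric_alerts=5, log_alerts_1min=2)
        print(label, round(sum(parts.values()), 2), parts)
```

Under these assumptions the log ingestion line dominates everything else, which is why the spread between 3.3 GB and 10 GB per VM matters far more than the sampling interval of the custom metrics.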
What kind of measures should we have in place to control cost?
1: When creating new projects or services in the public cloud, require the team to generate an estimate using the Azure Pricing Calculator. Although the initial estimate may not be entirely precise, it can provide an indication of the type and quantity of services required. This also offers the project team valuable visibility into the anticipated costs of the services they intend to use.
2: Take the estimates generated for the new service and add them as a budget in Azure Cost Management, either at the resource group or subscription level depending on the service’s architecture or design. This lets us track the costs of the service as they develop and receive alerts if the expenses are projected to surpass the estimate. It is also important that those responsible for a service put proper tags on their resources, since these can be used in Cost Management to provide more in-depth visibility into the cost of each component or project, for example letting a team compare the cost of a dev/test environment with that of the production environment.
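To illustrate the idea of being alerted when spend is projected to pass the estimate, here is a small sketch of the reasoning behind such an alert. It is not how Azure Cost Management computes its forecast. It simply assumes the daily run rate so far continues for the rest of the month, and the thresholds and amounts are made up.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date, today):
    """Linear projection: assume the daily run rate so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alerts(budget, spend_to_date, today, thresholds=(0.8, 1.0)):
    """Return one message per threshold crossed by actual or forecast spend."""
    forecast = forecast_month_end(spend_to_date, today)
    alerts = []
    for threshold in thresholds:
        limit = budget * threshold
        if spend_to_date >= limit:
            alerts.append(f"actual spend has reached {threshold:.0%} of budget")
        elif forecast >= limit:
            alerts.append(f"forecast spend will reach {threshold:.0%} of budget")
    return alerts


if __name__ == "__main__":
    # Hypothetical service: budget of 1000, 450 spent by the 10th of April.
    for message in budget_alerts(1000, 450, date(2024, 4, 10)):
        print(message)
```

The point of alerting on the forecast as well as on actual spend is timing: in the hypothetical case above, actual spend is still under half of the budget, but the run rate already points well past it.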
3: Establish guardrails to prevent unnecessary provisioning of expensive premium SKUs. You can accomplish this primarily by using Azure Policy to restrict the virtual machine SKUs and regions that can be used. Additionally, deploy policies that limit the deployment of expensive services to controlled environments only.
As an example, these built-in policies can be used as guidelines for how to limit SKUs and services:
- Limiting Service SKUs (https://www.azadvertizer.net/azpolicyadvertizer/7433c107-6db4-4ad1-b57a-a76dce0154a1.html)
- Limiting VM SKUs (https://www.azadvertizer.net/azpolicyadvertizer/cccc23c7-8427-4f53-ad12-b6a63eb452b3.html)
- Define a process that ensures that projects or new initiatives that are coming are using tags on their resource to more easily track cost utilization.
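The guardrails above boil down to two checks: is the requested SKU on an allow-list, and does the resource carry the tags you need for cost tracking? The sketch below applies both checks locally to a planned resource, which can be handy in a review or pipeline step. It is not Azure Policy itself, and the allow-list, tag names, and resource shape are hypothetical.

```python
# Hypothetical allow-list and tag names, for illustration only.
ALLOWED_VM_SKUS = {"Standard_B2s", "Standard_D2s_v5", "Standard_D4s_v5"}
REQUIRED_TAGS = ("owner", "environment", "cost-center")


def guardrail_violations(resource):
    """Return the reasons a planned resource would be rejected (empty if none)."""
    violations = []

    # SKU check only applies to virtual machines.
    if resource.get("type", "").lower() == "microsoft.compute/virtualmachines":
        sku = resource.get("sku")
        if sku not in ALLOWED_VM_SKUS:
            violations.append(f"VM SKU {sku!r} is not on the allow-list")

    # Tag check applies to every resource; empty tag values count as missing.
    tags = resource.get("tags") or {}
    for tag in REQUIRED_TAGS:
        if not tags.get(tag):
            violations.append(f"missing required tag {tag!r}")

    return violations


if __name__ == "__main__":
    planned_vm = {
        "type": "Microsoft.Compute/virtualMachines",
        "sku": "Standard_M128s",
        "tags": {"owner": "team-a"},
    }
    for reason in guardrail_violations(planned_vm):
        print(reason)
```

In Azure the enforcement would still come from policy assignments with a deny effect. A local check like this only gives teams earlier feedback.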
4: Cost governance involves implementing a set of processes that ensure a team is responsible for monitoring costs on a monthly basis. This responsibility should not fall solely on the finance department but rather on a central CCoE (Cloud Center of Excellence) or cloud advisory team that understands the services and cost components. Bi-weekly meetings should be held to review cloud consumption in your environment. While I will discuss the specific tools and considerations in more detail later, it is important to emphasize the process aspect first. These meetings are an opportunity to discuss and address any cost-related concerns, such as:
- New services or infrastructure that has been released and how it can impact your existing services
- Look at how you are using cloud services and if there are any new orphaned resources
- Look at storage consumption (data that should be moved? deleted? rehosted?)
- Look at log ingestion (data trends going up or down?)
- Check that services are configured with the right level of redundancy. Do you need GRS?
- Resource optimization – check whether there are changes that can be made to optimize cost.
- Reservations – check whether there are services that can be put on reservations. This can be services such as SQL/Storage/Web Apps/Virtual Infrastructure where capacity can be reserved for either 1 or 3 years.
- Are you using Dev/Test pricing possibilities within your Enterprise Agreement to get cheaper services for dev/test environments?
- Are services configured to scale up and down when needed? For instance, in your VDI environment or other virtual machine-based environments you can stop and start VMs on a schedule using automation tools, and with AVD you can use scaling plans.
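To give a feel for what a start/stop schedule is worth, the sketch below compares a VM that runs around the clock with one that only runs during working hours. The monthly cost is a made-up number, and the calculation covers compute only. Managed disks and other storage keep being billed while a VM is deallocated.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_fraction(hours_per_day, days_per_week):
    """Share of the week a VM is running under a start/stop schedule."""
    return hours_per_day * days_per_week / HOURS_PER_WEEK


def monthly_compute_cost(always_on_cost, hours_per_day, days_per_week):
    """Compute cost under the schedule, given the cost of running 24/7."""
    return always_on_cost * scheduled_fraction(hours_per_day, days_per_week)


if __name__ == "__main__":
    always_on = 100.0  # hypothetical monthly compute cost for a 24/7 VM
    scheduled = monthly_compute_cost(always_on, hours_per_day=12, days_per_week=5)
    print(f"scheduled: {scheduled:.2f}, saved: {always_on - scheduled:.2f}")
```

A 12-hour, five-day schedule keeps the VM running for 60 of the 168 hours in a week, so roughly two thirds of the compute cost goes away. That is worth comparing against a reservation discount for the same workload, since a reservation is paid for whether the VM is running or not.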
5: Cost ownership! Everyone should have some form of ownership of the cost of a service or product. Lately I’ve been using a tool called Infracost more and more, which is especially useful if you are using Terraform or other IaC tools. It integrates with your IDE or can run as part of a CI/CD pipeline to show you the cost of a resource while it is being built, based on the specifications in the code. The screenshot below shows the cost of the VM resource that I’m about to create.
An emerging challenge involves having a platform team and multiple other teams working on the same platform, such as a Kubernetes platform. Providing visibility into how many resources each team uses can be difficult there, especially if they are sharing the same resources. Fortunately, there are tools available to address this issue, such as Kubecost, which can provide both insight and optimization tips.
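The simplest way to reason about shared platform cost is a proportional split. The sketch below divides a cluster bill between teams based on the CPU they have requested. This is a deliberately naive model of my own for illustration, not how Kubecost calculates allocation, and the team names and numbers are hypothetical.

```python
def allocate_shared_cost(total_cost, cpu_requests_by_team):
    """Split a shared cluster bill in proportion to each team's CPU requests."""
    total_requests = sum(cpu_requests_by_team.values())
    if total_requests == 0:
        # Nothing requested: nobody can be charged proportionally.
        return {team: 0.0 for team in cpu_requests_by_team}
    return {
        team: total_cost * requested / total_requests
        for team, requested in cpu_requests_by_team.items()
    }


if __name__ == "__main__":
    # Hypothetical cluster bill and CPU cores requested per team.
    shares = allocate_shared_cost(1000.0, {"team-a": 6, "team-b": 3, "team-c": 1})
    for team, cost in shares.items():
        print(team, round(cost, 2))
```

A real allocation also has to deal with memory, storage, idle capacity, and shared system workloads, which is where a dedicated tool earns its place.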
Another concern is identifying orphaned resources. How can we determine which resources are no longer in use? All resources in Azure are represented as a JSON object, and using Azure Resource Graph, we can query this data to examine dependencies between a disk and VM or a Load Balancer and a backend service, for example. Thankfully, someone has already developed an Azure Orphan Resources dashboard, saving us the trouble of finding most of these resources. You can find the dashboard workbook here –> dolevshor/azure-orphan-resources: Centralize orphan resources in Azure environments (github.com)
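As a small illustration of the dependency idea, the sketch below looks for managed disks that are not attached to any VM. The query string is the kind of Kusto query you could try in Azure Resource Graph Explorer. Treat it as a starting point and check the property names against your own data. The Python function applies the same filter to resource objects shaped like query results, and the sample resources are made up.

```python
# A Resource Graph query sketch: disks whose managedBy field is empty
# are not attached to a VM. Verify before relying on it.
UNATTACHED_DISKS_QUERY = """
Resources
| where type =~ 'microsoft.compute/disks'
| where isempty(managedBy)
| project name, resourceGroup, location, sizeGb = properties.diskSizeGB
"""


def find_unattached_disks(resources):
    """Apply the same filter locally to a list of resource dictionaries."""
    return [
        resource["name"]
        for resource in resources
        if resource.get("type", "").lower() == "microsoft.compute/disks"
        and not resource.get("managedBy")
    ]


if __name__ == "__main__":
    # Hypothetical resources, trimmed to the fields the filter needs.
    sample = [
        {"name": "disk-app01", "type": "Microsoft.Compute/disks",
         "managedBy": "/subscriptions/.../virtualMachines/app01"},
        {"name": "disk-old", "type": "Microsoft.Compute/disks", "managedBy": ""},
        {"name": "vnet-prod", "type": "Microsoft.Network/virtualNetworks"},
    ]
    print(find_unattached_disks(sample))
```

An unattached disk is not automatically safe to delete, since it may be kept on purpose, so a list like this is best used as input to the review meeting rather than to an automated cleanup.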
Despite the many queries this dashboard uses to analyze the resource graph, it may not detect all resources that are used inefficiently or not at all, particularly those without a direct dependency. Some examples:
* Storage Accounts with no new data (just consuming space)
* Storage Accounts with a higher redundancy level than needed. Do you need GRS?
* NSG Flow logs not getting cleaned up
* Diagnostics data that you do not need being sent to Log Analytics. For instance, enabling Kubernetes diagnostics with all audit logs collected generates an extremely high amount of log data, and just removing the kube-audit logs reduces the volume greatly. I’ve written a blog post about that here –> Customize Azure Kubernetes Service Diagnostics for Azure Log Analytics – msandbu.org
* Disks that are unattached or snapshots that are not deleted
* Backup data for a VM resource that is still present
* Utilizing archive options for backup data (Azure Backup – Archive tier overview – Azure Backup | Microsoft Learn)
* Application Gateway or other network services that are no longer used.
* Check what kind of data is being collected into the log service. For instance, the Insights part of Log Analytics can show whether the incoming data is changing (which could be an indication of an attack or, as I’ve encountered in many cases, a log agent sending corrupt data). If you are using Sentinel as well, there can be much to save just by looking at what kind of data is coming in.
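For the log ingestion point, a simple way to spot changes is to compare the most recent week of ingestion per table with the week before it. The sketch below does that on daily volumes you would export yourself. The table names, numbers, and the 25% threshold are assumptions for illustration.

```python
from statistics import mean


def ingestion_growth(daily_gb, window=7):
    """Relative change between the latest window and the one before it."""
    if len(daily_gb) < 2 * window:
        raise ValueError("need at least two full windows of data")
    previous = mean(daily_gb[-2 * window:-window])
    latest = mean(daily_gb[-window:])
    if previous == 0:
        return float("inf") if latest > 0 else 0.0
    return (latest - previous) / previous


def flag_growing_tables(daily_gb_by_table, threshold=0.25):
    """Names of tables whose ingestion grew by more than the threshold."""
    return sorted(
        table
        for table, series in daily_gb_by_table.items()
        if ingestion_growth(series) > threshold
    )


if __name__ == "__main__":
    # Hypothetical daily ingestion in GB for the last 14 days.
    usage = {
        "ContainerLog": [1.0] * 7 + [2.0] * 7,
        "Heartbeat": [0.1] * 14,
    }
    print(flag_growing_tables(usage))
```

A sudden jump is worth investigating for cost reasons, and also because it can point to a misbehaving agent or something unusual happening in the environment.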
The final factor is setting expectations for the customer or organization. In order to thrive in a public cloud environment, it is essential for all parties to take ownership of the associated costs. Furthermore, any plan or strategy for public cloud adoption should not be focused solely on cost savings. While there are various options available, in most instances running a fixed set of workloads in an established private cloud datacenter will prove to be more cost-effective.