Aside from the variety of technical questions, a very common discussion around Azure Kubernetes Service (AKS) is … “What will it cost me?”. In today’s post we’ll dissect how the pricing dynamics work and how you can optimize the cost for your cluster(s). While this might not be rocket science, I have noticed some organizations struggling with it. So with this I hope to help those out… 😉
Let us take a look at the Azure Kubernetes Service pricing page… Here we can quickly notice that the distinction is made between ;
- Cluster Management, sometimes also referred to as “Master Node(s)”, or “Kubernetes API Server” (Purple)
- The nodes that will do the heavy lifting for you, also referred to as “Worker Node(s)” (Green)
The cluster management (purple) is free of charge. Here Microsoft will strive to attain at least 99.5% uptime. You can opt to purchase an Uptime SLA (roughly a bit less than 70 Euro per month per cluster). In that case, you get a financially backed uptime guarantee of 99.95% for the Kubernetes API server for clusters that use Azure Availability Zones, and 99.9% for clusters that do not.
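To put those percentages in perspective, here is a quick back-of-the-envelope calculation of the allowed downtime per month (assuming the 730-hour average month used later in this post):

```python
# Allowed monthly downtime for a given uptime percentage,
# assuming an average month of 730 hours.
HOURS_PER_MONTH = 730

def allowed_downtime_minutes(uptime_pct: float) -> float:
    """Minutes per month a service may be down and still meet the SLA."""
    return HOURS_PER_MONTH * 60 * (1 - uptime_pct / 100)

for pct in (99.5, 99.9, 99.95):
    print(f"{pct}% uptime -> ~{allowed_downtime_minutes(pct):.0f} minutes of downtime per month")
```

So the free tier's 99.5% target allows roughly 3.5 hours of API server downtime per month, while the 99.95% Uptime SLA brings that down to roughly 22 minutes.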
Next to that we have the worker nodes (green), which are basically charged at the regular virtual machine prices… And they follow the same logic in terms of SLAs. If you let that sink in… then the managed Kubernetes service is actually cheaper, right off the bat, than running your own Kubernetes distribution in Azure! If you know that roughly 97% of the Kubernetes code base has changed in the last three years, you can imagine the amount of work that falls in your lap if you want to maintain the same service level yourself.
Dimensions that influence pricing
In essence, the worker nodes will be the thing that influences your cost.
- On one hand we have the choice in terms of disks that are used: the capacity that needs to be provisioned per node, and the storage class that is being used (Standard HDD, Standard SSD or Premium SSD). If we leverage a 128GB Premium SSD, then this will cost us give or take 18 Euro per month per node.
- Though the biggest cost will typically come from the “compute” (virtual machine) part.
- Rightsizing ; When looking at the compute, the VM family will of course have an impact. The B-series (“burstable“) provides a very cost-efficient solution when the full performance is not needed all the time. Each VM family has its own traits for which it has been designed. Next up is the size of the virtual machine. As you can imagine, a machine with 1 core and 2GB of memory will cost less than a machine with 10TB of memory. So choosing the right size that aligns with your needs is already a crucial step! In an on-premises setup, it is common to oversize a virtual machine. This is an anti-pattern in the cloud! You can upgrade (or downgrade, of course) your virtual machine when your needs change.
- Snoozing / Scaling ; Your workload will not need all the resources all the time… You will see a baseline performance that is needed all the time, and a zone which has a given “seasonality” / “flexibility in terms of demand” to it. By scaling in and out, you can optimize your costs too! If you say that the workload is typically used for about 10h per working day, then this equals roughly 200 hours per month. If you know that a month on average is 730 hours, then you have basically already saved 72.6% off the costs of that “flexible zone” on top of your baseline. The concept of shutting down & starting up services when they are (not) needed is called “Snoozing”. The same logic can be applied to a non-production environment: do you need it 24×7 (730h/month), or do the office hours suffice (200h/month)?
- Reservations ; We just talked about scaling on top of the baseline. Though where the baseline is the capacity we need all the time, anything you do to improve capacity management ends up as a financial gain on your side. In this category there is also the concept of “Reservations”, where you commit to running the workload 24×7 for 1 or 3 years in exchange for a discount. You can even exchange or get a refund (with a cap per year) for those reservations.
- Failure-to-Tolerate ; The shit will hit the fan… It is not a matter of if, but of when. You should always ensure that your stack is resilient. In terms of your nodes, you can incorporate the concept of “failure-to-tolerate” (FTT). Meaning that if your capacity can be met by X nodes, you add an additional Y nodes that are in warm standby. If you say that you have an FTT of 2, then the amount of spare nodes (Y) is set to 2. As a general suggestion: the smaller the size of a virtual machine, the smaller the impact FTT has on your costs! Next to that, when being very prudent, you could say that you want to protect yourself from feeling the failure of an entire zone. Meaning that in a region with three zones, you would add 50% additional (and unused) capacity to your cluster, where in a region with only two zones, you would add 100% of additional capacity. As you can imagine, the benefit is the level of protection, where the flip side is of course that you are adding 50 to 100% to the cost profile of your workload.
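The arithmetic behind the snoozing and failure-to-tolerate bullets above can be sketched as follows (using the same node counts and hours as in the text):

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Snoozing: run the flexible zone only ~10h per working day (~200h/month)
flex_hours = 200
snooze_saving = 1 - flex_hours / HOURS_PER_MONTH
print(f"Snoozing saving on the flexible zone: {snooze_saving:.1%}")  # ~72.6%

# Failure-to-tolerate: X nodes meet capacity, Y spares run in warm standby
def ftt_overhead(needed_nodes: int, spares: int) -> float:
    """Extra cost (as a fraction) of adding `spares` standby nodes."""
    return spares / needed_nodes

print(f"FTT=2 on a 10-node cluster: +{ftt_overhead(10, 2):.0%} cost")

# Zone resilience: the surviving zones must carry the full load
def zone_overhead(zones: int) -> float:
    """Extra capacity needed to survive the loss of one availability zone."""
    return zones / (zones - 1) - 1

print(f"3 zones: +{zone_overhead(3):.0%} capacity, 2 zones: +{zone_overhead(2):.0%} capacity")
```

Note how the zone math works out: with three zones, each zone carries a third of the load, so losing one means the other two must each absorb half the total demand, hence the 50% headroom.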
Now let us do some simulations! For this exercise, we start off with a base need of 80 cores and 320GB of memory for our workload, where we noticed that ten nodes of the type “D8as v4” (8 vCPU & 32 GB RAM) can meet our need. For each cluster (located in the region West Europe), we will add an Uptime SLA and select a 128GB Premium SSD disk per individual node. From here, we will simulate 7 scenarios ;
- Full Flex ; All the nodes are consumed via a PAYG (pay-as-you-go) offering. This simulates a scenario where there is no commitment and you have the most flexibility.
- One Year ; All the nodes have been reserved for 1 year. This simulates a scenario where you commit to running the entire cluster for one year.
- Three Years ; All the nodes have been reserved for 3 years. This simulates a scenario where you commit to running the entire cluster for three years.
- Burst – Base 5 ; We start with a baseline of 5 nodes (reserved for 1 year), and the remaining 5 nodes are consumed via a PAYG offering, running for 200 hours per month. Here we simulate a workload that has a baseline of 5 nodes, and auto scaling for the remaining 5.
- Burst – Base 3 ; We start with a baseline of 3 nodes (reserved for 1 year), and the remaining 7 nodes are consumed via a PAYG offering, running for 200 hours per month. Here we simulate a workload that has a baseline of 3 nodes, and auto scaling for the remaining 7.
- Burst – Base 1 ; We start with a baseline of 1 node (reserved for 1 year), and the remaining 9 nodes are consumed via a PAYG offering, running for 200 hours per month. Here we simulate a workload that has a baseline of 1 node, and auto scaling for the remaining 9.
- Working Hours ; All the nodes are consumed via a PAYG offering, running for 200 hours per month. This simulates a non-production environment that is only needed during working hours.
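If you want to rebuild this simulation yourself, a minimal sketch could look like the one below. Note that all the rates are illustrative placeholders (the PAYG node-hour price and the reservation discounts are assumptions, not official Azure rates), so plug in the actual West Europe prices for your own exercise:

```python
# Rough monthly cost simulation for the 7 scenarios above.
# NOTE: all rates below are illustrative placeholders, NOT official
# Azure pricing -- replace them with the actual prices for your region.
HOURS = 730                           # average hours per month
PAYG_RATE = 0.45                      # assumed PAYG price per node-hour (EUR)
RESERVED_1Y = PAYG_RATE * (1 - 0.31)  # assumed ~31% one-year discount
RESERVED_3Y = PAYG_RATE * 0.5         # assumed roughly half price for 3 years
DISK = 18                             # Premium SSD 128GB, EUR/month/node
SLA = 70                              # Uptime SLA, EUR/month/cluster
NODES = 10

def cluster_cost(base_nodes: int, base_rate: float,
                 flex_nodes: int, flex_hours: float) -> float:
    """Monthly cost: reserved baseline runs 24x7, flex nodes run on PAYG."""
    base = base_nodes * base_rate * HOURS
    flex = flex_nodes * PAYG_RATE * flex_hours
    disks = (base_nodes + flex_nodes) * DISK   # disks are billed per month
    return base + flex + disks + SLA

scenarios = {
    "Full Flex":      cluster_cost(0, 0, NODES, HOURS),
    "One Year":       cluster_cost(NODES, RESERVED_1Y, 0, 0),
    "Three Years":    cluster_cost(NODES, RESERVED_3Y, 0, 0),
    "Burst - Base 5": cluster_cost(5, RESERVED_1Y, 5, 200),
    "Burst - Base 3": cluster_cost(3, RESERVED_1Y, 7, 200),
    "Burst - Base 1": cluster_cost(1, RESERVED_1Y, 9, 200),
    "Working Hours":  cluster_cost(0, 0, NODES, 200),
}
for name, cost in scenarios.items():
    print(f"{name:15s} ~ {cost:8.0f} EUR/month")
```

Even with placeholder rates, the relative ordering of the scenarios already mirrors what we will see in the results.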
Now let us take a look at the results…
And we can notice the following things ;
- The fully flexible (“Full Flex”) model is the most expensive, being three times as expensive as the cheapest one (“Working Hours”).
- When running the entire workload 24×7, we see that a reservation of one year will provide a benefit of 31% and reserving three years will slash the costs pretty much in half.
- In case of scaling, the benefits improve as you reduce the baseline. Having a 50/50 split between the base & scaling results in give or take the same cost as a three-year reservation, where the 10/90 split results in a cost profile similar to the “Working Hours” scenario.
Want to check the details yourself? Here you can go… 😉 => aks-cost-simulations.xlsx
By just playing with the reservation & snoozing sliders, you can already reduce your costs by two thirds… Where on the other hand the “Failure-to-Tolerate” (architecture) concept can double your costs in certain cases. I hope that knowing the impact of the various dimensions can help you optimize your costs! To quote Johan Cruyff: “Every disadvantage has its advantage!”. Committing means giving in on flexibility. Reducing safety guards will help in terms of costs. Though being aware of the dimensions will help you understand the cost impact, and make the right judgment calls.
Simulation for Azure Red Hat OpenShift
For those interested, if you are looking to do similar simulations for Azure Red Hat OpenShift, feel free to take a look at the following VMchooser module ; http://www.vmchooser.com/azureredhatopenshift