Introduction
For this post I am assuming you are pretty familiar with the concept of deployment strategies (if not, check out this post by Etienne). These are typically seen at the application deployment level, where platforms (like for instance Kubernetes) typically have out-of-the-box mechanisms in place for them. Now what if you wanted to do this at an “infrastructure level”, like for instance the Kubernetes version of Azure Kubernetes Service? We could do an in-place upgrade, which will carefully cordon and drain the nodes. Though what if things go bad? Could we do a Canary, Blue/Green, A/B, Shadow, … at cluster level too? And how would we tackle the infrastructure point of view of this? That is the basis for today’s post!
Architecture at hand
For today’s post we’ll leverage the following high-level architecture:
This project leverages Terraform under the hood. Things like DNS, Traffic Manager, Key Vault, CosmosDB, etc. are “stateful”: their lifecycle is fully managed by Terraform. On the other hand, our Kubernetes clusters are “stateless” from an Infrastructure-as-Code point of view. We deploy them via Terraform, though do not keep track of them afterwards… All subsequent lifecycle management is done by operating on the associated tags.
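To make that “stateless” idea a bit more concrete, here is a minimal sketch (as a dry run that only echoes the commands) of one way such a flow could look: apply the cluster with Terraform, then remove it from the state so Terraform forgets about it. The module address is hypothetical, and whether this project does exactly this is an assumption on my part.

```shell
# Hypothetical sketch: create a cluster via Terraform, then drop it
# from the state file so later applies/destroys ignore it.
# Echoed as a dry run; the module address "module.aks_<name>" is made up.
deploy_stateless_cluster() {
  echo "terraform apply -auto-approve -target=module.aks_$1"
  echo "terraform state rm module.aks_$1"
}

plan=$(deploy_stateless_cluster blue)
echo "$plan"
```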
Community-Tool-of-the-day
The drawing above was not created in Visio for once. It was made with CloudSkew, which was created by Mithun Shanbhag. Always awesome to see community contributions, which we can only applaud!
Tags, tags, tags!
People typically think of tags in association with billing scenarios. Though they unlock a lot of potential on the operational side of things too! Check out these two resources:
- Traffic Manager
- AKS
Upon creation, both resources are tagged with the environment and workload they need to serve. The Traffic Manager profile even has an additional dimension to it, stating which component (“microservice”) is linked to it. The AKS cluster has a tag indicating the creation date, which helps with the maintenance tasks, as we did not want this resource to be stateful from an Infrastructure-as-Code point of view.
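As a sketch of what that tag set could look like when assembled from a script: the tag names below mirror the scheme described above, but the concrete values (and any resource names) are made up for illustration.

```shell
# Hypothetical tag values; Environment/Workload/CreationDate mirror
# the tagging scheme described in the post.
environment="staging"
workload="webshop"
creation_date=$(date +%Y-%m-%d)

# These could then be passed at creation time, e.g.:
#   az aks create ... --tags $tags
tags="Environment=$environment Workload=$workload CreationDate=$creation_date"
echo "$tags"
```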
Azure Graph
Cool, we tagged stuff! But now what? You can query the Azure Resource Graph to find them! For instance, the deployment script will look for (aka “discover”) the clusters it needs to deploy to. An example from within the script:
clusters=$(az graph query -q "Resources \
  | where type =~ \"Microsoft.ContainerService/ManagedClusters\" \
  | where properties.provisioningState =~ \"Succeeded\" \
  | where tags[\"Environment\"] =~ \"$environment\" \
  | where tags[\"Workload\"] =~ \"$workload\" \
  | project name" -o yaml | awk '{ print $3 }')
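To show why the `awk '{ print $3 }'` part does the trick: the YAML output lists each result as `- name: <value>`, so the name ends up as the third whitespace-separated field. A small simulation (the exact output shape of `az graph query -o yaml` is an assumption here):

```shell
# Simulated 'az graph query ... -o yaml' output; the real output
# shape may differ slightly.
output='data:
- name: aks-blue-20210501
- name: aks-green-20210601'

# Each result row reads "- name: <value>": the name is field 3.
# Filtering on "- name:" skips the "data:" header line.
clusters=$(printf '%s\n' "$output" | awk '/- name:/ { print $3 }')
echo "$clusters"
```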
So it looks for a resource type that has “Succeeded” as provisioning state, with the environment/workload tags set to where we would like to deploy. The same logic applies to the Traffic Manager query, where we go one level further and also filter on the linked “component”.
trafficmanagerprofiles=$(az graph query -q "Resources \
  | where type =~ \"Microsoft.Network/trafficManagerProfiles\" \
  | where tags[\"Component\"] =~ \"$component\" \
  | where tags[\"Environment\"] =~ \"$environment\" \
  | where tags[\"Workload\"] =~ \"$workload\" \
  | project name" -o yaml | awk '{ print $3 }')
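Once discovered, the script can iterate over the results and roll out to each cluster. A dry-run sketch of what such a loop could look like (the Helm release name, chart path and `--kube-context` usage are assumptions for illustration, not the post’s actual script):

```shell
workload="webshop"
clusters="aks-blue aks-green"   # stand-in for the graph query result

# Dry run: collect the rollout commands instead of executing them.
plan=""
for cluster in $clusters; do
  plan="$plan
helm upgrade --install $workload ./charts/$workload --kube-context $cluster"
done
echo "$plan"
```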
Putting things together
So which components are used in the project to stitch it all together?
- Azure DevOps for all the CI/CD orchestration
- Terraform for all the infrastructure-as-code landscaping
- Azure Container Registry to store the container images AND Helm charts
- Helm as package manager for the applications
- Azure Resource Graph supporting the shell script(s) used to discover clusters and deploy the applications
What are the gotchas?
- Error handling! As always… Here I initially forgot to do a check on the status of my k8s service, which made deployments fail. Or the other way around: if no deployments existed, it would loop forever.
- The maintenance scripts for keeping the environment clean are a crucial part. Working with Terraform to keep the state is awesome, but as I did not want a static set for blue/green, I went stateless and have to accept the disadvantages of this approach.
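For the cleanup part, the CreationDate tag makes a simple retention check possible. A hedged sketch (this uses GNU `date` syntax; the `YYYY-MM-DD` tag format and the retention window are assumptions, not details from the post):

```shell
# Flag a cluster as stale when its CreationDate tag is older than
# the retention window. YYYY-MM-DD dates compare correctly as strings.
retention_days=14
creation_date="2021-01-01"   # would come from the cluster's tag
cutoff=$(date -d "-${retention_days} days" +%Y-%m-%d)

if [[ "$creation_date" < "$cutoff" ]]; then
  status="stale"
else
  status="keep"
fi
echo "$status"
```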
Closing Thoughts
This setup has cost me way more time than I would care to admit. Though I am -very- happy with how it has turned out! It is doing what I wanted it to do and gives me the flexibility to work both on the infra level and on the individual components running on top of it.
At the moment the infra level uses a very basic blue/green approach. In the future I’m looking to expand on it even further with multi-region support and possibly a kind of automated canary approach. Though, as always, baby steps… Move, learn, iterate. 😉