Introduction
After having migrated VMchooser from a fully Serverless infrastructure to Containers, I am currently making the opposite move. I can start from the same code base and run it on different deployment options in Azure, and I found that the serverless deployment added more value for me, along with a lower cost profile. That being said, one of the big learnings I had this week is that while having an automated landscape with Terraform, some changes are rather intrusive… I should have checked the output of the terraform plan stage, but I failed to do so, which resulted in downtime for VMchooser. So I was looking for a way to do operational validation in the least intrusive and most re-usable way. This led me to a solution where the Azure DevOps pipelines leverage the health check used in the Traffic Manager deployment. That health check was already part of the deployment, of course, and it is a key signal for understanding whether the deployment is healthy or not.
Gates
In order to add validation steps in our deployment process, we can leverage the concept of Gates in Azure DevOps ;
Gates allow automatic collection of health signals from external services, and then promote the release when all the signals are successful at the same time or stop the deployment on timeout. Typically, gates are used in connection with incident management, problem management, change management, monitoring, and external approval systems.
Most of the health parameters vary over time, regularly changing their status from healthy to unhealthy and back to healthy. To account for such variations, all the gates are periodically re-evaluated until all of them are successful at the same time. The release execution and deployment does not proceed if all gates do not succeed in the same interval and before the configured timeout. The following diagram illustrates the flow of gate evaluation where, after the initial stabilization delay period and three sampling intervals, the deployment is approved.
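To make the sampling behaviour a bit more tangible, here is a minimal Python sketch of that evaluation loop. It is a simulation under my own assumptions and not how Azure DevOps implements it: the function `evaluate_gates`, the gate names and the minute-based timings are all hypothetical.

```python
from typing import Callable, Dict, Optional


def evaluate_gates(
    gates: Dict[str, Callable[[int], bool]],
    stabilization_delay: int,
    sampling_interval: int,
    timeout: int,
) -> Optional[int]:
    """Return the minute at which all gates pass in the same sample, or None on timeout."""
    # The first sample is only taken after the stabilization delay.
    minute = stabilization_delay
    while minute <= timeout:
        # Every gate is re-evaluated on every sample; earlier results do not count.
        results = {name: check(minute) for name, check in gates.items()}
        if all(results.values()):
            return minute
        minute += sampling_interval
    return None


# Hypothetical gates: each one receives the simulated minute and reports health.
gates = {
    "frontend_alerts": lambda minute: minute >= 10,
    "backend_alerts": lambda minute: minute >= 15,
}

approved_at = evaluate_gates(
    gates, stabilization_delay=5, sampling_interval=5, timeout=60
)
print(approved_at)
```

With these made-up numbers the samples happen at minute 5, 10 and 15, and only the third sample sees both gates healthy at the same time.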
So for the operational validation, I added three gates to my production environment ;
- Approval ; Someone that manually approves that the release may continue
- Query Azure Monitor alerts (2x) ; To check if there are any Azure Monitor alerts on either the front-end or back-end Traffic Manager.
This has the effect that the deployment will only start once all three gates are “green”!
That already helps a lot in terms of validating the quality of the release.
Architecture
So what does the high-level architecture look like? We have an Azure Traffic Manager that is configured with a health check towards the back-end service. On Azure Monitor, there is an Alert Rule configured that checks if there is at least one healthy endpoint for the Traffic Manager, and it raises an alert if that is not the case. On Azure DevOps, we have a release gate that checks if there were any alerts in the last hour for both the front-end and back-end Traffic Managers. Once all gates are marked as “okay”, the deployment towards the production environment will commence.
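The condition behind that alert rule can be sketched in a few lines of Python. This is only an illustration of the logic described above; the `Endpoint` class, the endpoint names and `should_raise_alert` are hypothetical and not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    name: str      # hypothetical endpoint name
    healthy: bool  # outcome of the Traffic Manager health check


def should_raise_alert(endpoints: List[Endpoint], minimum_healthy: int = 1) -> bool:
    """Raise an alert when fewer than `minimum_healthy` endpoints are healthy."""
    healthy_count = sum(1 for endpoint in endpoints if endpoint.healthy)
    return healthy_count < minimum_healthy


backend = [Endpoint("backend-functions", healthy=False)]
print(should_raise_alert(backend))
```

A profile whose only endpoint fails the health check has zero healthy endpoints, so the rule fires and the gate in Azure DevOps will see the alert.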
Taking a closer look
I can already hear you saying ; “All fine and dandy… Show me the goods!”. So let us delve into the details… 😉 First of all, I have several environments for my pipelines.
Let us take a look at the production one, and go to “Approvals and checks”.
Here we see the three gates that have been configured ;
One where the manual approval is needed…
And then two checks on the traffic manager profile ;
…
An Example Pipeline
If we check one of the pipelines that is linked to this environment, we can see that the “Deploy to Prd” is showing that three checks have passed.
If we click on that, we will see that the approval and both Azure Monitor Alert checks passed.
When delving into the Azure Monitor alerts, we can see that it took a while for the release to turn green.
If we look at the first attempt, we can see that there were 4 outstanding alerts on the traffic manager resource ;
Though in the last one, these have cleared, giving us the green light to continue. (The alerts were there because the environment had not been deployed yet, and it only got deployed a bit later.)
And of course, we can also see who approved the release… 😉
View from Azure Monitor
Checking on the Traffic Manager, we see that the endpoint monitor is set to a given path to see if the back-end is healthy or not ;
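For those who have never built such a health path, here is a minimal Python sketch of what the back-end side could look like. It is not the VMchooser implementation: the path `/health`, the placeholder `backend_is_healthy` and the handler are assumptions for illustration, and it assumes the default Traffic Manager behaviour where an HTTP 200 response counts as healthy.

```python
from http.server import BaseHTTPRequestHandler

HEALTH_PATH = "/health"  # hypothetical path; use whatever the endpoint monitor probes


def backend_is_healthy() -> bool:
    """Placeholder check; a real one might test a database or a downstream API."""
    return True


def health_status(path: str, healthy: bool) -> int:
    """Map a probed path and the health state to an HTTP status code."""
    if path != HEALTH_PATH:
        return 404
    # Assumption: 200 is treated as healthy, anything else as degraded.
    return 200 if healthy else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Answer the probe with a status code only; no body is needed.
        self.send_response(health_status(self.path, backend_is_healthy()))
        self.end_headers()
```

To try it locally, something like `HTTPServer(("", 8080), HealthHandler).serve_forever()` from `http.server` should be enough.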
If we now go to “Alerts”, then we can see that there are quite a few outstanding issues…
Which are all logged from the same alert rule…
When we check the details of this, we can see the reason why this alert fired, as the endpoint was unavailable for about half an hour.
Closing Thoughts
Something I noticed is that the clearing of the alerts is not as seamless as you would expect. So be aware that the time range you set is really to be regarded as “the period in which no alert should have been detected”, which is also a good approach to take. It was a misconception I had when first experimenting with this, though.
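The way I now read that time range can be sketched as follows. This models my own observation, not the documented internals of the check; `gate_passes` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def gate_passes(alerts_fired_at: List[datetime], now: datetime, window: timedelta) -> bool:
    """Pass only if no alert fired inside the window, even if it has since cleared."""
    window_start = now - window
    return not any(window_start <= fired <= now for fired in alerts_fired_at)


fired = [datetime(2020, 1, 1, 11, 30)]  # made-up alert timestamp
one_hour = timedelta(hours=1)

print(gate_passes(fired, datetime(2020, 1, 1, 12, 0), one_hour))
print(gate_passes(fired, datetime(2020, 1, 1, 13, 0), one_hour))
```

In this example the alert from 11:30 still blocks the gate at 12:00, and it only stops mattering once the one-hour window has moved past it.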
Although I was pretty late to the party in adding these kinds of operational validation steps, they have already proven to be very valuable. As you can imagine, I make more than enough errors and only spotless quality should hit production! 😉
And of course, the way to achieve all this is actually very simple to do! For about 20 cents per month (per alert rule), I can easily improve the quality checks on VMchooser.