Introduction
Resiliency patterns in Azure are a very common, recurring question. Over the course of time, though, I’ve noticed there is a lot of confusion around the architectural patterns involved. This mostly comes down to the basic illusion that HA (High Availability) and DR (Disaster Recovery) are both met by deploying a stretched cluster.
Overview of all the patterns
Theory
The above overview is based on the following concepts:
- Check out the SLAs around virtual machines
  - Single Instance SLA = 99.9%
  - Multiple machines in an Availability Set = 99.95%
  - Multiple machines in an Availability Zone = 99.99%
- Check out a previous blog post around “System Reliability & Availability”
- Check out blogs on the difference between “High Availability & Disaster Recovery”
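To make the SLA percentages listed above more tangible, here is a small sketch that converts each one into the downtime it still allows. This is plain arithmetic, not an Azure API; the function name `allowed_downtime_hours` and the 365-day year are my own assumptions for illustration.

```python
# Illustrative arithmetic only: how much downtime does an SLA percentage allow?

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) that still meets the SLA."""
    return period_hours * (1 - sla_percent / 100)


# SLA figures as quoted in the list above.
SLAS = {
    "Single instance": 99.9,
    "Availability Set": 99.95,
    "Availability Zone": 99.99,
}

for name, sla in SLAS.items():
    hours = allowed_downtime_hours(sla)
    minutes_per_month = hours * 60 / 12
    print(f"{name:<18} {sla}%  ~{hours:.2f} h/year  (~{minutes_per_month:.1f} min/month)")
```

Going from 99.9% to 99.99% shrinks the allowed downtime from roughly 8.76 hours to under one hour per year, which is why the choice between a single instance, an Availability Set and Availability Zones matters.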
High availability (HA) is a measure of a system’s ability to remain accessible when a system component fails. Generally, HA is implemented by building multiple levels of fault tolerance and/or load balancing capabilities into a system. Disaster recovery (DR), on the other hand, is the process by which a system is restored to a previous acceptable state after a natural or man-made disaster.
While both increase overall availability, a notable difference is that with HA there is generally no loss of service, whereas with DR there is usually a slight loss of service while the DR plan is executed and the system is restored. Put differently, HA refers to retaining the service and DR to retaining the data.
HA and DR strategies should strive to address any non-functional requirements, such as performance, system availability, fault tolerance, data retention, business continuity, and user experience. It is imperative that the selection of the appropriate HA and DR strategy be driven by business requirements. For HA, determine any service level agreements expected of your system. For DR, use measurable characteristics, such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO), to drive your DR plan.
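As a minimal sketch of what "driven by RTO and RPO" can look like in practice, the snippet below checks candidate DR designs against a set of objectives. All names (`DrObjectives`, `DrDesign`, `meets_objectives`) and every number are hypothetical and only there for illustration; real values have to come from your business requirements and from actual failover tests.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrObjectives:
    rto: timedelta  # Recovery Time Objective: max acceptable time to restore service
    rpo: timedelta  # Recovery Point Objective: max acceptable window of lost data


@dataclass(frozen=True)
class DrDesign:
    name: str
    restore_time: timedelta      # how long a failover/restore takes (measured or estimated)
    data_loss_window: timedelta  # how much recent data could be lost


def meets_objectives(design: DrDesign, objectives: DrObjectives) -> bool:
    """A design qualifies only if it satisfies BOTH the RTO and the RPO."""
    return (design.restore_time <= objectives.rto
            and design.data_loss_window <= objectives.rpo)


# Hypothetical business requirement: back within 4 hours, lose at most 15 minutes of data.
objectives = DrObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# Hypothetical candidate designs with made-up figures.
designs = [
    DrDesign("async replica in second region", timedelta(hours=1), timedelta(minutes=5)),
    DrDesign("nightly backup restore", timedelta(hours=6), timedelta(hours=24)),
]

for d in designs:
    verdict = "meets" if meets_objectives(d, objectives) else "misses"
    print(f"{d.name}: {verdict} RTO/RPO")
```

The point of the sketch is that RTO and RPO are two separate tests: a design that restores quickly but loses a day of data still fails the plan.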
Considerations
- On several slides I colour-coded DR grey, because one could argue both ways. If, for example, you run a truly stretched cluster, then it would be RED. On the other hand, if you set up something like Always On with replicas, you could pull off GREEN. Be aware, though, that such a setup is very hard to achieve!
- Another consideration is that a BCP (Business Continuity Plan), which is where the DRP (Disaster Recovery Plan) is typically incorporated, usually looks at more than the technical risks. A geopolitical zone might be considered. Imagine that war broke out in … the Netherlands (?) … Then one could argue that all the zones of the “West Europe” region of Azure might be compromised.
- Clusters stretched across regions are heavily impacted by physics (latency). This will have an impact on performance. Whilst I’ve included some designs to cover this, you’ll notice that they are very hard to pull off in reality.
- Be aware that both inter-Zone & inter-Region designs will have an impact on the networking costs! So do take this into consideration too…
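To put a rough number on the "physics" remark about stretched clusters, the sketch below estimates the best-case round-trip time over fibre and what that means for synchronous writes. It assumes light travels at roughly 200 km per millisecond in optical fibre (about two thirds of its speed in vacuum) and ignores routing, switching and protocol overhead, so real latencies will be higher. The distances are hypothetical examples, not measured Azure figures.

```python
# Back-of-the-envelope lower bound on network latency between two sites.

FIBRE_KM_PER_MS = 200.0  # approx. speed of light in optical fibre (assumption: ~2/3 of c)


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time in milliseconds for a given fibre distance."""
    return 2 * distance_km / FIBRE_KM_PER_MS


def max_sequential_sync_writes_per_s(distance_km: float) -> float:
    """Upper bound on strictly sequential writes per second when each
    write must wait for one round trip before it is acknowledged."""
    return 1000.0 / min_round_trip_ms(distance_km)


# Hypothetical distances, for illustration only.
examples = {
    "between zones (~10 km)": 10,
    "between nearby regions (~1000 km)": 1000,
    "between continents (~6000 km)": 6000,
}

for label, km in examples.items():
    rtt = min_round_trip_ms(km)
    rate = max_sequential_sync_writes_per_s(km)
    print(f"{label:<36} RTT >= {rtt:6.2f} ms, <= {rate:8.0f} sequential sync writes/s")
```

Even in this best case, a synchronous commit across 1000 km cannot be acknowledged in under about 10 ms, which is why stretching a cluster across regions costs performance no matter how good the design is.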
Closing Thoughts
As always, I hope this post helped you in your personal journey… 😉