Azure : How to prepare for maintenance impacting your Virtual Machines

Introduction

A topic that’s often discussed in workshops is “Availability Sets”. And during that topic, one question pops up every time: “Can I schedule the maintenance for my VMs, because…”. Today we’ll delve into that part.

 

Why do we need maintenance?

For some this might seem like a very odd question to pose, as maintenance is a given fact of life. Still, some organisations live by the mantra “if it isn’t broke, why fix it”, and once a system gets deployed, they’ll try never to touch it again…

If you are one of the (un)fortunate who have been in my sessions, you’ll probably have heard me say multiple times that the cloud evolves at an incredible pace. Last year (2016), about 600 features/services were added to the Azure landscape. As you can imagine, this drives a lot of change in the systems providing those services.

Aside from that, Microsoft Azure periodically performs updates across the globe to improve the reliability, performance, and security of the host infrastructure that underlies virtual machines. You expect top service from Azure on all fronts, and the flip side of that coin is that the systems need to be maintained.

 

Types of maintenance

The first thing to understand is that there are two types of maintenance:

  • Planned maintenance events are periodic updates made by Microsoft to the underlying Azure platform to improve overall reliability, performance, and security of the platform infrastructure that your virtual machines run on. Most of these updates are performed without any impact upon your virtual machines or cloud services. However, there are instances where these updates require a reboot of your virtual machine to apply the required updates to the platform infrastructure.
  • Unplanned maintenance events occur when the hardware or physical infrastructure underlying your virtual machine has faulted in some way. This may include local network failures, local disk failures, or other rack level failures. When such a failure is detected, the Azure platform automatically migrates your virtual machine from the unhealthy physical machine hosting your virtual machine to a healthy physical machine. Such events are rare, but may also cause your virtual machine to reboot.

So with planned (scheduled) maintenance, the biggest chunk of updates is applied without any impact, via:

Memory-preserving updates
For a class of updates in Microsoft Azure, customers will not see any impact to their running virtual machines. Many of these updates are to components or services that can be updated without interfering with the running instance. Some of these updates are platform infrastructure updates on the host operating system that can be applied without requiring a full reboot of the virtual machines.

These updates are accomplished with technology that enables in-place live migration, also called a “memory-preserving” update. When updating, the virtual machine is placed into a “paused” state, preserving the memory in RAM, while the underlying host operating system receives the necessary updates and patches. The virtual machine is resumed within 30 seconds of being paused. After resuming, the clock of the virtual machine is automatically synchronized.

Not all updates can be deployed by using this mechanism, but given the short pause period, deploying updates in this way greatly reduces impact to virtual machines.

Multi-instance updates (for virtual machines in an availability set) are applied one update domain at a time.
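
To make the “one update domain at a time” idea a bit more tangible, here is a minimal sketch in Python. It is a toy model, not how Azure places or updates machines internally: the round-robin placement, the default of 5 update domains, and the function names `assign_update_domains` and `rolling_update` are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_update_domains(vm_names, domain_count=5):
    """Spread VMs round-robin over update domains (illustrative placement only)."""
    domains = defaultdict(list)
    for index, name in enumerate(vm_names):
        domains[index % domain_count].append(name)
    return dict(domains)


def rolling_update(domains):
    """Yield (domain, rebooting, still_running), touching one domain per step."""
    all_vms = {vm for vms in domains.values() for vm in vms}
    for domain in sorted(domains):
        rebooting = set(domains[domain])
        # Every VM outside the current update domain keeps serving traffic.
        yield domain, rebooting, all_vms - rebooting


if __name__ == "__main__":
    vms = ["web-0", "web-1", "web-2"]  # hypothetical VM names
    for domain, rebooting, running in rolling_update(assign_update_domains(vms)):
        print(f"UD {domain}: rebooting {sorted(rebooting)}, still serving {sorted(running)}")
```

With two or more VMs spread over at least two update domains, every step leaves at least one machine serving. With a single VM, the “still serving” set is empty during its step, so the workload is unavailable while that machine reboots.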

For the technical guys/gals out there… We are talking about anything that can be done with things similar to “VMware vMotion”, “Hyper-V Live Migration”, etc.

 

Single & Multi Instance machines

Sometimes, though, a virtual machine has to be rebooted before certain host-level updates take effect. In that case, the behaviour differs depending on the “high availability” characteristics of the given system:

  • Multi-instance machines
  • Single-instance machines

 

What does the documentation say about this…

Single-instance configuration updates
After the multi-instance configuration updates are complete, Azure will perform single-instance configuration updates. This update also causes a reboot to your virtual machines that are not running in availability sets.

Please note that even if you have only one instance running in an availability set, the Azure platform treats it as a multi-instance configuration update.

For virtual machines in a single-instance configuration, virtual machines are updated by shutting down the virtual machines, applying the update to the host machine, and restarting the virtual machines, which results in approximately 15 minutes of downtime. These updates are run across all virtual machines in a region in a single maintenance window.

This planned maintenance event will impact the availability of your application for this type of virtual machine configuration. Azure offers a 1-week advance notification for planned maintenance of virtual machines in the single-instance configuration.

 

Service Levels 

Is there an SLA on my Azure Virtual Machines? Yes, there is!

  • Multi-instance machines have an SLA of 99.95%
  • Single-instance machines have an SLA of 99.9% (only when they use premium storage)

Be aware, though, that these availability targets do not cover planned maintenance!
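
To put those percentages in perspective, the small Python sketch below converts an SLA percentage into the downtime it tolerates. The 30-day month and the function name `allowed_downtime_minutes` are my own assumptions for the arithmetic; they are not how the SLA itself defines its measurement period.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes; assumed period


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Return the minutes of downtime that still satisfy the given SLA percentage."""
    return period_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    for label, sla in [("multi-instance", 99.95), ("single-instance", 99.9)]:
        minutes = allowed_downtime_minutes(sla)
        print(f"{label}: {sla}% allows about {minutes:.1f} minutes per 30-day month")
```

In a 30-day month that works out to roughly 21.6 minutes for 99.95% and 43.2 minutes for 99.9%. A single 15-minute planned reboot is of the same order of magnitude, which is why it matters that planned maintenance sits outside these targets.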

 

So what about my single VMs?

How hard it is to face reality… If you have a business-critical workload, then you will need to set it up in a highly available manner. I’m fully aware that A LOT of applications are unable to do so. Though, if we’re really honest about this, should such software be running your business-critical workload? And if you have no other option, then 15 minutes of downtime will not be the end of the world, I guess. Yes, I know, I’m being a bit blunt here… But to be honest, I’ve seen way too many customers fooling themselves into thinking something has the needed availability, where in reality they need to jump through hoops to achieve the needed uptime.

 

 

Closing Thoughts

  • The good news in this story is that you’ll get notified a week in advance, so you can warn your users about the maintenance ahead of time. Sadly, you will not be able to schedule the maintenance window for a single-instance system yourself.
  • High availability is often seen as a commodity. In reality, I’ve seen all too many software implementations struggle with it.
  • Be sure to plan for your expected availability and do not take anything for granted! You might look towards Azure to fix issues that shouldn’t be fixed by the underlying platform. Application architectures should be designed with resilience as a base capability.