A topic that’s often discussed in workshops is “Availability Sets”. And during that topic, a question/comment pops up every time ; “Can I schedule the maintenance for my VMs, because…”. Today we’ll delve into that part.
Why do we need maintenance?
For some this might seem like a very odd question to pose, as maintenance is a given fact of life. Though some organisations live by the mantra “if it isn’t broke, why fix it”, and once a system gets deployed, they’ll (try to) never touch it again…
Continue reading “Azure : How to prepare for maintenance impacting your Virtual Machines”
There are several questions that I’m often posed that relate to availability on Azure. In today’s post, we’ll take a look at the different availability patterns. I hope this will answer a big portion of the questions you might have about availability on Azure. The main focus of this post will be the “IaaS” chunk of Azure services. Services like Azure SQL, Web Apps, etc. may have a totally different approach. But then again, you are not responsible for designing (and thus do not need to worry about) the availability aspect of those services.
Continue reading “Azure : Availability Patterns for IaaS – Can I do multiple regions?”
Today I was setting up a Traffic Manager deployment in Resource Manager. I wanted a rather “simple” failover scenario where my secondary site would only take over when my primary site was down. As you might know, there are several routing methods, where “Failover” is one ;
Failover: Select Failover when you have endpoints in the same or different Azure datacenters (known as regions in the Azure classic portal) and want to use a primary endpoint for all traffic, but provide backups in case the primary or the backup endpoints are unavailable.
Though I was surprised that the naming between “classic mode” (“the old portal”) and “resource manager” (“the new portal”) was different!
“Classic Mode” / Service Management
So when taking a look at “classic mode”, we see three methods ;
They are described fairly in-depth on the documentation page, though in short ;
- Performance : You’ll be redirected to the closest endpoint (based on network response in ms)
- Round Robin : The load will be distributed between all nodes. Depending on the weight of a node, one might get more or less requests.
- Failover : A pecking order will be in place. The highest-ranking system that is alive will receive the requests.
“New Portal” / Resource Manager
When taking a look at “Resource Manager”, we’ll see (again) three methods ;
Though the naming differs… When going into the technical details, it’s more a naming thing than a technical thing. The functionality is (give or take) the same. Where “Round Robin” had the option of weights (1-1000) before, this is now a focal point. Where “Failover” was working with a list (visualization), you can now directly alter the “priority” (1-1000) of each endpoint.
The info when checking out the routing method from within the portal ;
- Performance: Use this method when your endpoints are deployed in different geographic locations, and you want to use the one with the lowest latency.
- Priority: Use this method when you want to select an endpoint which has highest priority and is available.
- Weighted: Use this method when you want to distribute traffic across a set of endpoints as per the weights provided.
While the naming differs between the two stacks, the functionality remains the same ;
- Performance didn’t get renamed
- Round Robin became “Weighted”
- Failover became “Priority”
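For those who want to script this, a minimal sketch of the failover (now “Priority”) scenario using the cross-platform Azure CLI ; the resource group, profile name and endpoint targets below are hypothetical, and a current “az” CLI is assumed ;

```shell
# Create a Traffic Manager profile using the "Priority" routing method
# (the Resource Manager name for what classic mode called "Failover").
# Resource group and names (rg-demo, tm-demo-profile) are placeholders.
az network traffic-manager profile create \
  --resource-group rg-demo \
  --name tm-demo-profile \
  --routing-method Priority \
  --unique-dns-name tm-demo-profile \
  --ttl 30 --protocol HTTP --port 80 --path "/"

# Primary endpoint : priority 1 receives all traffic while it is healthy.
az network traffic-manager endpoint create \
  --resource-group rg-demo --profile-name tm-demo-profile \
  --name primary --type externalEndpoints \
  --target primary.example.com --priority 1

# Secondary endpoint : only used when the primary is marked degraded.
az network traffic-manager endpoint create \
  --resource-group rg-demo --profile-name tm-demo-profile \
  --name secondary --type externalEndpoints \
  --target secondary.example.com --priority 2
```

The lower the priority number, the earlier the endpoint sits in the pecking order.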
It is important to know that you will only get an SLA (99,95%) from Azure when you have two machines deployed (within one availability set) that do the same thing. If this is not the case, then Microsoft will not guarantee anything. Why is that? Because during service windows, a machine can go down. Those service windows are quite broad in terms of time, and you will not be able to negotiate or know the exact downtime in advance.
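To put that 99,95% into perspective ; a quick back-of-the-envelope calculation (a sketch ; the 30-day month is an assumption) of the downtime that still fits within the SLA ;

```shell
# 99.95% availability over a 30-day month (assumed) leaves 0.05% downtime.
minutes_per_month=$((30 * 24 * 60))     # 43200 minutes
awk -v m="$minutes_per_month" \
  'BEGIN { printf "Allowed downtime: %.1f minutes/month\n", m * 0.0005 }'
# → Allowed downtime: 21.6 minutes/month
```

So even within the SLA, you should design for roughly twenty minutes of single-instance downtime per month.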
That being said… Setting up your own highly available SQL database is not that easy. There are several options, though it basically boils down to the following ;
- an AlwaysOn Availability Groups setup
- a Failover Cluster backed by SIOS DataKeeper
While I really like AlwaysOn, there are two downsides to that approach ;
- to really enjoy it, you need the enterprise edition (which isn’t exactly cheap)
- not all applications support AlwaysOn in their implementations
So a lot of organisations were stranded when it came to moving their SQL workloads to Azure. Though, thank god, a third-party tool introduced itself ; SIOS DataKeeper! Now we can build our traditional Failover Cluster on Azure.
Before we start, let’s delve into the design for our setup ;
Continue reading “Azure : Setting up a high available SQL cluster with standard edition”
One of the sensitive areas when it comes to Docker is persistent storage… A typical service upgrade involves shutting down the “V1” container and pulling/starting the “V2” container. If no actions are taken, all your data will be wiped… This is not really the scenario we want of course!
So today we’ll go over several variants when it comes down to data persistence ;
- Default : No Data Persistence
- Data Volumes : Container Persistence
- Data Only Container : Container Persistence
- Host Mapped Volume : Container Persistence
- Host Mapped Volume, backed by Shared Storage : Host Persistence
- Convoy Volume Plugin : Host Persistence
What do I mean by the different (self-invented) persistence levels ;
- Container : An upgrade of the container will not wipe the data
- Host : A host failure will not result in data loss
So let’s go through the different variants, shall we?
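As a quick preview, a minimal sketch of the first few variants ; the image name (myorg/app) and the paths are hypothetical, and a docker engine with named-volume support is assumed ;

```shell
# Default : no data persistence. Everything written inside the container's
# filesystem is gone once the container is removed.
docker run -d --name app myorg/app:v1

# Data volume : a named volume that survives the container upgrade.
docker volume create --name appdata
docker run -d --name app-v1 -v appdata:/var/lib/app myorg/app:v1
docker rm -f app-v1                                   # the "upgrade"
docker run -d --name app-v2 -v appdata:/var/lib/app myorg/app:v2

# Host mapped volume : a directory on the host backs the container data,
# so it survives upgrades — but not the loss of the host itself.
docker run -d --name app-host -v /srv/appdata:/var/lib/app myorg/app:v1
```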
Continue reading “Docker : Storage Patterns for Persistence”
Persistent storage is a tough cookie for Docker… We’ve seen things like Flocker, Convoy, … Today, we’ll do a very rough (and experimental!) setup with an Azure storage account (via Azure File Share) as shared storage. How would such a design look?
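To give away a bit of the plumbing ; on a Linux docker host, attaching the Azure File Share comes down to a plain CIFS (SMB) mount. A minimal sketch, where the storage account name, share name and key are placeholders ;

```shell
# Install the CIFS client tooling (Debian/Ubuntu assumed).
apt-get install -y cifs-utils

mkdir -p /mnt/dockerdata

# Mount the Azure File Share. "mystorageacct", "dockerdata" and the key
# are placeholders for your own storage account, share and access key.
mount -t cifs //mystorageacct.file.core.windows.net/dockerdata /mnt/dockerdata \
  -o vers=3.0,username=mystorageacct,password='<storage-account-key>',dir_mode=0777,file_mode=0777

# Any container on this host can now map the share as a host volume.
docker run -d -v /mnt/dockerdata/app:/var/lib/app myorg/app:v1
```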
Continue reading “Azure & Docker : Shared storage anyone?”
Ever heard about the terms RTO (Recovery Time Objective) and RPO (Recovery Point Objective)?
To explain it, let us take a look at this mockup…
In the middle, the crash indicates the time disaster struck. From there, we go back to the latest point at which we took our backup off-site (relative to the affected crash site) ; this interval should be less than the maximum tolerable period in which data might be lost from an IT service due to a major incident. So the arrow towards the left indicates the RPO. Be aware that storing these backups at the same risk site does not fulfil your RPO!
The arrow to the right indicates the time needed to restore the service, the RTO. Be aware that this is measured from a non-technical / business perspective. So merely starting up the system is not enough. Users of the service need to be able to use it again!
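To make both objectives tangible, a small worked example (all timestamps are hypothetical ; GNU date is assumed) ; say the last off-site backup finished at 02:00, disaster struck at 14:30, and users could work again at 18:30 ;

```shell
# Hypothetical timeline of a disaster (GNU date assumed).
backup_done=$(date -u -d "2016-05-01 02:00" +%s)       # last off-site backup
crash=$(date -u -d "2016-05-01 14:30" +%s)             # disaster strikes
service_restored=$(date -u -d "2016-05-01 18:30" +%s)  # users can work again

# Achieved RPO : everything written after the last off-site backup is lost.
echo "Data loss (RPO): $(( (crash - backup_done) / 3600 )) hours"
# → Data loss (RPO): 12 hours

# Achieved RTO : time until users could actually use the service again.
echo "Downtime (RTO): $(( (service_restored - crash) / 3600 )) hours"
# → Downtime (RTO): 4 hours
```

If the business only tolerates one hour of data loss, this nightly-backup scheme clearly does not meet the RPO.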
In terms of costs, be aware that strict objectives imply more expensive solutions. The closer you want the objectives to the moment of the crash, the more expensive it will become. An RPO of 7 days will still allow tape backups to be taken off-site once a week, and 1 day will still allow a nightly replication, yet shorter time constraints imply near-online replication.