Everyone who works with cloud and has been involved with tenders has encountered the following question (in one form or another): “Has the cloud datacenter achieved a Tier 3 (or higher) classification?” In today’s post we will delve into the specifics behind this ask: why do organizations ask the question, and how does it relate to cloud?
What is a “Tier 3 Datacenter”?
To better understand the concept of data-center tiers, it helps to know that several organizations (like the Telecommunications Industry Association (TIA) and the Uptime Institute) have defined standards for data-centers.
Uptime Institute created the standard Tier Classification System as a means to effectively evaluate data center infrastructure in terms of a business’ requirements for system availability. The Tier Classification System provides the data center industry with a consistent method to compare typically unique, customized facilities based on expected site infrastructure performance, or uptime. Furthermore, Tiers enables companies to align their data center infrastructure investment with business goals specific to growth and technology strategies.
Source: https://uptimeinstitute.com/tiers
The classification consists of several tiers. Four tiers are defined by the Uptime Institute:
- Tier I: lacks redundant IT equipment, with 99.671% availability, maximum of 1729 minutes annual downtime
- Tier II: adds redundant infrastructure, with 99.741% availability (1361 minutes)
- Tier III: adds more data paths, duplicate equipment, and requires that all IT equipment be dual-powered, with 99.982% availability (95 minutes)
- Tier IV: all cooling equipment is independently dual-powered; adds fault tolerance, with 99.995% availability (26 minutes)
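The downtime figures above follow directly from the availability percentages. A minimal sketch in Python, assuming a 365-day year of 525,600 minutes:

```python
# Convert a tier's availability percentage into expected annual downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def annual_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for tier, pct in [("Tier I", 99.671), ("Tier II", 99.741),
                  ("Tier III", 99.982), ("Tier IV", 99.995)]:
    print(f"{tier}: {annual_downtime_minutes(pct):.0f} minutes/year")
```

Running this reproduces the 1729 / 1361 / 95 / 26 minute figures from the list above.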
So it is a classification that helps organizations understand the quality of a data-center and take a given availability into account. It is important to understand, though, that this relates to “datacenter housing” (colocation) and not to the cloud service models! Why does this matter? Because on top of that housing, cloud providers deliver additional services to achieve service models like IaaS, PaaS, SaaS, …
Important Note – Local Regulation & Tier IV
In countries like Belgium and the Netherlands, it is impossible to achieve a Tier IV classification. This is because Tier IV requires two fully independent power sources, while regulation in these countries limits power distribution to one organization per geographic area.
Update: Sylvie pointed me to the following:
“If you read the topology, you will see that Engine-generator systems are considered the primary power source for the data center. The local power utility is an economic alternative. Disruptions to the utility power are not considered a failure, but rather an expected operational condition for which the site must be prepared.”
So the statement above is not correct (anymore?): since the engine generators are considered the primary power source, a single utility feed does not block a Tier IV classification.
System Availability & Reliability : The basics…
A bit less than 8 years ago I posted about the concepts of System Availability & Reliability. These are some of the fundamental concepts when talking about service levels.
- Serial Connection: when systems depend on each other, the achievable service level decreases
=> “Serial Availability = Availability X * Availability Y * Availability Z”
- Parallel Connection: when a system has parallel paths (e.g. multiple nodes), the achievable service level increases
=> “Parallel Availability = 1 – ( (1 – Availability X) * (1 – Availability Y) * (1 – Availability Z) )”
- System Reliability: how do we get to a given SLA? By estimating (or gathering) the MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values for the specific component.
=> “System Availability = MTBF / ( MTBF + MTTR )”
So dependencies reduce the service level and redundant components increase it. And you can calculate the system availability (a service level) by estimating the period between failures and the time needed to fix a failure (end to end).
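The three formulas translate into a few lines of Python (a sketch; the function names are my own):

```python
import math

def serial_availability(*availabilities: float) -> float:
    """Dependent components in series: multiply the individual availabilities."""
    return math.prod(availabilities)

def parallel_availability(*availabilities: float) -> float:
    """Redundant paths in parallel: one minus the product of the unavailabilities."""
    return 1 - math.prod(1 - a for a in availabilities)

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """System availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For example, two 99% nodes in series drop to 98.01%, while the same two nodes in parallel climb to 99.99%.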
System Availability & Reliability : Why should I care?
To make the previous section a bit more concrete, I have prepared some calculations… Here you can (for instance) see an SLA for “Region A” and one for “Region B”, where the parallel SLA is calculated in the “Cross Region” column (based on the SLAs for regions A and B). The serial calculation is used for the composite SLAs, as it tackles the end-to-end SLA across all the components involved.
I will leave the green bits for later on, and kick things off with the blue section… Imagine a “simple” cloud-native architecture consisting of Azure Front Door, App Service, SQL DB and DNS. If we look at the SLAs linked to those (TIP: check out azurecharts.com), the composite SLA of that solution is 99.84% for a single region. If we adjust our design to go cross-region, we can upgrade this to 99.999744%.
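The cross-region figure is just the parallel formula applied to two identical regions, taking the 99.84% single-region composite from the spreadsheet as a given:

```python
# Single-region composite SLA for Front Door + App Service + SQL DB + DNS,
# taken from the spreadsheet calculation (99.84%).
single_region = 0.9984

# Two independent regions in parallel: 1 minus the chance both are down at once.
cross_region = 1 - (1 - single_region) ** 2

print(f"{cross_region * 100:.6f}%")  # 99.999744%
```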
Now let us do the same exercise (red/orange section) for a common cloud service model called “IaaS” (Infrastructure-as-a-Service). What ingredients do we need? We start with housing (Tier 3) and add network, storage (SAN) and a hypervisor to the mix. Here I have assumed that each added component has an individual system availability of 99.99%. This means the composite of the solution is 99.95% for a single region, and it goes up to 99.999977% cross-region.
How should we look at the assumption of 99.99% for the components used in the red/orange section? For this we can look at the grey section, which covers the calculation of service availability given a certain MTBF and MTTR. The 99.99% matches the scenario where we have a failure once every 10 years and it takes us 8 hours to fix the outage. Now let us do a reality check… In the last 10 years, how many outages have you had for these components? Is this a correct assumption to make? 😉
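The grey-section math checks out with the MTBF/MTTR formula from the basics section:

```python
# Reality check on the 99.99% assumption: one failure every 10 years,
# with 8 hours to repair.
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours

mtbf = 10 * HOURS_PER_YEAR    # 87,660 hours between failures
mttr = 8                      # hours to repair

availability = mtbf / (mtbf + mttr)
print(f"{availability * 100:.2f}%")  # 99.99%
```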
Looping back to the green section… Here we have the SLAs for virtual machines (IaaS) in Azure. The composite SLA for our orange section was 99.95% with the optimistic assumptions made earlier on, which matches the Availability Set SLA for an Azure VM. If we go “cross data-center” (Availability Zone) or “cross region”, then we will be able to go beyond that.
Want to play around with it yourself? Here is the Excel file… 😉
So… what about the data-center quality?
Availability is one of the cornerstones of an information security strategy (“CIA” = Confidentiality / Integrity / Availability), and you can expect these aspects to be audited, of course! In light of this, I advise everyone to take a look at the SOC 2 Type 2 report. (Source: Service Trust Portal)
The document is A-W-E-S-O-M-E! Okay, I must admit… With its 300+ pages, it is not light reading material. Though, until now, it has covered all the security questions I ever needed to answer (except one, which was very, very industry-specific). So I strongly advise you to check out this document and grasp the dimensions it covers! For example, to be a Tier 3 data-center, one would need to have designed everything around redundant components to sustain isolated faults.
This is basically covered by security control “DS – 6”
My advice is to discuss why the Tier 3 certification is important, as the scope of that certification is typically linked to housing services. Housing is part of the composition of the cloud services offered, of course, though it is not a service that is offered directly. Due to this, there is no need to obtain such a certification.
When looking at the composite calculation of an IaaS mock-up with a Tier 3 datacenter in the equation, we arrive at 99.95%. This basically matches the 99.95% SLA for Availability Sets with Azure Virtual Machines.
For everything security-related, I cannot stress enough the value of the SOC 2 Type 2 report (which can be found in the Service Trust Portal). It is a source of almost infinite information on all security aspects, which of course also covers everything related to availability!