DevOps : What’s the impact on my ITIL/COBIT/… based shop?

Introduction

When talking to customers about DevOps, I often get the following two questions:

  • Does this mean I have to get rid of ITIL / COBIT / …?
  • Do I have to start moving people around and creating new units?

The quick answer is: no.

A typical parable in any project methodology is:

How do you eat an elephant? Take snack sized bites and work your way through it.

And the same goes for DevOps!

RTO, RPO, … What the O?

Ever heard of the terms RTO (Recovery Time Objective) and RPO (Recovery Point Objective)?

To explain it, let us take a look at this mockup…

20130117-222210.jpg

In the middle, the crash marks the moment disaster struck. Going back in time from there, we reach the latest point at which a backup was taken off-site (relative to the affected site). The time between that backup and the crash should be less than the maximum tolerable period in which data might be lost from an IT service due to a major incident. So the arrow towards the left indicates the RPO. Be aware that storing these backups on the same site, exposed to the same risk, does not fulfil your RPO!

The arrow to the right indicates the RTO: the time needed to restore the service. Be aware that this is measured from a non-technical / business perspective, so merely starting up the system is not enough. Users of the service need to be able to use it again!

In terms of cost, be aware that stricter objectives imply more expensive solutions. The closer you want the objectives to the moment of the crash, the more expensive it becomes. An RPO of 7 days still allows tape backups to be taken off-site once a week, and an RPO of 1 day still allows a nightly replication, but shorter time constraints imply near-online replication.
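As a rough illustration of the two objectives, here is a minimal Python sketch that checks a single incident against an RPO and an RTO. The function names, the timestamps and the chosen objectives are all hypothetical and only serve the example.

```python
from datetime import datetime, timedelta


def meets_rpo(last_offsite_backup: datetime, crash: datetime, rpo: timedelta) -> bool:
    """True if the data-loss window (last off-site backup until crash) fits the RPO."""
    return crash - last_offsite_backup <= rpo


def meets_rto(crash: datetime, usable_again: datetime, rto: timedelta) -> bool:
    """True if users could work with the service again within the RTO."""
    return usable_again - crash <= rto


# Hypothetical incident: all timestamps are made up for this example.
crash = datetime(2013, 1, 17, 22, 22)
last_offsite_backup = datetime(2013, 1, 14, 2, 0)
usable_again = datetime(2013, 1, 18, 10, 0)

weekly_ok = meets_rpo(last_offsite_backup, crash, timedelta(days=7))
daily_ok = meets_rpo(last_offsite_backup, crash, timedelta(days=1))
rto_ok = meets_rto(crash, usable_again, timedelta(hours=8))

print(weekly_ok)  # True: a weekly off-site tape run would have been enough
print(daily_ok)   # False: the last off-site copy is almost four days old
print(rto_ok)     # False: the service was only usable again after more than 11 hours
```

Note that `usable_again` is the moment users can work again, not the moment the system was booted.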

System reliability & availability

System Availability
System availability is calculated from the way its parts are interconnected. These parts can be connected in series (“dependency”) or in parallel (“clustering”). In essence, if the failure of one component leads to the combination being unavailable, it is considered a serial connection. If the failure of one component leads to the other component taking over, it is considered a parallel connection.

Serial connection
If two components are connected in series, then the availability of the whole will always be lower than the availability of its individual components.

When both components have an availability of 99,75%, then the serial combination of both will have an availability of 99,50%. This value can be calculated by multiplying both availabilities. If there are three systems in a serial combination, where each system has an availability of 99,75%, then the combination will have an availability of 99,2519%.

Serial Availability = Availability X * Availability Y * Availability Z
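The formula above can be sketched in a few lines of Python. The function name `serial_availability` is my own choice, and availabilities are passed as fractions (0.9975 instead of 99,75%).

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of components in series: the product of all availabilities."""
    return prod(availabilities)


two_in_series = serial_availability(0.9975, 0.9975)
three_in_series = serial_availability(0.9975, 0.9975, 0.9975)

print(f"{two_in_series:.4%}")    # 99.5006%
print(f"{three_in_series:.4%}")  # 99.2519%
```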

Parallel connection
If two components are connected in parallel, then the availability of the whole will always be higher than the availability of its individual components.

When both components have an availability of 99,75%, then the parallel combination of both will have an availability of 99,999375%. This value can be calculated by multiplying the unavailability of both components (0,25% each) and subtracting the result from 100%. If there are three systems in a parallel combination, where each system has an availability of 99,75%, then the combination will have an availability of 99,9999984%.

Parallel Availability = 1 - ( (1 - Availability X) * (1- Availability Y) * (1 - Availability Z) )
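A small Python sketch of the formula above, again with availabilities as fractions. The function name `parallel_availability` is an assumption for the example.

```python
from math import prod


def parallel_availability(*availabilities: float) -> float:
    """Availability of redundant components: 1 minus the product of the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


two_in_parallel = parallel_availability(0.9975, 0.9975)
three_in_parallel = parallel_availability(0.9975, 0.9975, 0.9975)

print(f"{two_in_parallel:.6%}")    # 99.999375%
print(f"{three_in_parallel:.7%}")  # 99.9999984%
```

The combination is only down when every component is down at the same time, which is why the unavailabilities are multiplied.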

System Reliability
Now how do you get the availability of a single component? This can be done by estimating (or gathering) the MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values for the specific component. Once these values are known, use the following formula:

System Availability = MTBF / ( MTBF + MTTR )

The MTBF indicates how many hours (on average) pass between system failures. The MTTR is the time (on average) needed to fix such a failure. The latter consists of the time needed to identify the problem and to restore the system to a working state.
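As a minimal sketch, the formula translates directly into Python. The function name `availability` and the input validation are my own additions, and both values are assumed to be in hours.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of one component: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# A component that fails once a year on average and takes 12 hours to repair.
yearly_failure = availability(8760, 12)
print(f"{yearly_failure:.5%}")  # 99.86320%
```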

Practical Example
Let’s say we have two (application) servers and one (database) server. If each application server has an MTBF of one year (8760h) and an MTTR of 12h, then its availability is 99,86320%. For the database, an MTBF of three years (26280h) and an MTTR of one week (168h) result in an availability of 99,36479%.

That means the cluster of application servers gets an increased availability of 99,9998129% thanks to the parallel setup. Yet the database server, which sits in series behind this cluster, reduces the overall availability to 99,3646053%.
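The whole example can be reproduced with a short Python sketch that combines the three formulas. The helper names are hypothetical; the MTBF and MTTR figures are the ones from the example.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial(*availabilities: float) -> float:
    """Components that depend on each other: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components: 1 minus the product of the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


app_server = availability(8760, 12)      # MTBF one year, MTTR 12 hours
db_server = availability(26280, 168)     # MTBF three years, MTTR one week

app_cluster = parallel(app_server, app_server)  # two application servers
service = serial(app_cluster, db_server)        # database in series behind the cluster

print(f"application server : {app_server:.5%}")   # 99.86320%
print(f"database server    : {db_server:.5%}")    # 99.36479%
print(f"application cluster: {app_cluster:.7%}")  # 99.9998129%
print(f"complete service   : {service:.7%}")      # 99.3646053%
```

The result shows that the single database server dominates the overall figure: clustering the application tier barely helps as long as the database stays a single point of failure.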

The Service Catalog

A service catalog (or catalogue), as defined in Information Technology Infrastructure Library Service Design, is a list of services that an organization provides, often to its employees or customers. Each service within the catalog typically includes:

  • A description of the service
  • Timeframes or service level agreement for fulfilling the service
  • Who is entitled to request/view the service
  • Costs (if any)
  • How to fulfill the service

Source : Wikipedia
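To make the listed elements a bit more tangible, here is a minimal Python sketch of a single catalog entry as a data structure. The class, its field names and the sample service are all hypothetical; they merely mirror the bullet points above and are not taken from ITIL itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceCatalogEntry:
    name: str
    description: str                      # a description of the service
    service_level: str                    # timeframe or SLA for fulfilling the service
    fulfilment: str                       # how to fulfil the service
    entitled_groups: List[str] = field(default_factory=list)  # who may request/view it
    cost: Optional[str] = None            # costs, if any

    def can_request(self, group: str) -> bool:
        """True if the given group is entitled to request this service."""
        return group in self.entitled_groups


# Made-up sample entry, purely for illustration.
mailbox = ServiceCatalogEntry(
    name="Mailbox",
    description="Personal mailbox for an employee",
    service_level="Delivered within 2 working days",
    fulfilment="Request via the service desk",
    entitled_groups=["employees"],
)

print(mailbox.can_request("employees"))    # True
print(mailbox.can_request("contractors"))  # False
```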

A service catalog is a great way to identify the services which are provided by an IT department. It’s a detailed listing in which you define the scope of each service. Personally, I found the following Service Catalog Example (Source : RSteinberg) a very good start to work from! With that skeleton you can easily start creating your own.

Yet be aware! To be effective, the Service Catalog must be understood and used by the business. All too often, IT departments invest countless hours in creating Service Catalog documentation that few customers will ever read or use. Ultimately, the majority of these static Service Catalogs are rarely seen or read by either end users or business decision-makers, and thus have little to no impact. So if you’re planning to do this, be sure to read the following article too, which will guide you through the process a bit: How To Produce An Actionable IT Service Catalog (Rodrigo Fernando Flores)