System reliability & availability

System Availability
System availability is determined by how its parts are interconnected. These parts can be connected in serial (“dependency”) or in parallel (“clustering”). In essence, if the failure of one component makes the whole combination unavailable, then it’s considered a serial connection. If the failure of one component leads to the other component taking over, then it’s considered a parallel connection.

Serial connection
If two components are connected in serial, then the availability of the whole will always be lower than the availability of its individual components.

When both components have an availability of 99,75%, then the serial combination of both will have an availability of 99,50%. This value can be calculated by multiplying both availabilities. If there are three systems in a serial combination, where each system has an availability of 99,75%, then the combination will have an availability of 99,2518%.

Serial Availability = Availability X * Availability Y * Availability Z
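The serial formula can be checked with a few lines of Python. This is only a minimal sketch: the function name serial_availability is made up for this illustration, and availabilities are passed as fractions (0.9975 rather than 99,75%).

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a serial chain: every component must be up,
    so the individual availabilities are multiplied."""
    return prod(availabilities)


# Two and three components of 99.75% each, as in the text above.
print(f"{serial_availability(0.9975, 0.9975):.6%}")          # 99.500625%
print(f"{serial_availability(0.9975, 0.9975, 0.9975):.6%}")  # 99.251873%
```

Every extra component in the chain multiplies in another factor below 1, which is why the result keeps dropping.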

Parallel connection
If two components are connected in parallel, then the availability of the whole will always be higher than the availability of its individual components.

When both components have an availability of 99,75%, then the parallel combination of both will have an availability of 99,999375%. This value can be calculated by multiplying the unavailability (1 - availability) of both components and subtracting the result from 1. If there are three systems in a parallel combination, where each system has an availability of 99,75%, then the combination will have an availability of 99,9999984%.

Parallel Availability = 1 - ( (1 - Availability X) * (1 - Availability Y) * (1 - Availability Z) )
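The parallel formula can be sketched the same way. The function name parallel_availability is an assumption for this illustration, and the sketch assumes the components fail independently of each other, which the formula itself also implies.

```python
from math import prod


def parallel_availability(*availabilities: float) -> float:
    """Availability of a parallel group: it is only down when every
    component is down at once, so the unavailabilities are multiplied
    and the product is subtracted from 1."""
    return 1 - prod(1 - a for a in availabilities)


# Two and three components of 99.75% each.
print(f"{parallel_availability(0.9975, 0.9975):.6%}")          # 99.999375%
print(f"{parallel_availability(0.9975, 0.9975, 0.9975):.7%}")  # 99.9999984%
```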

System Reliability
Now how do you get the availability of one component? This can be done by estimating (or gathering) the MTBF (mean time between failures) and MTTR (mean time to repair) values for the specific component. Once these values are known, use the following formula:

System Availability = MTBF / ( MTBF + MTTR )

The MTBF indicates how many hours (on average) pass between system failures. The MTTR is the time (on average) needed to fix such a failure; it covers both identifying the problem and restoring the system to a working state.
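As a quick illustration, the formula translates into a one-line function. The name availability and the input values (an MTBF of 1000h and an MTTR of 10h) are purely hypothetical and only serve to show the calculation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up: MTBF / (MTBF + MTTR).
    Both values must use the same unit (hours here)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 1000h on average, takes 10h to fix.
print(f"{availability(1000, 10):.4%}")  # 99.0099%
```

Note that halving the MTTR improves availability about as much as doubling the MTBF, so faster repairs can be as valuable as fewer failures.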

Practical Example
Let’s say we have two (application) servers and one (database) server. If each application server has an MTBF of one year (8760h) and an MTTR of 12h, then its availability is 99,86320%. For the database, an MTBF of three years (26280h) and an MTTR of one week (168h) result in an availability of 99,36479%.

That would mean that the cluster of the two application servers gets an increased availability of 99,9998129% due to the parallel setup. Yet the database server, which is connected in serial behind this cluster, reduces the availability of the whole system to 99,3646053%.
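The whole example can be reproduced end to end with a short script. This is a sketch only: the helper names (availability, serial, parallel) are made up for this illustration, and the failures of the individual servers are assumed to be independent.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one component must be up: 1 minus the product
    of the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


app = availability(8760, 12)      # one application server
db = availability(26280, 168)     # the database server
app_cluster = parallel(app, app)  # two application servers in parallel
system = serial(app_cluster, db)  # cluster followed by the database

print(f"application server:  {app:.5%}")          # 99.86320%
print(f"database server:     {db:.5%}")           # 99.36479%
print(f"application cluster: {app_cluster:.7%}")  # 99.9998129%
print(f"whole system:        {system:.7%}")       # 99.3646053%
```

The single database server dominates the result: the whole system ends up slightly less available than the database on its own.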
