Storage Sizing 101 – How to choose the right storage for you?

What will influence the outcome?
When looking for a new storage device, be aware that there are three facets that will influence your purchase:
A) The needed capacity – The number of Gigabytes or Terabytes you need.
B) The needed performance – The number of IOPS you need.
C) The needed availability – How much downtime can you accept?

  • Capacity
    Capacity Indication
    When going for a given storage option, always be sure to request the net capacity (in combination with the needed performance/availability). The architecture of some products may result in a different net-to-gross ratio.

    Deduplication
    Despite the technology behind this feature, in essence it’s just about capacity. Bear in mind that the field shows that the dedup factor is mostly 1:7, in contrast to the 1:XXX ratios some vendors might promise. Just think logically here… If your base volume has capacity X, then the capacity of all your deduplicated data will be at least X plus the differential between those environments.

    Thin Provisioning
    Yet again a feature that’s aimed at capacity. In my opinion, it’s only valuable for VDI environments or service providers. If you choose to use this feature, be sure to have the proper monitoring & escalation procedures in place. For me this is a feature where you are living beyond your means.

    Thin Cloning
    Some vendors will offer you the ability to use a snapshot as a writeable volume for testing purposes. Similar to deduplication, this is also a capacity feature.

  • Performance
    Performance Benchmark: MB/s vs IOPS
    Capacity is mostly an easy one to define, yet performance is usually quite difficult. Some (application) vendors will supply you with guidelines for the needed IOPS for a given number of users. SAP for instance has its “SAPS”.

    Be careful with measuring in MB/s. The MB/s figure is achieved by multiplying the average block/transfer size by the IOPS. So requesting MB/s gives vendors the ability to turn the performance benchmarks in their direction by “simply” changing the block size. While this might not be so bad for large sequential data, it will hurt your database performance.
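The arithmetic above can be sketched in a few lines. This is purely illustrative: the IOPS rates and block sizes are made-up numbers chosen to show how the same MB/s headline can hide very different IOPS figures.

```python
def mbps(iops: int, block_size_kb: int) -> float:
    """Throughput in MB/s for a given IOPS rate and block size."""
    return iops * block_size_kb / 1024

# A database doing small random I/O vs a backup stream doing large sequential I/O
# (hypothetical workloads for the sake of the comparison):
db = mbps(iops=5000, block_size_kb=8)       # 5000 IOPS * 8 KB  ~ 39 MB/s
backup = mbps(iops=400, block_size_kb=256)  # 400 IOPS * 256 KB = 100 MB/s

# The backup "wins" on MB/s while delivering less than a tenth of the IOPS,
# which is why quoting MB/s alone lets a vendor pick the block size that
# flatters their box.
print(f"database: {db:.0f} MB/s at 5000 IOPS")
print(f"backup:   {backup:.0f} MB/s at 400 IOPS")
```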

    Spindle Count/Types
    The “Spindle Count” (or the “number of disks”) is the most basic performance sizing for all vendors. Where a 7.2K SATA drive will deliver about 75 to 100 IOPS, a 10K SAS drive will deliver about 140 IOPS and a 15K SAS drive something between 175 and 200 IOPS. Just multiply the number of disks by the average IOPS they produce, and you’ll have (give or take) the maximum IOPS you can achieve “spindle-wise”. Of course, things like bandwidth (1/2/4/8 Gbps or 1/10 Gbps links), controller cache or backplane speeds will have their effect too.
    => Devices that leverage this kind of performance calculation (for example, but not limited to): HP P2000/MSA, HP P6000/EVA, Dell MD, …
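A back-of-the-envelope version of this spindle calculation, using the rough per-disk IOPS figures from above (illustrative averages, not vendor specs), with a hypothetical mixed shelf as input:

```python
# Rough average IOPS per spindle type (midpoints of the ranges above).
AVG_IOPS = {
    "7.2K SATA": 85,   # ~75-100 IOPS
    "10K SAS": 140,
    "15K SAS": 185,    # ~175-200 IOPS
}

def raw_iops(shelf: dict) -> int:
    """Maximum 'spindle-wise' IOPS: disk count times average IOPS per type."""
    return sum(count * AVG_IOPS[disk_type] for disk_type, count in shelf.items())

# e.g. a hypothetical shelf of 12 x 15K SAS and 12 x 7.2K SATA disks:
shelf = {"15K SAS": 12, "7.2K SATA": 12}
print(raw_iops(shelf))  # 12*185 + 12*85 = 3240 IOPS, before cache/RAID/bandwidth effects
```

Remember this is only the ceiling the spindles themselves can deliver; RAID write penalties, controller cache and link bandwidth will move the real number.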

    Storage Tiering
    As just said about the spindle types, each disk has a different performance profile. Combine that with the fact that only a fraction of your data actually needs top performance, and you arrive at the idea of providing the performance (thus, the more expensive disks) to the data that needs it and using “cheap” disks for the data that doesn’t. So for these devices, there will be high-performance tiers (SSD), performant tiers (SAS) and low-performance tiers (SATA). Some vendors will dynamically profile your data and do “automated tiering” so you don’t have to. (Risk scenario: a heavy report may only be run once a month. With automated storage tiering, it’ll probably be located on your low-performance tiers due to the high inactivity. Yet you’ll need the top performance once a month…)
    => Vendors that leverage this feature: HP 3PAR, Dell Compellent, EMC FAST, …

    Acceleration Modules
    Other vendors will tackle the high-performance part by adding additional caching mechanisms to their solution. These will be used as a buffer for peak moments, thus increasing the IOPS for those burst moments.
    => Vendors that leverage this feature: Netapp FlashCache / Performance Acceleration Modules (PAM), EMC FastCache, … (sidenote: the difference between EMC FAST, EMC FastCache & Netapp FlashCache)

  • Availability
    Availability Calculation
    A blog post of mine a while back explained system availability & reliability. The same goes for storage devices, so be sure to ask each vendor for the MTBF of their suggested products and calculate the expected availability in combination with the suggested support contract.

    Parallel Reliability (“Active / Active”)
    Remember the “parallel” part in the system reliability/availability post? Most storage solutions WILL become your SINGLE POINT OF FAILURE. You can decrease the MTTR by adding an additional device as your failover, yet true parallel execution cannot be achieved with many vendors.
    => Netapp MetroCluster & HP P4000/Lefthand are among the few that I know of which can be set up in an active/active manner. Most other devices will always come in an active/passive setup.

    Enterprise Grade
    Ever noticed the terms “ENT” & “MDL” in quotes? You probably have… ENT is short for “Enterprise Grade” and “MDL” stands for “Midline Grade”. And of course you also have the consumer-grade material. As you’d probably expect, the price difference also reflects a quality difference. Yet how to measure this? Once again, this boils down to the MTBF. The enterprise-grade material will have the best MTBF statistics, where consumer grade… has the worst. So for your primary storage, don’t cheap out with MDL (or consumer grade…), use ENT. For your secondary storage, feel free to use MDL and protect yourself with a more fault-tolerant RAID level, if you can accept the performance impact it will induce.

Don’t go on your hunch and on sales talk! Define your actual needs (requirements) and make your procurement a pure purchasing matter. Sometimes technically minded people get lost in all the nice features. Yet as you’ve just read, those features come down to achieving high-level business goals. If you define those, then your process will smooth out. It will be easier to get your requirements across to your internal business partners, as the comparison between vendors will also become more transparent.

System reliability & availability

System Availability
System availability is calculated from the interconnection of all its parts. These parts can be connected in series (“dependency”) or in parallel (“clustering”). In essence, if the failure of one component leads to the combination being unavailable, then it’s considered a serial connection. If the failure of one component leads to the other component taking over, then it’s considered a parallel connection.

Serial connection
If two components are connected in series, then the availability of the whole will always be lower than the availability of its individual components.

When both components have an availability of 99.75%, then the serial combination of both will have an availability of 99.50%. This value can be calculated by multiplying both availabilities. If there are three systems in a serial combination, where each system has an availability of 99.75%, then the combination will have an availability of 99.2519%.

Serial Availability = Availability X * Availability Y * Availability Z

Parallel connection
If two components are connected in parallel, then the availability of the whole will always be higher than the availability of its individual components.

When both components have an availability of 99.75%, then the parallel combination of both will have an availability of 99.999375%. This value can be calculated by multiplying the unavailabilities of both components and subtracting the result from 1. If there are three systems in a parallel combination, where each system has an availability of 99.75%, then the combination will have an availability of 99.9999984%.

Parallel Availability = 1 - ( (1 - Availability X) * (1- Availability Y) * (1 - Availability Z) )
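The two formulas above translate directly into code. A minimal sketch, reproducing the worked numbers from the serial and parallel examples:

```python
def serial(*availabilities: float) -> float:
    """Availability of components in series: the product of all availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """Availability of components in parallel: 1 minus the product of unavailabilities."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1 - a)
    return 1 - unavailable

print(f"{serial(0.9975, 0.9975):.4%}")    # 99.5006%
print(f"{parallel(0.9975, 0.9975):.6%}")  # 99.999375%
```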

System Reliability
Now how do you get the availability of one component? This can be done by estimating (or gathering) the MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values for the specific component. Once these values are known, use the following formula:

System Availability = MTBF / ( MTBF + MTTR )

The MTBF is the value that indicates how many hours (on average) pass between system failures. The MTTR is the time (on average) needed to fix such a failure; it consists of the time spent identifying the problem and restoring the system.

Practical Example
Let’s say we have two (application) servers and one (database) server. If the application server has an MTBF of one year (8760h) and an MTTR of 12h, then its availability would be 99.86320%. For the database, an MTBF of three years (26280h) and an MTTR of one week (168h) results in an availability of 99.36479%.

That would mean that the cluster of application servers would get an increased availability of 99.9998129% due to the parallel setup. Yet the database server that is placed in series after this cluster will reduce the overall availability to 99.3646053%.
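The whole practical example can be recomputed in a few lines, chaining the MTBF/MTTR formula with the parallel and serial rules from above:

```python
def availability(mtbf_h: float, mttr_h: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

app = availability(mtbf_h=8760, mttr_h=12)    # one year MTBF, 12h MTTR  -> ~99.8632%
db = availability(mtbf_h=26280, mttr_h=168)   # three years MTBF, one week MTTR -> ~99.3648%

app_cluster = 1 - (1 - app) ** 2              # parallel pair of app servers -> ~99.99981%
chain = app_cluster * db                      # in series with the database  -> ~99.3646%

print(f"app server:  {app:.5%}")
print(f"app cluster: {app_cluster:.7%}")
print(f"end to end:  {chain:.7%}")
```

Note how the end-to-end figure is essentially the database’s own availability: the weakest serial link dominates, no matter how well you cluster the rest.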