Tale of two ilities

[Image: BSOD 0x07B | Flickr - Justin Marty]

Several times in the past few weeks I’ve answered questions about the reliability and availability Key Performance Parameters (KPPs) for CG-LIMS.

Before describing the specific KPPs for CG-LIMS and how they’ve been derived with the end user in mind, I want to provide a little background.

There are two great resources to provide a foundation for understanding reliability and availability.

DoD Guide for Achieving Reliability, Availability, and Maintainability

http://www.acq.osd.mil/sse/docs/RAM_Guide_080305.pdf

DoD Reliability, Availability, Maintainability, and Cost Rationale Report Manual

http://www.acq.osd.mil/sse/docs/DoD-RAM-C-Manual.pdf

You can find a simple explanation of availability and reliability on page 2 of the RAM-C Report Manual:

Operational Availability indicates the percentage of time that a system or group of systems within a unit are operationally capable of performing an assigned mission and can be expressed as (uptime/(uptime + downtime)). Determining the optimum value for Operational Availability requires a comprehensive analysis of the system and its planned use as identified in the Concept of Operations (CONOPS), including the planned operating environment, operating tempo, reliability alternatives, maintenance approaches, and supply chain solutions.

Reliability measures the probability that the system will perform without failure over a specified interval under specified conditions. Reliability must be sufficient to support the warfighting capability needed in its expected operating environment.

When the Requirements IPT was developing the CONOPS, they needed to determine the performance needed from CG-LIMS as a system supporting 24 x 7 x 365 maintenance. Here’s the explanation in the CONOPS for how availability and reliability were derived based on end users’ needs.

2.4.3 Scheduling and Operations Planning

2.4.3.1 Availability

CG-LIMS users will typically expect the system to be available for use at all times. While usage will peak during a normal workday, maintenance will be performed on assets around the clock, requiring availability of all functions of CG-LIMS. Many units strategically perform maintenance with their duty sections during the night when operations are typically at the lowest levels.

It is critical that mission support personnel have access to the parts and technical data required to maintain assets. CG-LIMS will only be as usable as it is available. Access to maintenance schedules and procedures, parts availability information, parts requisitioning functionality, and technical documentation will be expected of the system during routine field level mission support operations. Furthermore, to maintain data integrity users will be expected to input their activities into CG-LIMS upon completion of a task.

A unit can routinely tolerate a scheduled 2 hour downtime once in any given week, allowing mission support personnel to pull a reasonable amount of information prior to the anticipated downtime so that mission support operations can continue without system access. System outages up to 4 hours could be tolerated, but not more than four times in a year, due to its significant impact on mission support operations. For this reason, CG-LIMS will require a high degree of reliability, discussed in Section 2.4.3.2.

Since CG-LIMS will be operational 24 hours a day, 365 days a year, total time used in availability calculations must be based on 8,760 hours per year.
The total downtime tolerable to system users is 120 hours per year (2 routine hours per week plus four 4 hour unexpected outages). Consequently, the total operational uptime targeted for the system is 8,640 (8,760 – 120 = 8,640) hours per year. The required operational availability for CG-LIMS is the total operational uptime targeted (8,640 hours) divided by the total time (8,760 hours), or 98.6%.

An even more desirable level of availability would be achieved if the scheduled downtime were limited to 1 hour per week. Maintaining the tolerable 16 hours of unexpected downtime over a year, this would increase the availability to 99.2%.
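
The availability arithmetic above can be checked with a few lines of Python. This is only a sketch of the uptime/(uptime + downtime) calculation quoted from the CONOPS; the function name and the 52-weeks-per-year figure are assumptions made for illustration (the 52 is implied by the 120-hour total, not stated outright).

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, as stated in the CONOPS


def operational_availability(scheduled_hours_per_week, unplanned_outages,
                             hours_per_outage, weeks_per_year=52):
    """Return (downtime_hours, uptime_hours, availability) for one year.

    weeks_per_year=52 is an assumption implied by the 120-hour total.
    """
    downtime = (scheduled_hours_per_week * weeks_per_year
                + unplanned_outages * hours_per_outage)
    uptime = HOURS_PER_YEAR - downtime
    return downtime, uptime, uptime / HOURS_PER_YEAR


for weekly_hours in (2, 1):
    downtime, uptime, availability = operational_availability(weekly_hours, 4, 4)
    print(f"{weekly_hours} h/week scheduled: downtime {downtime} h, "
          f"uptime {uptime} h, availability {availability:.1%}")
# 2 h/week scheduled: downtime 120 h, uptime 8640 h, availability 98.6%
# 1 h/week scheduled: downtime 68 h, uptime 8692 h, availability 99.2%
```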

2.4.3.2 Reliability

Reliability is a measure of the probability that the system will perform without failure over a specific interval. CG-LIMS reliability must also be sufficient to support the required availability. Reliability is generally expressed in terms of a Mean Time Between Failures (MTBF).

Once operational, the reliability can be measured as an inverse exponential function of Euler’s constant raised to an exponent of actual operating hours divided by the number of system failures experienced during a specific interval. The interval is determined by calculating the Mean Time Between Maintenance (MTBM).

Because CG-LIMS users are willing to tolerate weekly maintenance periods (of no greater than 2 hours each) and 4 unexpected failures per year, the MTBM requirement for CG-LIMS should not be less than 154.3 hours (target operational uptime ÷ total number of periods of downtime due to maintenance).

To keep unplanned downtime to a tolerable minimum, the requirement for MTBF should not be less than 2,160 hours (target operational uptime ÷ number of failure events).
Consequently, the required reliability, or probability that the system will operate without an unexpected failure between maintenance periods, should not be less than 93.1%, calculated by using the following equation:

Reliability = e^(-MTBM/MTBF)
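
For readers who want to reproduce the numbers, here is a minimal Python sketch of the MTBM, MTBF, and reliability calculation described in this section. The constant names are made up for illustration, and the count of 52 scheduled maintenance periods per year is an assumption inferred from the weekly schedule. Here e is the base of the natural logarithm, and the whole ratio MTBM/MTBF sits in the exponent.

```python
import math

TARGET_UPTIME_HOURS = 8640  # 8,760 - 120, from the availability section
MAINTENANCE_PERIODS = 52    # assumed: one scheduled period per week
FAILURE_EVENTS = 4          # tolerated unexpected outages per year

# Every scheduled period and every failure counts as a period of downtime.
downtime_events = MAINTENANCE_PERIODS + FAILURE_EVENTS

mtbm = TARGET_UPTIME_HOURS / downtime_events  # Mean Time Between Maintenance
mtbf = TARGET_UPTIME_HOURS / FAILURE_EVENTS   # Mean Time Between Failures

# Probability of running one MTBM-long interval without an unexpected failure.
reliability = math.exp(-mtbm / mtbf)

print(f"MTBM = {mtbm:.1f} h")                # MTBM = 154.3 h
print(f"MTBF = {mtbf:.0f} h")                # MTBF = 2160 h
print(f"Reliability = {reliability:.1%}")    # Reliability = 93.1%
```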

Folks seem to be confused that the required reliability (93.1%) is lower than the required availability (98.6%).

Reliability is dependent on failure tolerance during operations, whereas availability is dependent on both scheduled maintenance and failure.

The MTBM and MTBF behind the reliability figure are based on the same assumptions about acceptable downtime for scheduled maintenance and failure that drive availability. We’re planning for the system to be down for a short period of time once a week for scheduled maintenance and up to four times a year for unexpected failures. The reliability is the probability the system will perform without failure for that week (or for less than a week, in the four cases a year when a failure cuts the interval short).
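
One way to see that the two numbers measure different things is to change only the length of the weekly maintenance window and watch what moves. The sketch below is illustrative, not part of the CONOPS: the function name and the 52 scheduled periods per year are assumptions. Because the target uptime appears in both MTBM and MTBF, it cancels out of the ratio, so with this formula the reliability figure depends only on how many downtime events are failures (4 of 56), while availability depends on how many hours the system is down.

```python
import math

HOURS_PER_YEAR = 8760


def yearly_figures(scheduled_hours_each, scheduled_events=52,
                   failures=4, hours_per_failure=4):
    """Return (availability, reliability) under the CONOPS formulas.

    scheduled_events=52 is an assumption (one maintenance period per week).
    """
    downtime = (scheduled_hours_each * scheduled_events
                + failures * hours_per_failure)
    uptime = HOURS_PER_YEAR - downtime
    availability = uptime / HOURS_PER_YEAR
    mtbm = uptime / (scheduled_events + failures)
    mtbf = uptime / failures
    reliability = math.exp(-mtbm / mtbf)
    return availability, reliability


for hours in (2, 1):
    availability, reliability = yearly_figures(hours)
    print(f"{hours} h windows: availability {availability:.1%}, "
          f"reliability {reliability:.1%}")
# 2 h windows: availability 98.6%, reliability 93.1%
# 1 h windows: availability 99.2%, reliability 93.1%
```

Shortening the maintenance window raises availability, but the reliability figure stays put because the number of tolerated failures per maintenance interval has not changed.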

If this post raises more questions, please fire away in the comments and we’ll try to explain better. I wanted to start the conversation by making sure you had links to the two source docs and the explanation in the CONOPS.

10/20/2010 dpt: I made minor edits to final paragraph based on input from Jim Sylvester, Sponsor’s Rep.

2 Responses to “Tale of two ilities”


  1. nathanial.r.williams October 20, 2010 at 6:48 am

    Classification: UNCLASSIFIED

    I couldn’t have said it better myself 😉

  2. fidel.d.manansala October 29, 2010 at 9:30 pm

    Classification: UNCLASSIFIED

    Agree with Than, well stated and explained. Another perspective on availability of an IT system is that it is relevant to the end user location and connection node. Since it will depend greatly on the CG network, overall end user experience will depend on items beyond the scope of CG-LIMS. I’ve seen really good software “appear” to be problematic during deployment on the CGDN. It will be easy for users to blame CG-LIMS for a sluggish network. To summarize, Engineering is key.

