Title: Maintaining Mission Critical Systems in a 24/7 Environment
Author: Peter M. Curtis
Publisher: John Wiley & Sons Limited
Genre: Physics
ISBN: 9781119506140
Figure 3.1 “Seven steps” is a continuous cycle of evaluation, implementation, preparation, and maintenance
(Source: Courtesy of PMC Group One, LLC)
Table 3.1 Law of Nines
% Uptime/Reliability Level | Downtime Per Year |
---|---|
99% | 87.6 hours |
99.9% | 8.76 hours |
99.99% | 52 minutes |
99.999% | 5.25 minutes |
99.9999% | 32 seconds |
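The downtime figures in Table 3.1 follow directly from the 8,760 hours in a (non-leap) year. The minimal Python sketch below reproduces them. The function names are illustrative. The table rounds some values: 99.99% works out to about 52.6 minutes, and 99.9999% to about 31.5 seconds.

```python
# Reproduce the "Law of Nines" downtime figures for a 24/7 facility.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_per_year_hours(uptime_percent: float) -> float:
    """Annual downtime, in hours, permitted by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def format_downtime(hours: float) -> str:
    """Express a downtime in hours, minutes, or seconds, whichever reads best."""
    if hours >= 1:
        return f"{hours:.2f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.2f} minutes"
    return f"{minutes * 60:.1f} seconds"


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999, 99.9999):
        print(f"{level}% uptime -> {format_downtime(downtime_per_year_hours(level))}")
```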
3.2 Companies’ Expectations: Risk Tolerance and Reliability
In order to design a building with the appropriate level of reliability, a company must first assess the cost of downtime and determine its associated risk tolerance. Because recovery time is now a significant component of downtime, downtime can no longer be equated with simple power availability, whether measured as one nine (90%) or six nines (99.9999%). Today, recovery time is typically many times longer than the outage itself, since operations have become much more complex. Restoring an IT infrastructure backbone after a shutdown must be carried out in a specific sequence so that IT equipment can be brought back online quickly with limited communication conflicts. Simply turning IT equipment back on does not work with today's complex IT systems. Is a 32‐second outage really only 32 seconds? Or is it 2 hours, or 2 days? The real question is: How long does it take to fully recover from the 32‐second outage and return to normal operational status? Although measuring in terms of nines has its limitations, it remains a useful metric. Table 3.1 shows the annual downtime that each level of availability allows for a 24/7 facility.
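The point about recovery time can be made concrete by counting recovery as downtime. The minimal Python sketch below assumes the facility is fully unavailable both during an outage and during the recovery period that follows. The function name and parameters are illustrative, not taken from the text.

```python
# Effective availability when recovery time is counted as downtime.
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year


def effective_availability(outage_s: float, recovery_s: float,
                           events_per_year: int = 1) -> float:
    """Return availability in percent.

    Assumption: operations are down for the outage itself AND for the
    recovery (restart sequencing, validation) that follows each outage.
    """
    downtime_s = events_per_year * (outage_s + recovery_s)
    return 100 * (1 - downtime_s / SECONDS_PER_YEAR)


if __name__ == "__main__":
    # One 32-second outage per year, with increasingly long recoveries.
    for label, recovery in (("no recovery", 0),
                            ("2-hour recovery", 2 * 3600),
                            ("2-day recovery", 2 * 86400)):
        print(f"{label}: {effective_availability(32, recovery):.4f}%")
```

Under these assumptions, a single 32-second outage on its own stays at roughly six nines (about 99.9999%). Adding a 2-hour recovery drops the facility to about 99.977%, below four nines, and a 2-day recovery drops it to about 99.45%, barely two nines.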
In new 24/7 facilities, it is imperative not only to design and integrate the most reliable systems but also to keep them simple. When there is a problem, the facilities manager is under enormous pressure to isolate the faulty system without disrupting any critical electrical loads, and does not have the luxury of time for complex switching procedures during a critical event. An overly complex system can quickly lead to failure through human error if the key personnel who understand its functionality are unavailable. When designing a critical facility, it is important that the building design not outsmart the facilities manager. Companies can also maximize profits and minimize costs by using the simplest possible design approach, or by integrating automatic recovery (“self‐healing”) controls that recover from a failure without operator intervention. One prevalent example is the use of static transfer switches (STSs), discussed in a later chapter. An STS automatically switches critical equipment from one power source to another within milliseconds.
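The source-selection decision an STS makes can be sketched as a small function. This is a simplified illustration, not a description of any particular product. The 480 V nominal voltage and the ±10% tolerance window are hypothetical values. Real units also apply synchronization checks and retransfer time delays that are omitted here.

```python
# Simplified static transfer switch (STS) source-selection logic.
# NOMINAL_V and TOLERANCE are hypothetical values for illustration only.
NOMINAL_V = 480.0
TOLERANCE = 0.10  # accept voltages within +/-10% of nominal


def in_tolerance(voltage: float) -> bool:
    """True if a source's voltage is inside the acceptance window."""
    return abs(voltage - NOMINAL_V) <= TOLERANCE * NOMINAL_V


def select_source(current: str, preferred_v: float, alternate_v: float) -> str:
    """Choose 'preferred' or 'alternate' for the critical load.

    - Use the preferred source whenever it is healthy (immediate retransfer;
      real units usually add a time delay before retransferring).
    - Otherwise transfer to the alternate source if it is healthy.
    - If neither source is healthy, stay on the current source rather than
      transfer to another bad one.
    """
    if in_tolerance(preferred_v):
        return "preferred"
    if in_tolerance(alternate_v):
        return "alternate"
    return current


if __name__ == "__main__":
    # Preferred source collapses while the alternate stays healthy.
    print(select_source("preferred", preferred_v=0.0, alternate_v=478.0))  # alternate
```

Keeping the decision this small reflects the design point above: the automatic logic handles the fast transfer, so the facilities manager is not performing switching procedures under pressure.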
In older buildings, facility engineers and senior management need to evaluate the cost of operating with obsolete electrical distribution systems and the associated risk of an outage. Where the potential for losses is high, senior management can justify substantial capital expenditure to upgrade the electrical distribution system on financial grounds. The cost of downtime across a spectrum of industries has exploded in recent years, as businesses have become completely computer‐dependent and systems have become increasingly complex (Table 3.2).
Table 3.2 The Cost of Downtime
(Source: Data from Information Technology Intelligence Consulting).
Industry | Average Cost per Hour in 2017 |
---|---|
Energy | $22,321,000 |
Brokerage | $9,300,000 |
Media | $9,000,000 |
Manufacturing | $8,500,000 |
Health Care | $6,900,000 |
Retail | $6,600,000 |
Telecommunications | $4,800,000 |
Credit Card Operations | $3,100,000 |
Human Life | “Priceless” |
* Prepared by a disaster‐planning consultant of Contingency Planning Research
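Combining Table 3.2 with Table 3.1 gives a rough way to weigh an upgrade against the downtime it avoids. The minimal Python sketch below assumes that downtime cost scales linearly with outage hours and ignores recovery time. The upgrade cost and availability levels in the usage example are hypothetical.

```python
# Rough expected-loss and payback estimate using Table 3.2 hourly costs.
# Assumption: downtime cost scales linearly with hours of downtime.
HOURS_PER_YEAR = 365 * 24

# Average cost per hour of downtime in 2017 (Table 3.2), in US dollars.
# "Human Life" ("Priceless") is deliberately omitted.
COST_PER_HOUR = {
    "Energy": 22_321_000,
    "Brokerage": 9_300_000,
    "Media": 9_000_000,
    "Manufacturing": 8_500_000,
    "Health Care": 6_900_000,
    "Retail": 6_600_000,
    "Telecommunications": 4_800_000,
    "Credit Card Operations": 3_100_000,
}


def expected_annual_loss(industry: str, availability_percent: float) -> float:
    """Expected yearly downtime cost at a given availability level."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return COST_PER_HOUR[industry] * downtime_hours


def payback_years(upgrade_cost: float, industry: str,
                  current_percent: float, upgraded_percent: float) -> float:
    """Years for avoided downtime cost to repay a one-time upgrade."""
    savings = (expected_annual_loss(industry, current_percent)
               - expected_annual_loss(industry, upgraded_percent))
    if savings <= 0:
        raise ValueError("upgrade does not reduce expected downtime")
    return upgrade_cost / savings


if __name__ == "__main__":
    # Hypothetical: a $5 million upgrade taking a brokerage from 99.9% to 99.99%.
    print(f"{payback_years(5_000_000, 'Brokerage', 99.9, 99.99):.3f} years")
```

Under these assumptions the brokerage example avoids roughly $73 million a year in downtime cost, so the hypothetical upgrade pays for itself in well under a month.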
Imagine that you are the manager responsible for a major data center that approves checks and other online electronic transactions for American Express, MasterCard, and Visa. On the biggest shopping day of the year, the day after Thanksgiving, you find out that the data center has lost its utility service. Your first reaction is that the data center has a UPS and a standby generator, so there is no problem, right? However, the standby generator is not starting due to a fuel problem, and the data center will shut down in 15 minutes, the amount of time the UPS system batteries can supply power at full load. The penalty for not being proactive is lost revenue, the potential loss of major clients, and, if the problem is large enough, the risk of financial collapse. You, the manager, could have avoided this nightmare scenario by exercising the standby generator every week for 30 minutes: the proverbial ounce of prevention.
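The weekly 30-minute exercise recommended above is easy to track automatically. The minimal Python sketch below checks a generator exercise log. The log format (a list of start and end times of completed runs) and the function name are hypothetical; only the weekly interval and the 30-minute duration come from the text.

```python
# Flag a standby generator whose weekly 30-minute exercise is overdue.
# The (start, end) log format is a hypothetical assumption.
from datetime import datetime, timedelta

MAX_INTERVAL = timedelta(days=7)   # exercise every week
MIN_RUN = timedelta(minutes=30)    # each run lasts at least 30 minutes


def exercise_overdue(runs: list[tuple[datetime, datetime]], now: datetime) -> bool:
    """True if no qualifying (>= 30 min) run has started within the last week."""
    qualifying = [start for start, end in runs if end - start >= MIN_RUN]
    if not qualifying:
        return True  # never exercised long enough to count
    return now - max(qualifying) > MAX_INTERVAL


if __name__ == "__main__":
    log = [
        (datetime(2024, 11, 18, 10, 0), datetime(2024, 11, 18, 10, 35)),
        (datetime(2024, 11, 25, 10, 0), datetime(2024, 11, 25, 10, 12)),  # cut short
    ]
    print(exercise_overdue(log, now=datetime(2024, 11, 29, 8, 0)))  # True
```

Here the November 25 run was cut short after 12 minutes and does not count, so the last qualifying run is more than a week old and the check reports the generator as overdue.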
There are about ten times as many UPS systems in use today as there were 10 years ago, and many more companies are still discovering their worth after losing data during a power line disturbance. Do you want electrical outages to be scheduled or unscheduled? Serious facilities engineers use comprehensive preventive maintenance procedures to avoid being caught off‐guard.
Many companies do not consider installing backup equipment until after an incident has already occurred. During the months following the U.S. Northeast Blackout of 2003, the industry experienced a boom in the installation of UPS systems and standby generators. Small and large businesses alike learned how susceptible they are to power disturbances and the associated costs of not being prepared. Some businesses that are not typically considered mission critical learned that they could not afford to be unprotected during a power outage. For example, the Blackout of 2003 destroyed $250 million of perishable food in New York City alone.1 Businesses everywhere, and of every type, are reassessing their level of risk tolerance and cost of downtime.
3.3 Identifying the Appropriate Redundancy in a Mission Critical Facility
Mission critical facilities must never be vulnerable to an outage, including during maintenance of their subsystems. Therefore, redundancy must be carefully evaluated and implemented in the systems design. Common redundancy configurations are classified as N+1 and N+2, and they are normally applied to the following systems:
Utility service
Power distribution
UPS
Emergency generator
Fuel system supplying emergency generator
Mechanical
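The benefit of N+1 and N+2 configurations can be estimated with a simple probability model. The minimal Python sketch below assumes identical units that fail independently, which is optimistic: common-cause failures, such as the fuel problem in the generator scenario described earlier, violate it. The 99% unit availability in the usage example is hypothetical.

```python
# Availability of an N+k redundant system of identical, independent units.
from math import comb


def redundant_availability(n_required: int, spares: int,
                           unit_availability: float) -> float:
    """Probability that at least n_required of n_required + spares units are up.

    Assumes every unit has the same availability and fails independently.
    """
    total = n_required + spares
    a = unit_availability
    return sum(comb(total, k) * a**k * (1 - a) ** (total - k)
               for k in range(n_required, total + 1))


if __name__ == "__main__":
    # Hypothetical: a load needing 2 UPS modules, each 99% available.
    for spares, label in ((0, "N"), (1, "N+1"), (2, "N+2")):
        print(f"{label}: {redundant_availability(2, spares, 0.99):.8f}")
```

With these hypothetical numbers, N gives about 98.01%, N+1 about 99.97%, and N+2 about 99.9996%. N+2 also matters for maintenance: with one unit out of service, an N+1 system is reduced to N with no remaining redundancy, while an N+2 system still has a spare.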