Title: Maintaining Mission Critical Systems in a 24/7 Environment
Author: Peter M. Curtis
Publisher: John Wiley & Sons Limited
Genre: Physics
isbn: 9781119506140
The risks associated with cascading power supply interruptions from the public electrical grid in the United States have increased because of the ever‐increasing reliance on computers and related technologies. This has occurred while investments in the reliability and security of the grid have not kept pace with the levels recommended by industry experts. Today there are trillions of devices and billions of people connected to the World Wide Web. As the number of computers and related technologies continues to multiply in this increasingly digital world, the demand for reliability increases as well. Businesses not only compete in the marketplace to deliver their goods and services; they must now also compete to hire, from a dwindling pool of talent, the best engineers who can design the infrastructure needed to obtain and deliver reliable power and cooling. That infrastructure keeps mission critical manufacturing and technology centers up and running, able to produce the very goods and services that sustain them. The idea that businesses must compete for the best talent to obtain reliable power is not new, nor are the consequences of failing to meet this challenge. Without reliable power, there are no goods and services for sale, no revenues, and no profits ‐ only losses. Hiring and keeping the best‐trained engineers, who employ the best analyses, make the best strategic choices, and follow the best operational plans to keep ahead of the power supply curve, is essential for any technologically sophisticated business to thrive. A key to success is to provide proper training and educational resources to engineers so they can increase their knowledge and keep current on the latest mission critical technologies available around the world, which is one of the purposes of this content.
In addition, companies need to pool their efforts toward improving educational opportunities and certification programs for young mission critical engineers, to help address the shrinking workforce needed to sustain the growing mission critical industry.
It is also essential for critical industries to constantly and systematically evaluate their mission critical systems, assess and reassess their level of risk tolerance versus the cost of downtime, and plan for future upgrades in equipment and services that are designed to meet business needs and ensure uninterrupted power and cooling supplies in the years ahead. Simply put, minimizing unplanned downtime reduces risk. Unfortunately, the most common approach is reactive: spending time and resources to repair a failed piece of equipment after the fact, rather than identifying when the equipment is likely to fail and repairing or replacing it without interruption. If the utility goes down, install a generator. If a ground‐fault trips critical loads, redesign the distribution system. If a lightning strike burns out power supplies, install a new lightning protection system. Such measures certainly make sense, as they address real risks associated with the critical infrastructure; however, they are always performed after the harm has occurred. Often, such efforts proceed in haste, without enough consideration of how the short‐term fix fits into the larger picture of how the facility’s systems should operate in an integrated manner. This can introduce new vulnerabilities. Strategic planning, on the other hand, can identify internal risks and provide a prioritized plan for reliability improvements that addresses the root causes of failures before they occur.
In the world of high‐powered business, owners of real estate have come to learn that they, too, must meet their tenants' demands for a reliable power supply. As more and more buildings are required to deliver service guarantees, management must decide what performance is required from each facility in the building. Availability levels of 99.999% (about 5.26 minutes of downtime per year) allow virtually no facility downtime for maintenance or other planned or unplanned events. Moving toward high reliability is imperative. Moreover, the work of avoiding the landmines that can cause outages and unscheduled downtime never ends. Event planning and impact assessments are tasks that are never truly completed; they should be viewed afresh at least once every budget cycle.
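The relationship between an availability percentage and the downtime it allows is simple arithmetic, and a short sketch makes it concrete. The function name `annual_downtime_minutes` and the 365‐day year are assumptions made for illustration; they do not come from the text. For 99.999% the result is roughly 5.26 minutes per year, in line with the figure quoted above.

```python
# Minimal sketch: convert an availability target into allowed downtime per year.
# Assumption (not from the text): a 365-day year, i.e. 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for a few common availability targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {annual_downtime_minutes(target):.2f} min/year")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why five‐nines targets leave essentially no room for maintenance outages.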
The evolution of data center design and function has been driven, in part, by the need for uninterrupted power. Data centers now employ many unique designs developed specifically to achieve the goal of uninterrupted power within defined project constraints based on technological need, budget limitations, and the specific tasks each center must achieve to function usefully and efficiently. Providing continuous operation under all foreseeable risks of failure such as power outages, equipment breakdown, internal fires, and so on requires the use of modern design and modeling techniques to enhance reliability. These include redundant systems and components, standby power generation, fuel systems, automatic transfer and static switches, pure power quality, UPS systems, cooling systems, raised access floors, fire protection, as well as the use of Probabilistic Risk Analysis modeling software (each will be discussed in detail later) to predict potential future outages and develop maintenance and upgrade action plans for all major systems.
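One elementary building block behind reliability estimates for redundant systems is the standard series/parallel availability calculation. The sketch below is illustrative only: it assumes component failures are independent, the function names and availability figures are hypothetical, and it is not the Probabilistic Risk Analysis software mentioned above.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must work.

    Assumes independent failures: the individual availabilities multiply.
    """
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of a redundant group in which one working component suffices.

    Assumes independent failures: the group fails only if all components fail.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    single_ups = 0.999  # hypothetical availability of one UPS module
    redundant_ups = parallel_availability([single_ups, single_ups])
    # A hypothetical power path: utility feed, then the redundant UPS pair.
    path = series_availability([0.9995, redundant_ups])
    print(f"redundant UPS pair: {redundant_ups:.6f}")
    print(f"utility + UPS pair: {path:.6f}")
```

Under the independence assumption, duplicating a 99.9% component yields about 99.9999% for the pair. Real systems share failure causes (common switchgear, fuel supply, human error), so figures like this are best read as an upper bound.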
Also vital to the facility's life cycle is two‐way communication between upper management and facilities management. Only when both parties fully understand the three pillars of infrastructure reliability ‐ design, maintenance, and operation of critical environments (including the potential risk of downtime and recovery time) ‐ can they fund and implement an effective plan. Because the costs associated with reliability enhancements are significant, sound decisions can only be made by quantifying performance benefits against downtime cost estimates for each upgrade option to determine the best course of action. Planning and careful implementation will minimize disruptions while making the business case to fund necessary capital improvements and implement comprehensive maintenance strategies. When the business case for additional redundancy, specialized consultants, documentation, and ongoing training reaches the boardroom, the entire organization can be galvanized to prevent catastrophic data losses, damage to capital equipment, and danger to life and limb.
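Weighing performance benefits against downtime cost for each upgrade option can be sketched as a simple comparison. Everything in this example is a hypothetical assumption: the `UpgradeOption` fields, the option names, and the idea of ranking by net annual benefit are one possible way to frame the decision, not a method prescribed by the text.

```python
from dataclasses import dataclass


@dataclass
class UpgradeOption:
    name: str
    annualized_cost: float         # hypothetical: currency units per year
    downtime_hours_avoided: float  # hypothetical estimate: hours per year


def net_annual_benefit(option: UpgradeOption, downtime_cost_per_hour: float) -> float:
    """Value of the downtime avoided per year minus the yearly cost of the upgrade."""
    return option.downtime_hours_avoided * downtime_cost_per_hour - option.annualized_cost


def rank_options(options, downtime_cost_per_hour: float):
    """Return the options sorted from highest to lowest net annual benefit."""
    return sorted(
        options,
        key=lambda o: net_annual_benefit(o, downtime_cost_per_hour),
        reverse=True,
    )


if __name__ == "__main__":
    # All figures below are made up purely for illustration.
    candidates = [
        UpgradeOption("static transfer switch", 20_000.0, 0.5),
        UpgradeOption("second generator", 50_000.0, 2.0),
    ]
    for opt in rank_options(candidates, downtime_cost_per_hour=100_000.0):
        print(opt.name, net_annual_benefit(opt, 100_000.0))
```

The ranking is only as good as the downtime cost and hours‐avoided estimates behind it, which is why those estimates deserve the scrutiny of both facilities management and upper management.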
1.2 Risk Assessment
Critical industries require an extraordinary degree of planning and assessment. It is important to identify the best strategies for reaching the targeted level of reliability. To design a critical building with the appropriate level of reliability, the cost of downtime and the associated risks need to be assessed. It is important to understand that downtime results from more than one type of failure: design failures, catastrophic failures, equipment failures, or failures due to human error. Each type of failure requires a different approach to prevention. A solid and realistic approach to business resiliency must be a priority, especially because the present critical infrastructure is inevitably designed with all its eggs in one basket.
Within banking and financial services, planning the critical area places considerable pressure on designing an infrastructure that evolves to support continuous business growth. Routine maintenance and equipment upgrades alone do not ensure continuous availability. The 24/7 operation of such services means an absence of scheduled interruptions for any reason, including routine maintenance, modifications, and upgrades. The main question is how and why infrastructure failures occur. Employing new methods of distributing critical power, understanding capital constraints, and developing processes that minimize human error are some key factors in improving recovery time in the event critical systems are impacted by base‐building failures.
Infrastructure reliability can be enhanced by conducting a formal Risk Management Assessment (RMA) and gap analysis, and by following the guidelines of the Critical Area Program (CAP). The RMA and the CAP are used in other industries and