Title: Maintaining Mission Critical Systems in a 24/7 Environment
Author: Peter M. Curtis
Publisher: John Wiley & Sons Limited
Genre: Physics
ISBN: 9781119506140
1.10 Employee Certification
Empowering employees to function effectively and efficiently can be achieved through a well‐planned certification program. Employees have a vested interest in working with management to reduce risk. Empowering employees to take charge in times of crisis creates valuable communication allies who not only reinforce core messages internally but also carry them into daily operations. Internal crisis communication should be conducted through established communication channels and venues, in addition to any that have been developed to manage specific crisis scenarios. Whichever method of internal crisis communication a company chooses, the more upfront management is about what is happening, the better informed and more confident employees feel.
In this way, an operation or task can be secured by requiring that an employee be certified to perform it. Certification terms should be defined by industry best practices. Furthermore, the company’s risk profile should include training and periodic recertification. Should an employee's evaluations fall below standard over a period of time, the system could recommend de‐certification.
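The certification gate and the de‐certification recommendation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `Certification`, `may_perform`, and `recommend_decertification`, the passing score of 80, the three‐evaluation window, and the sample task are all assumptions made for the example, not details of any particular system.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Certification:
    """Hypothetical record of one employee's certification for one task."""
    task: str
    expires: date
    scores: list[float] = field(default_factory=list)  # evaluation scores, oldest first


def may_perform(cert: Certification, task: str, today: date) -> bool:
    """Allow the task only if the certification covers it and has not expired."""
    return cert.task == task and today <= cert.expires


def recommend_decertification(cert: Certification,
                              passing_score: float = 80.0,
                              window: int = 3) -> bool:
    """Recommend de-certification when the last `window` evaluations
    are all below the passing score (both values are assumed here)."""
    recent = cert.scores[-window:]
    return len(recent) == window and all(s < passing_score for s in recent)


# Hypothetical example data.
cert = Certification(task="UPS bypass", expires=date(2030, 1, 1),
                     scores=[91, 78, 75, 70])
print(may_perform(cert, "UPS bypass", date(2029, 6, 1)))
print(recommend_decertification(cert))
```

Requiring several consecutive low evaluations, rather than one, reflects the "over a period of time" condition; the actual window and threshold would come from the company's risk profile.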
Technology is advancing faster than ever. Large investments are made in new technologies to keep up to date with advancements, yet industries are still faced with operational challenges. One possible reason is the limited training provided to employees operating the mission critical equipment. Employee certification is crucial not only to keep up with advanced technology but also to promote quick emergency response and situational awareness. In the last few years, technologies have been developed to solve the technical problem of linkage and interaction of equipment, but without well‐trained personnel, how can we confirm that employees meet the complex requirements of the facility and ensure high levels of reliability?
1.11 Standards and Benchmarking
The past decade has seen wrenching change for many organizations. As firms and institutions have looked for ways to survive and remain profitable, a simple but powerful change strategy called “benchmarking” has become popular. The underlying rationale for the benchmarking process is that learning by example and from best‐practice cases is the most effective means of understanding the principles and the specifics of effective practices. Recovery and redundancy together cannot provide sufficient resiliency if they can be disrupted by a single unpredictable event. A mission critical data center must be able to endure natural hazards, such as earthquakes, tornadoes, and floods, as well as human‐made events. Great care should be taken to protect critical functions so that downtime is minimized. Standards should be established with guidelines and mandatory requirements for continuity of business applications. Procedures should be developed for the systematic sharing of safety‐ and performance‐related material, best practices, and standards.
The key is to benchmark the facility on a routine basis with the goal of identifying performance deviations from the original design specifications. Done properly, this provides an early warning mechanism that allows a potential failure to be addressed and corrected before it occurs. Once deficiencies are identified, and before any corrective action can be taken, a Method of Operation (MOP) must be written. The MOP will clearly stipulate step‐by‐step procedures and conditions, including who is to be present, the documentation required, the phasing of work, and the state in which the system is to be placed after the work is completed. The MOP will greatly reduce errors and potential system downtime by identifying the responsibilities of vendors, contractors, the owner, the testing entity, and anyone else involved. In addition, a program of ongoing operational staff training and procedures is important for dealing with emergencies outside of the regular maintenance program.
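As a rough illustration of routine benchmarking against design specifications, the sketch below flags any measured value that drifts more than a chosen tolerance from its design value. The function name `find_deviations`, the 5 percent tolerance, and the metric names and readings are hypothetical, chosen only to show the idea.

```python
def find_deviations(design, measured, tolerance=0.05):
    """Return {metric: relative deviation} for readings that differ from
    the design value by more than `tolerance` (a fraction, assumed 5%)."""
    flagged = {}
    for metric, spec in design.items():
        if metric not in measured or spec == 0:
            continue  # no reading, or relative deviation is undefined
        deviation = (measured[metric] - spec) / spec
        if abs(deviation) > tolerance:
            flagged[metric] = deviation
    return flagged


# Hypothetical design specifications and current readings.
design = {"ups_efficiency": 0.96, "supply_air_temp_c": 20.0, "generator_start_s": 10.0}
measured = {"ups_efficiency": 0.95, "supply_air_temp_c": 23.0, "generator_start_s": 10.2}

for metric, dev in find_deviations(design, measured).items():
    print(f"{metric}: {dev:+.1%} from design")
```

In this example only the supply air temperature falls outside the tolerance band, so it is the one item that would prompt a deficiency review and, before any corrective work, a written MOP.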
The most important aspect of benchmarking is that it is a process driven by the participants whose goal is to improve their organization. It is a process through which participants learn about successful practices in other organizations and then draw on those cases to develop solutions most suitable for their own organizations. True process benchmarking identifies the “hows” and “whys” of performance gaps and helps organizations learn and understand how to perform to higher standards of practice. Keep in mind that you can’t improve if you don’t measure and benchmark.
1.12 What is a Mission Critical Engineer
What are some attributes of mission‐critical engineers? Mission‐critical engineers are never complacent; they are always organized and prepared, always creative, and always looking to improve. They are always observing their surroundings with all their senses, always looking for deficiencies, and always ready to take action. A mission‐critical engineer doesn't stop after the first try. Mission‐critical engineers understand the importance of their positions and how their employers impact the public. They entered this industry to contribute to society. They are ethical, share their knowledge, and strive to motivate others.
I've been a mission‐critical engineer for close to 30 years and am still puzzled by some things. We all know what an investment of $500 million buys. We invest this money because we think we are buying reliability and business resiliency. After this kind of investment, we are enamored with the infrastructure, and we feel confident that it will perform as designed when called upon.
Among the industries that have zero tolerance for error, the ones that stand out are aviation, rail, nuclear power plants, and, of course, NASA. You can call these industries “mission control” type industries, where error can lead to catastrophes, cascading failures, and loss of life, money, and reputation.
Are we falling short in fields that require this type of intolerance for error? As we are already aware, human error causes approximately 60 percent of all downtime experienced by mission‐critical facilities. This number is far too high. Today there is a growing number of data center infrastructure management (DCIM) tools that can help reduce downtime, but we are only beginning to scratch the surface in moving toward a significant reduction. We are still many years away from the goal of “zero downtime.” There have been many recent examples of human error that have caused fatalities:
The crash of Air France Flight 447, which killed 228 people and was attributed to a lack of pilot training in surprise situations.
The head‐on collision of a Metrolink train near Chatsworth, CA, which was probably caused by an engineer who was texting; 25 people were killed and 135 injured.
The actions of the Costa Concordia captain before and after the collision that led to the death of 32 passengers.
The crash of Colgan Flight 3407, operated under Continental Airlines, which killed 49 people in the suburbs of Buffalo.
Either character flaws or a lack of training played a role in each of these disasters. All could have been avoided if the right people had been in these positions.
Beyond these man‐made disasters, we have natural disasters that are even more difficult to cope with. In the wake of Superstorm Sandy, we are once again reminded of how vulnerable our country's infrastructure is and how large‐scale disasters and catastrophes can produce extended downtime.
Sandy left millions without power in the tri‐state area, causing untold chaos and the worst gasoline shortages since the 1970s. There are many ways to defend against these disruptions, from ensuring that refineries have appropriate standby power to deploying microgrids designed to support the critical infrastructure that is vital to sustaining how we live digitally today. How can we expeditiously improve?