Title: Maintaining Mission Critical Systems in a 24/7 Environment
Author: Peter M. Curtis
Publisher: John Wiley & Sons Limited
Genre: Physics
ISBN: 9781119506140
Today, data centers are pushed to the limit. Servers are crammed into racks, and their high‐performance processors add up to outrageous power consumption. In the early 2000s, data center power consumption increased by about 25% each year, according to Hewlett Packard. In my view, the load in data centers has since leveled off. Facilities were at one time designed for 200 or 300 W/sq. ft., but they never came close to that rating; 100 to 200 W/sq. ft. is more common today and offers sufficient power and cooling for these loads. At the same time, processor performance has gone up 500%, and as equipment footprints shrink, the freed floor area is populated with more hardware. However, because the smaller equipment still rejects the same amount of heat, cooling densities are growing dramatically, and cooling is rapidly consuming more floor space. Design density, traditionally expressed in watts per square foot, has continued to grow, and performance can also be expressed as transactions per watt. All this increased processing power generates heat, and if the data center gets too hot, all applications grind to a halt.
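To make the watts‐per‐square‐foot figures concrete, the following minimal sketch converts a design density into a total heat load. It assumes that essentially all electrical power delivered to the IT equipment ends up as heat, and it uses the standard conversions of about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of refrigeration. The function name and the example floor area are illustrative assumptions, not values from the text.

```python
# Illustrative sketch: convert a design power density into a cooling load.
# Assumption: all IT electrical power is rejected as heat into the room.

BTU_PER_HR_PER_WATT = 3.412   # 1 W is approximately 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000   # 1 ton of refrigeration = 12,000 BTU/hr


def cooling_load(area_sq_ft: float, watts_per_sq_ft: float) -> dict:
    """Return the heat load for a raised-floor area at a given design density."""
    it_watts = area_sq_ft * watts_per_sq_ft
    btu_per_hr = it_watts * BTU_PER_HR_PER_WATT
    return {
        "kw": it_watts / 1000.0,
        "btu_per_hr": btu_per_hr,
        "tons": btu_per_hr / BTU_PER_HR_PER_TON,
    }


if __name__ == "__main__":
    # Hypothetical 10,000 sq. ft. room at three design densities.
    for density in (100, 150, 200):
        load = cooling_load(10_000, density)
        print(f"{density} W/sq. ft.: {load['kw']:.0f} kW, {load['tons']:.1f} tons")
```

For a hypothetical 10,000 sq. ft. room at 150 W/sq. ft., this works out to 1,500 kW of heat, or roughly 426 tons of cooling before any allowance for lighting, people, or redundancy.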
Electrical power is comparatively easy to design: each cabinet is served by an A and a B UPS power source sized to support the IT load within the cabinet. Note that the data center industry now predominantly uses dual‐corded equipment fed by 208V power, which is more efficient than the old 120V single‐phase standard. These cabinet feeds are typically 20A, 208V single‐phase circuits, or 30A or 60A, 208V three‐phase circuits for large IT installations. Cooling, on the other hand, is more complex to design and maintain, since the cold air supply cannot easily be adjusted at each cabinet to match its heat output and the requirements of its data processing boards. Underfloor air distribution with a 36‐ or 48‐inch raised floor may be required for high heat densities. Other data center operators have been successful with equipment cabinets placed on the floor slab, with an overhead air supply to the “cold aisles” and hot air returning to the computer room air handler (CRAH) units from the hot aisles. Another method to increase heat removal and efficiency is to install cold or hot aisle containment. The cold (or hot) aisle is closed off at each end with doors, and a barrier is extended up to the ceiling. For cold aisle containment, this ensures that the maximum amount of cold air is supplied to the IT equipment intakes. For hot aisle containment, the hot aisle is enclosed so that exhausted cabinet heat is removed quickly and does not short‐cycle back to the cold aisle.
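The cabinet feed sizes mentioned above can be compared with a short calculation. This is a rough sketch, not a design method: it uses the standard apparent‐power formulas (V × I for single phase, √3 × V × I for three phase with line‐to‐line voltage). The 80% continuous‐load derating and the unity power factor are assumptions introduced here for illustration, and the function names are hypothetical.

```python
import math

# Illustrative sketch: capacity of typical cabinet feed circuits.
# Assumptions (not from the text): 80% continuous-load derating and
# a power factor of 1.0 for the IT load.


def circuit_kva(volts: float, amps: float, phases: int = 1) -> float:
    """Apparent power of a circuit in kVA; volts is line-to-line for 3-phase."""
    if phases == 1:
        return volts * amps / 1000.0
    if phases == 3:
        return math.sqrt(3) * volts * amps / 1000.0
    raise ValueError("phases must be 1 or 3")


def usable_kw(volts: float, amps: float, phases: int = 1,
              derate: float = 0.8, power_factor: float = 1.0) -> float:
    """Continuous real power available from one feed after derating."""
    return circuit_kva(volts, amps, phases) * derate * power_factor


if __name__ == "__main__":
    feeds = [(208, 20, 1), (208, 30, 3), (208, 60, 3)]
    for volts, amps, phases in feeds:
        print(f"{amps}A {volts}V {phases}-phase: "
              f"{circuit_kva(volts, amps, phases):.2f} kVA, "
              f"{usable_kw(volts, amps, phases):.2f} kW usable")
```

With A and B feeds, a common approach is to keep the total cabinet load within the usable capacity of a single feed, so that either source can carry the cabinet alone if the other is lost.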
Many data center designers (and their clients) would like to build for a 20‐year life cycle, yet the reality is that most cannot realistically look beyond 2 to 5 years. As companies push to wring more data‐crunching ability from the same real estate, the lynchpin technology of future data centers will not necessarily involve greater processing power or more servers, but improved heat dissipation and better airflow management.
To combat high temperatures and maintain the current trend toward more powerful processors, engineers are reintroducing old technology: liquid cooling, which was used to cool mainframe computers decades ago. To successfully reintroduce liquid into computer rooms, standards will need to be developed. This is another arena where standardization can promote reliable solutions that mitigate risk for the industry.
The large footprint now required for reliable power without planned downtime also affects the planning and maintenance of data center facilities. Over the past two decades, the cost of the facility relative to the computer hardware it houses has not grown proportionately. Budget priorities that favor computer hardware over facilities improvement can lead to insufficient performance. The best way to ensure a balanced allocation of capital is to prepare a business analysis that shows the costs associated with the risk of downtime.
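One simple way to frame such a business analysis is to estimate the expected annual cost of downtime before and after a facilities investment. The sketch below is only an illustration: the function names, the event frequencies, the outage durations, and the hourly cost are all hypothetical placeholders to be replaced with an organization's own figures.

```python
import math

# Illustrative sketch: expected annual downtime cost and simple payback.
# All input values used below are hypothetical placeholders.


def expected_annual_downtime_cost(events_per_year: float,
                                  hours_per_event: float,
                                  cost_per_hour: float) -> float:
    """Expected yearly loss = outage frequency x duration x hourly cost."""
    return events_per_year * hours_per_event * cost_per_hour


def simple_payback_years(upgrade_cost: float,
                         cost_before: float,
                         cost_after: float) -> float:
    """Years for avoided downtime cost to repay the facilities investment."""
    avoided = cost_before - cost_after
    if avoided <= 0:
        return math.inf  # the upgrade never pays back on downtime alone
    return upgrade_cost / avoided


if __name__ == "__main__":
    before = expected_annual_downtime_cost(2.0, 4.0, 100_000)
    after = expected_annual_downtime_cost(0.5, 4.0, 100_000)
    years = simple_payback_years(1_200_000, before, after)
    print(f"Before: ${before:,.0f}/yr, after: ${after:,.0f}/yr, "
          f"payback: {years:.1f} years")
```

A fuller analysis would also account for reputational damage, regulatory exposure, and the time value of money, none of which this sketch attempts to model.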
3.9 Human Factors and the Commissioning Process
There is no such thing as plug and play when critical infrastructure is deployed, or existing systems are overhauled to support a company's changing business mission. Reliability is not guaranteed simply by installing new equipment or even building an entirely new data center. An aggressive and rigorous design, failure mode analysis, testing/commissioning process, and operations plan proportional to the facility's criticality level are a necessity and not an option.
Of particular importance are the actual commissioning process and the development of a detailed operations plan. More budget dollars should be allocated to testing/commissioning, documentation, education/training, and operations/maintenance because more than 50 percent of data center downtime can be traced to human error. Due to the facility’s 24/7 mission critical status, commissioning will be the construction team’s sole opportunity to integrate and commission all of the systems. By this point in the project, a competent, independent test engineer familiar with the equipment should already have witnessed testing of all installed systems at the factory.
Commissioning is a systematic process of ensuring, through documented verification, that all building systems perform according to the design intent and to the future owner's operational needs. The goal is to provide the owner with a safe and reliable installation. A commissioning agent who serves as the owner's representative usually manages the commissioning process. The commissioning agent's role is to facilitate a highly interactive process of verifying that the project is installed correctly and operating as designed. This is achieved through coordination with the owner, design team, construction team, equipment vendors, and third‐party commissioning provider during the various phases of the project. ASHRAE’s Commissioning Guideline 0‐2005 is a recognized model and a good resource that explains this process in detail, and it can be applied to critical systems.
Prior to installation at the site, all major equipment should undergo factory acceptance testing that is witnessed by an independent test engineer familiar with the equipment and the testing procedures. However, relying on the factory acceptance test alone is not sufficient. Once the equipment has been delivered, set in place, and wired, and functional testing has been completed, integrated system testing begins. The integrated system test verifies and certifies that all components work together as a fully integrated system. This is the time to resolve all potential equipment problems. There is no “one size fits all” formula.
Before a new data center or renovation within an existing building goes on‐line, it is crucial to ensure that the systems are burned‐in and failure scenarios are tested no matter the schedule, milestones, and pressures. You won't have a chance to do this phase over, so get it right the first time. A tremendous amount of coordination is required to fine‐tune and calibrate each component. For example, critical circuit breakers must be tested and calibrated prior to exposing them to any critical electrical load. After all tests are complete, results must be compiled for all equipment and the certified test reports prepared, establishing a benchmark for all future testing.
Scheduling time to educate staff during systems integration testing is not considered part of the commissioning process but is extremely important in order to reduce human error.
This activity can be considered part of the transition‐to‐operations process. Hands‐on training is invaluable because it improves situational awareness and operator confidence, which in turn reduces human error. The training can also break through misplaced confidence: sometimes we deem ourselves ready for a task when we really are not. Handing off the new infrastructure or facility to fully trained and prepared operations teams improves success and uptime throughout the facility lifecycle. When you couple the right design with the right operations plan (training programs, documentation/IMOPs, and preventative maintenance), the entire organization will be much better prepared to manage critical events and the unexpected. If proper training and preparation are not done during the commissioning stage, building engineers will not become familiar with the various processes and procedures, and learning on the job increases operational risk. Knowing this up front, there is absolutely no reason not to commit the necessary budget for training, technical maintenance programs, accurate documentation/storage, and finally, some type of credible certification that is continually revisited. If this is done correctly, potential mishaps and near misses will be avoided, and the reduced risk will be like an annuity paying “Reliability/Availability” dividends.