A holistic approach to system reliability in Blue Gene
Abstract
Optimizing supercomputer performance requires balancing objectives for processor performance, network performance, power delivery and cooling, cost, and reliability. In particular, scaling a system to a large number of processors poses challenges for reliability, availability, and serviceability. Given the power and thermal constraints of data centers, the BlueGene/L supercomputer has been designed with a focus on maximizing floating-point operations per second per watt (FLOPS/Watt). This in turn yields a drastic improvement in FLOPS per m² of floor space and in FLOPS per dollar, allowing for affordable scale-up. The BlueGene/L system has been scaled to a total of 65,536 compute nodes in 64 racks. A system approach was used to minimize power at all levels, from the processor to the cooling plant. A BlueGene/L compute node consists of a single ASIC and associated memory. The ASIC integrates all system functions, including processors, the memory subsystem, and communication, thereby minimizing chip count, interfaces, and power dissipation. As the number of components increases, even a low per-component failure rate leads to an unacceptable system failure rate, so additional mechanisms must be deployed to achieve sufficient reliability at the system level. In particular, the data transfer volume in the communication networks of a massively parallel system places stringent demands on bit error rates and on recovery mechanisms in the communication links. Low power dissipation and high performance, along with reliability, availability, and serviceability, were prime considerations in the BlueGene/L hardware architecture, system design, and packaging. A high-performance software stack, consisting of operating system services, compilers, libraries, and middleware, completes the system while enhancing reliability and data integrity. © 2006 IEEE.
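The scaling argument can be made concrete with a minimal sketch. It assumes independent, identical nodes with constant (exponential) failure rates, where any single node failure interrupts the system. The node count of 65,536 comes from the text; the per-node MTBF of one million hours is a purely hypothetical figure chosen for illustration, and the function names are likewise illustrative rather than taken from the BlueGene/L design.

```python
import math

NODES = 65_536                         # compute-node count stated in the text
ASSUMED_NODE_MTBF_HOURS = 1_000_000.0  # hypothetical per-node MTBF, not a measured value


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """MTBF of a series system of independent, identical nodes.

    With constant failure rates, the node rates add, so the system MTBF
    is the per-node MTBF divided by the number of nodes.
    """
    return node_mtbf_hours / node_count


def prob_failure_free(node_mtbf_hours: float, node_count: int, run_hours: float) -> float:
    """Probability that no node fails during a run of the given length."""
    system_rate = node_count / node_mtbf_hours  # failures per hour, whole system
    return math.exp(-system_rate * run_hours)


if __name__ == "__main__":
    mtbf = system_mtbf_hours(ASSUMED_NODE_MTBF_HOURS, NODES)
    p_day = prob_failure_free(ASSUMED_NODE_MTBF_HOURS, NODES, 24.0)
    print(f"system MTBF: {mtbf:.1f} hours")
    print(f"probability of a failure-free 24-hour run: {p_day:.2f}")
```

Under these assumptions, a node that fails on average once per million hours still gives a system-level MTBF of only about 15 hours, which illustrates why per-component reliability alone is not sufficient at this scale.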