Scalable, fault-tolerant job step management for high-performance systems
Abstract
Scientific applications on the CORAL systems demanded a fault-tolerant, scalable job launch infrastructure for complex workflows with multiple job steps within an allocation. The distinct design of IBM's Job Step Manager (JSM) infrastructure, working in concert with Load Sharing Facility (LSF) and Cluster System Management (CSM), achieves these goals. JSM demonstrated launching over three-quarters of a million processes in under a minute while providing efficient Process Management Interface for Exascale (PMIx)-based services, over the management network, to tools and to communication libraries such as the Parallel Active Messaging Interface (PAMI) and the Message Passing Interface (MPI). JSM relies on the parallel task support library to provide a fault-tolerant, scalable communication medium between the JSM daemons. Application workflows using job steps harness JSM's resource set abstraction to manage CPUs, GPUs, and memory between groups of processes that share a node, possibly in discrete job steps. The resource set concept lets JSM organize process placement to optimize, for example, CPU-to-GPU communication. Applications that need complete control over the shaping of the resource sets and the placement, binding, and ordering of processes within them can leverage JSM's co-designed Explicit Resource File mechanism. This article explores the design decisions, implementation considerations, and performance optimizations of IBM's JSM infrastructure to support scientific discovery on the CORAL systems.