Towards high availability for high-performance computing system services: Accomplishments and limitations

C. Engelmann; C. Leangsuksun; S. L. Scott; X. He

Publication date: 1 January 2006

Abstract

During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University have focused on efficient redundancy strategies for the head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, such as job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which must be addressed before production-type deployment.
