Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds
With growing interest among mainstream users in running HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to acquiring and maintaining expensive hardware, which is often beyond the financial capabilities of such users. One of the critical needs of HPC applications is efficient, scalable, and persistent storage. Unfortunately, the storage options offered by cloud providers are not standardized and typically use a different access model. In this context, the local disks on the compute nodes can be used to save large data sets, such as the data generated by checkpoint-restart (CR). This local storage offers high throughput and scalability, but it needs to be combined with persistency techniques such as block replication or erasure codes. One of the main challenges such techniques face is to minimize the overhead in performance and I/O resource utilization (i.e., storage space and bandwidth) while guaranteeing high reliability of the saved data. This paper introduces a novel persistency technique that leverages Reed-Solomon (RS) encoding to save data in a reliable fashion. Compared to traditional approaches that rely on block replication, we demonstrate about 50% higher throughput while reducing network bandwidth and storage utilization by a factor of 2 for the same targeted reliability level. These results are established both through modeling and through real-life experimentation on hundreds of nodes.
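The factor-of-2 reduction in storage and bandwidth follows directly from the arithmetic of erasure coding versus replication. A minimal sketch, assuming illustrative parameters of 3-way replication versus an RS(4, 2) layout (the abstract does not state the exact coding parameters used):

```python
# Storage overhead of n-way replication vs. Reed-Solomon RS(k, m).
# k=4 data blocks and m=2 parity blocks are assumed, illustrative values;
# both schemes below tolerate the loss of any 2 stored pieces.

def replication_overhead(copies: int) -> float:
    """Total stored bytes per original byte under n-way replication."""
    return float(copies)

def rs_overhead(k: int, m: int) -> float:
    """Total stored bytes per original byte under RS(k, m):
    k data blocks plus m parity blocks; any k of the k+m pieces
    suffice to recover the data."""
    return (k + m) / k

rep = replication_overhead(3)   # 3.0x storage, survives 2 lost copies
rs = rs_overhead(4, 2)          # 1.5x storage, survives any 2 lost blocks
print(rep / rs)                 # prints 2.0: replication costs twice as much
```

The same ratio applies to network bandwidth, since each stored byte must also be transferred once, which matches the factor-of-2 reduction reported above.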
Designing SSI clusters with hierarchical checkpointing and single I/O space
Adopting a new hierarchical checkpointing architecture, the authors develop a single I/O address space for building highly available clusters of computers. They propose a systematic approach to achieving a single system image by integrating existing middleware support with the newly developed features.
Scalable Techniques for Fault Tolerant High Performance Computing
As the number of processors in today's parallel systems continues to grow, the mean time to failure of these systems is becoming significantly shorter than the execution time of many parallel applications. It is increasingly important for large parallel applications to be able to continue executing in spite of the failure of some components in the system. Today's long-running scientific applications typically tolerate failures by checkpoint/restart, in which all process states of an application are saved to stable storage periodically. However, as the number of processors in a system increases, the amount of data that needs to be saved to stable storage increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems.
In this research, we explore scalable techniques to tolerate a small number of process failures in large-scale parallel computing. The goal of this research is to develop scalable fault tolerance techniques that help make future high performance computing applications self-adaptive and fault survivable. The fundamental challenge in this research is scalability. To approach this challenge, this research (1) extended existing diskless checkpointing techniques to enable them to better scale in large-scale high performance computing systems; (2) designed checkpoint-free fault tolerance techniques for linear algebra computations to survive process failures without checkpoint or rollback recovery; (3) developed coding approaches and novel erasure correcting codes to help applications survive multiple simultaneous process failures. The fault tolerance schemes we introduce in this dissertation are scalable in the sense that the overhead to tolerate a failure of a fixed number of processes does not increase as the total number of processes in a parallel system increases.
Two prototype examples have been developed to demonstrate the effectiveness of our techniques. In the first example, we developed a fault survivable conjugate gradient solver that is able to survive multiple simultaneous process failures with negligible overhead. In the second example, we incorporated our checkpoint-free fault tolerance technique into the ScaLAPACK/PBLAS matrix-matrix multiplication code to evaluate the overhead, survivability, and scalability. Theoretical analysis indicates that, to survive a fixed number of process failures, the fault tolerance overhead (without recovery) for matrix-matrix multiplication decreases to zero as the total number of processes (assuming a fixed amount of data per process) increases to infinity. Experimental results demonstrate that the checkpoint-free fault tolerance technique introduces surprisingly low overhead even when the total number of processes used in the application is small.
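The core idea behind diskless checkpointing with erasure correcting codes can be sketched in a few lines. A minimal example, assuming a simple XOR parity code (the dissertation's actual codes are more general, tolerating multiple simultaneous failures): each process keeps its state in memory, and one extra parity checkpoint holds the XOR of all states, so any single lost state can be rebuilt from the survivors.

```python
import numpy as np

# Diskless checkpointing sketch with XOR parity (illustrative, not the
# dissertation's code): 4 "processes" each hold an in-memory state;
# a parity checkpoint is the bitwise XOR of all of them.
rng = np.random.default_rng(0)
states = [rng.integers(0, 256, size=8, dtype=np.uint8) for _ in range(4)]

# Encoding: parity = s0 XOR s1 XOR s2 XOR s3.
parity = np.bitwise_xor.reduce(states)

# Simulate the failure of process 2, then recover its state by XOR-ing
# the parity checkpoint with the surviving states.
lost = 2
recovered = parity.copy()
for i, s in enumerate(states):
    if i != lost:
        recovered ^= s

assert np.array_equal(recovered, states[lost])  # state rebuilt exactly
```

Tolerating k simultaneous failures requires a stronger code (e.g., Reed-Solomon over a Galois field) in place of plain XOR, which is the direction the dissertation's coding approaches take.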
Parallel software for lattice N=4 supersymmetric Yang--Mills theory
We present new parallel software, SUSY LATTICE, for lattice studies of four-dimensional supersymmetric Yang--Mills theory with gauge group SU(N). The lattice action is constructed to exactly preserve a single supersymmetry charge at non-zero lattice spacing, up to additional potential terms included to stabilize numerical simulations. The software evolved from the MILC code for lattice QCD, and retains a similar large-scale framework despite the different target theory. Many routines are adapted from an existing serial code, which SUSY LATTICE supersedes. This paper provides an overview of the new parallel software, summarizing the lattice system, describing the applications that are currently provided, and explaining their basic workflow for non-experts in lattice gauge theory. We discuss the parallel performance of the code, and highlight some notable aspects of the documentation for those interested in contributing to its future development.

Comment: Code available at https://github.com/daschaich/sus
Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining
Extensive system scales (i.e., thousands of GPUs/TPUs) and prolonged training periods (i.e., months of pretraining) significantly escalate the probability of failures when training large language models (LLMs). Efficient and reliable fault-tolerance methods are therefore urgently needed. Checkpointing is the primary fault-tolerance method, periodically saving parameter snapshots from GPU memory to disks via CPU memory. In this paper, we identify that the frequency of existing checkpoint-based fault tolerance is significantly limited by storage I/O overheads, which results in hefty re-training costs when restarting from the nearest checkpoint. In response to this gap, we introduce an in-memory fault-tolerance framework for large-scale LLM pretraining. The framework boosts the efficiency and reliability of fault tolerance in three respects: (1) Reduced data transfer and I/O: by asynchronously caching parameters, i.e., sharded model parameters, optimizer states, and RNG states, to CPU volatile memory, our framework significantly reduces communication costs and bypasses checkpoint I/O. (2) Enhanced system reliability: our framework enhances parameter protection with a two-layer hierarchy: snapshot management processes (SMPs) safeguard against software failures, while Erasure Coding (EC) protects against node failures. This double-layered protection greatly improves the survival probability of the parameters compared to existing checkpointing methods. (3) Improved snapshotting frequency: our framework achieves more frequent snapshotting than asynchronous checkpointing optimizations under the same saving-time budget, which improves fault-tolerance efficiency. Empirical results demonstrate that our framework minimizes the overhead of fault tolerance in LLM pretraining by effectively leveraging redundant CPU resources.

Comment: Fault Tolerance, Checkpoint Optimization, Large Language Model, 3D parallelism
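The asynchronous in-memory snapshotting described in aspect (1) can be sketched as follows. This is a minimal illustration, assuming a plain dict of host-side parameters and a background thread for the snapshot write; the class and method names are hypothetical, not the paper's actual implementation.

```python
import copy
import threading

# Illustrative sketch of asynchronous in-memory snapshotting: training
# proceeds while a background thread installs a CPU-side parameter
# snapshot, so no disk I/O sits on the critical path.
class SnapshotManager:
    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot = None

    def take_snapshot_async(self, params: dict) -> threading.Thread:
        """Copy params now (so later updates cannot race with the write),
        then install the frozen copy from a background thread."""
        frozen = copy.deepcopy(params)

        def _worker():
            with self._lock:
                self.snapshot = frozen

        t = threading.Thread(target=_worker)
        t.start()
        return t

    def restore(self) -> dict:
        """Return a copy of the most recent snapshot for recovery."""
        with self._lock:
            return copy.deepcopy(self.snapshot)

mgr = SnapshotManager()
params = {"w": [0.1, 0.2], "step": 100}
t = mgr.take_snapshot_async(params)
params["step"] = 101              # training moves on immediately
t.join()
assert mgr.restore()["step"] == 100   # snapshot preserved the old state
```

In a real deployment the copy itself would be sharded and overlapped with training, and the snapshot buffers would additionally be erasure-coded across nodes, per aspect (2) above.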