322 research outputs found

    Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds

    Get PDF
    International audienceWith increasing interest among mainstream users to run HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to the acquisition and maintenance of expensive hardware, often out of the financial capabilities of such users. Also, one of the critical needs of HPC applications is an efficient, scalable and persistent storage. Unfortunately, storage options proposed by cloud providers are not standardized and typically use a different access model. In this context, the local disks on the compute nodes can be used to save large data sets such as the data generated by Checkpoint-Restart (CR). This local storage offers high throughput and scalability but it needs to be combined with persistency techniques, such as block replication or erasure codes. One of the main challenges that such techniques face is to minimize the overhead of performance and I/O resource utilization (i.e., storage space and bandwidth), while at the same time guaranteeing high reliability of the saved data. This paper introduces a novel persistency technique that leverages Reed-Solomon (RS) encoding to save data in a reliable fashion. Compared to traditional approaches that rely on block replication, we demonstrate about 50% higher throughput while reducing network bandwidth and storage utilization by a factor of 2 for the same targeted reliability level. This is achieved both by modeling and real life experimentation on hundreds of nodes

    Designing SSI clusters with hierarchical checkpointing and single I/O space

    Get PDF
    Adopting a new hierarchical checkpointing architecture, the authors develop a single I/O address space for building highly available clusters of computers. They propose a systematic approach to achieving a single system image by integrating existing middleware support with the newly developed features.published_or_final_versio

    Scalable Techniques for Fault Tolerant High Performance Computing

    Get PDF
    As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure of these systems is becoming significantly shorter than the execu- tion time of many parallel applications. It is increasingly important for large parallel applications to be able to continue to execute in spite of the failure of some components in the system. Today’s long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that need to be saved into stable storage increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this research, we explore scalable techniques to tolerate a small number of process failures in large scale parallel computing. The goal of this research is to develop scalable fault tolerance techniques to help to make future high performance computing appli- cations self-adaptive and fault survivable. The fundamental challenge in this research is scalability. To approach this challenge, this research (1) extended existing diskless checkpointing techniques to enable them to better scale in large scale high performance computing systems; (2) designed checkpoint-free fault tolerance techniques for linear al- gebra computations to survive process failures without checkpoint or rollback recovery; (3) developed coding approaches and novel erasure correcting codes to help applications to survive multiple simultaneous process failures. The fault tolerance schemes we introduce in this dissertation are scalable in the sense that the overhead to tolerate a failure of a fixed number of processes does not increase as the number of total processes in a parallel system increases. Two prototype examples have been developed to demonstrate the effectiveness of our techniques. In the first example, we developed a fault survivable conjugate gradi- ent solver that is able to survive multiple simultaneous process failures with negligible overhead. In the second example, we incorporated our checkpoint-free fault tolerance technique into the ScaLAPACK/PBLAS matrix-matrix multiplication code to evaluate the overhead, survivability, and scalability. Theoretical analysis indicates that, to sur- vive a fixed number of process failures, the fault tolerance overhead (without recovery) for matrix-matrix multiplication decreases to zero as the total number of processes (as- suming a fixed amount of data per process) increases to infinity. Experimental results demonstrate that the checkpoint-free fault tolerance technique introduces surprisingly low overhead even when the total number of processes used in the application is small

    Parallel software for lattice N=4 supersymmetric Yang--Mills theory

    Get PDF
    We present new parallel software, SUSY LATTICE, for lattice studies of four-dimensional N=4\mathcal N = 4 supersymmetric Yang--Mills theory with gauge group SU(N). The lattice action is constructed to exactly preserve a single supersymmetry charge at non-zero lattice spacing, up to additional potential terms included to stabilize numerical simulations. The software evolved from the MILC code for lattice QCD, and retains a similar large-scale framework despite the different target theory. Many routines are adapted from an existing serial code, which SUSY LATTICE supersedes. This paper provides an overview of the new parallel software, summarizing the lattice system, describing the applications that are currently provided and explaining their basic workflow for non-experts in lattice gauge theory. We discuss the parallel performance of the code, and highlight some notable aspects of the documentation for those interested in contributing to its future development.Comment: Code available at https://github.com/daschaich/sus

    Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

    Full text link
    Extensive system scales (i.e. thousands of GPU/TPUs) and prolonged training periods (i.e. months of pretraining) significantly escalate the probability of failures when training large language models (LLMs). Thus, efficient and reliable fault-tolerance methods are in urgent need. Checkpointing is the primary fault-tolerance method to periodically save parameter snapshots from GPU memory to disks via CPU memory. In this paper, we identify the frequency of existing checkpoint-based fault-tolerance being significantly limited by the storage I/O overheads, which results in hefty re-training costs on restarting from the nearest checkpoint. In response to this gap, we introduce an in-memory fault-tolerance framework for large-scale LLM pretraining. The framework boosts the efficiency and reliability of fault tolerance from three aspects: (1) Reduced Data Transfer and I/O: By asynchronously caching parameters, i.e., sharded model parameters, optimizer states, and RNG states, to CPU volatile memory, Our framework significantly reduces communication costs and bypasses checkpoint I/O. (2) Enhanced System Reliability: Our framework enhances parameter protection with a two-layer hierarchy: snapshot management processes (SMPs) safeguard against software failures, together with Erasure Coding (EC) protecting against node failures. This double-layered protection greatly improves the survival probability of the parameters compared to existing checkpointing methods. (3) Improved Snapshotting Frequency: Our framework achieves more frequent snapshotting compared with asynchronous checkpointing optimizations under the same saving time budget, which improves the fault tolerance efficiency. Empirical results demonstrate that Our framework minimizes the overhead of fault tolerance of LLM pretraining by effectively leveraging redundant CPU resources.Comment: Fault Tolerance, Checkpoint Optimization, Large Language Model, 3D parallelis

    Load-Balance and Fault-Tolerance for Massively Parallel Phylogenetic Inference

    Get PDF

    Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

    Full text link
    • …
    corecore