10,055 research outputs found

    Algorithmic Based Fault Tolerance Applied to High Performance Computing

    Full text link
    We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly

    Active Virtual Network Management Prediction: Complexity as a Framework for Prediction, Optimization, and Assurance

    Full text link
    Research into active networking has provided the incentive to re-visit what has traditionally been classified as distinct properties and characteristics of information transfer such as protocol versus service; at a more fundamental level this paper considers the blending of computation and communication by means of complexity. The specific service examined in this paper is network self-prediction enabled by Active Virtual Network Management Prediction. Computation/communication is analyzed via Kolmogorov Complexity. The result is a mechanism to understand and improve the performance of active networking and Active Virtual Network Management Prediction in particular. The Active Virtual Network Management Prediction mechanism allows information, in various states of algorithmic and static form, to be transported in the service of prediction for network management. The results are generally applicable to algorithmic transmission of information. Kolmogorov Complexity is used and experimentally validated as a theory describing the relationship among algorithmic compression, complexity, and prediction accuracy within an active network. Finally, the paper concludes with a complexity-based framework for Information Assurance that attempts to take a holistic view of vulnerability analysis

    Adaptive control in rollforward recovery for extreme scale multigrid

    Full text link
    With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is re-constructed by an asynchronous on-line recovery. The computations in both the faulty and healthy subdomains must be coordinated in a sensitive way, in particular, both under and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchical weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria which differ in their weights. Failure scenarios when solving up to 6.9â‹…10116.9\cdot10^{11} unknowns on more than 245\,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method
    • …
    corecore