148 research outputs found

    Algorithmic Based Fault Tolerance Applied to High Performance Computing

    Full text link
    We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly

    Enhancing Energy Production with Exascale HPC Methods

    Get PDF
    High Performance Computing (HPC) resources have become the key actor for achieving more ambitious challenges in many disciplines. In this step beyond, an explosion on the available parallelism and the use of special purpose processors are crucial. With such a goal, the HPC4E project applies new exascale HPC techniques to energy industry simulations, customizing them if necessary, and going beyond the state-of-the-art in the required HPC exascale simulations for different energy sources. In this paper, a general overview of these methods is presented as well as some specific preliminary results.The research leading to these results has received funding from the European Union's Horizon 2020 Programme (2014-2020) under the HPC4E Project (www.hpc4e.eu), grant agreement n° 689772, the Spanish Ministry of Economy and Competitiveness under the CODEC2 project (TIN2015-63562-R), and from the Brazilian Ministry of Science, Technology and Innovation through Rede Nacional de Pesquisa (RNP). Computer time on Endeavour cluster is provided by the Intel Corporation, which enabled us to obtain the presented experimental results in uncertainty quantification in seismic imagingPostprint (author's final draft

    Heterogeneous hierarchical workflow composition

    Get PDF
    Workflow systems promise scientists an automated end-to-end path from hypothesis to discovery. However, expecting any single workflow system to deliver such a wide range of capabilities is impractical. A more practical solution is to compose the end-to-end workflow from more than one system. With this goal in mind, the integration of task-based and in situ workflows is explored, where the result is a hierarchical heterogeneous workflow composed of subworkflows, with different levels of the hierarchy using different programming, execution, and data models. Materials science use cases demonstrate the advantages of such heterogeneous hierarchical workflow composition.This work is a collaboration between Argonne National Laboratory and the Barcelona Supercomputing Center within the Joint Laboratory for Extreme-Scale Computing. This research is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC02- 06CH11357, program manager Laura Biven, and by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contract 2014-SGR-1051).Peer ReviewedPostprint (author's final draft

    PaPaS: A Portable, Lightweight, and Generic Framework for Parallel Parameter Studies

    Full text link
    The current landscape of scientific research is widely based on modeling and simulation, typically with complexity in the simulation's flow of execution and parameterization properties. Execution flows are not necessarily straightforward since they may need multiple processing tasks and iterations. Furthermore, parameter and performance studies are common approaches used to characterize a simulation, often requiring traversal of a large parameter space. High-performance computers offer practical resources at the expense of users handling the setup, submission, and management of jobs. This work presents the design of PaPaS, a portable, lightweight, and generic workflow framework for conducting parallel parameter and performance studies. Workflows are defined using parameter files based on keyword-value pairs syntax, thus removing from the user the overhead of creating complex scripts to manage the workflow. A parameter set consists of any combination of environment variables, files, partial file contents, and command line arguments. PaPaS is being developed in Python 3 with support for distributed parallelization using SSH, batch systems, and C++ MPI. The PaPaS framework will run as user processes, and can be used in single/multi-node and multi-tenant computing systems. An example simulation using the BehaviorSpace tool from NetLogo and a matrix multiply using OpenMP are presented as parameter and performance studies, respectively. The results demonstrate that the PaPaS framework offers a simple method for defining and managing parameter studies, while increasing resource utilization.Comment: 8 pages, 6 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, US

    Collaborative e-science architecture for Reaction Kinetics research community

    Get PDF
    This paper presents a novel collaborative e-science architecture (CeSA) to address two challenging issues in e-science that arise from the management of heterogeneous distributed environments: (i) how to provide individual scientists an integrated environment to collaborate with each other in distributed, loosely coupled research communities where each member might be using a disparate range of tools; and (ii) how to provide easy access to a range of computationally intensive resources from a desktop. The Reaction Kinetics research community was used to capture the requirements and in the evaluation of the proposed architecture. The result demonstrated the feasibility of the approach and the potential benefits of the CeSA
    • …
    corecore