1,229 research outputs found

    Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1863-z [Abstract] The Message Passing Interface (MPI) standard is the most popular parallel programming model for distributed systems. However, it lacks fault-tolerance support and, traditionally, failures are addressed with stop-and-restart checkpointing solutions. The proposal of User Level Failure Mitigation (ULFM) for the inclusion of resilience capabilities in the MPI standard provides new opportunities in this field, allowing the implementation of resilient MPI applications, i.e., applications that are able to detect and react to failures without stopping their execution. This work compares the performance of a traditional stop-and-restart checkpointing solution with its equivalent resilience proposal. Both approaches are built on top of ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications, and they make it possible to transparently obtain fault-tolerant MPI applications from generic MPI Single Program Multiple Data (SPMD) applications. The evaluation is focused on the scalability of the two solutions, comparing both proposals using up to 3072 cores.
    Funding: Ministerio de Economía y Competitividad, grant TIN2013-42148-P; Ministerio de Economía y Competitividad, grant BES-2014-068066; Galicia. Consellería de Cultura, Educación e Ordenación Universitaria, grant GRC2013/05

    Resilience of Parallel Applications

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. Future exascale systems are predicted to be formed by millions of cores. This is a great opportunity for HPC applications; however, it is also a hazard for the completion of their execution. Even if each computation node fails only once per century, a machine with 100,000 nodes will encounter a failure roughly every 9 hours. Thus, HPC applications need to make use of fault tolerance techniques to ensure they successfully finish their execution. This PhD thesis is focused on fault tolerance solutions for generic parallel applications, more specifically on checkpointing solutions. We have extended CPPC, an MPI application-level portable checkpointing tool developed in our research group, to work with OpenMP applications and hybrid MPI-OpenMP applications. Currently, we are working on transparently obtaining resilient MPI applications, that is, applications that are able to recover from failures without stopping their execution.
    Funding: European Cooperation in Science and Technology (COST). This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P, and the predoctoral grant of Nuria Losada ref. BES-2014-068066) and by EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
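
The failure-interval estimate quoted in the abstract above (one failure per node per century, 100,000 nodes, a failure about every 9 hours) can be checked with a short calculation. The sketch below is illustrative only: the function name and the 365.25-day year are assumptions, and it further assumes that nodes fail independently at the same constant rate, which the abstract does not state.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length, in hours


def system_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Mean time between failures of the whole machine, in hours.

    Assumes independent nodes with identical, constant failure rates,
    so the machine-wide failure rate is the sum of the node rates.
    """
    if node_mtbf_years <= 0 or num_nodes <= 0:
        raise ValueError("node_mtbf_years and num_nodes must be positive")
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes


# One failure per node per century, 100,000 nodes: close to 9 hours.
print(f"{system_mtbf_hours(100, 100_000):.2f} hours between failures")
```

Under these assumptions the interval scales inversely with the node count, so doubling the machine size halves the expected time between failures.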

    In-memory application-level checkpoint-based migration for MPI programs

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-014-1120-2 [Abstract] Process migration provides many benefits for parallel environments, including dynamic load balancing, data access locality and fault tolerance. This paper describes an in-memory application-level checkpoint-based migration solution for MPI codes that uses the Hierarchical Data Format 5 (HDF5) to write the checkpoint files. The main features of the proposed solution are transparency for the user, achieved through the use of CPPC (ComPiler for Portable Checkpointing); portability, as the application-level approach makes the solution adequate for any MPI implementation and operating system, and the use of the HDF5 file format enables the restart on different architectures; and high performance, by saving the checkpoint files to memory instead of to disk through the use of the HDF5 in-memory files. Experimental results show that the in-memory approach significantly reduces the I/O cost of the migration process.
    Funding: Ministerio de Ciencia e Innovación, grant TIN2010-16735; Galicia. Consellería de Economía e Industria, grant 10PXIB105180P

    Resilient MPI applications using an application-level checkpointing framework and ULFM

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7 [Abstract] Future exascale systems, formed by millions of cores, will present high failure rates, and long-running applications will need to make use of new fault tolerance techniques to ensure successful execution completion. The Fault Tolerance Working Group, within the MPI forum, has presented the User Level Failure Mitigation (ULFM) proposal, providing new functionalities for the implementation of resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionalities. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, multithreaded multilevel checkpointing, in which the checkpoint files are saved in different memory levels, improves the scalability of the solution. The experimental evaluation shows a low overhead when tolerating failures in one or several MPI processes.
    Funding: Ministerio de Economía y Competitividad, grant TIN2013-42148-P; Ministerio de Economía y Competitividad, grant TIN2014-53522-REDT; Ministerio de Economía y Competitividad, grant BES-2014-068066; Galicia. Consellería de Cultura, Educación e Ordenación Universitaria, grant GRC2013/05

    Failure Avoidance in MPI Applications Using an Application-Level Approach

    [Abstract] Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures. For this reason, hardware failures must be tolerated by the applications to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to provide fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. New advances in the prediction of hardware failures have led to the development of proactive process migration approaches, where tasks are migrated in a preventive way when node failures are anticipated, avoiding the restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate message passing interface (MPI) processes when impending failures are notified, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, by means of the design of a light and asynchronous protocol to achieve a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library; and portability, as it is not locked into a particular architecture, operating system or MPI implementation.
    Funding: Ministerio de Ciencia e Innovación, grant TIN2010-16735; Galicia. Consellería de Economía e Industria, grant 10PXIB105180P

    Reducing the overhead of an MPI application-level migration approach

    [Abstract] Process migration provides many benefits for parallel environments, including dynamic load balancing, data access locality, and fault tolerance. This work proposes a solution that reduces the memory and I/O overhead in an application-level checkpoint-based migration approach. The proposal splits the checkpoint files in order to overlap the writing of the state in the terminating processes with the read and restart operations in the newly spawned processes. It has been tested using the MPI NAS Parallel Benchmarks, showing encouraging results in terms of both memory consumption and I/O migration times.
    Funding: Ministerio de Economía y Competitividad, grant TIN2013-42148-P; Galicia. Consellería de Cultura, Educación e Ordenación Universitaria, grant GRC2013/05

    Parallelization of ARACNe, an Algorithm for the Reconstruction of Gene Regulatory Networks

    [Abstract] Gene regulatory networks are graphical representations of molecular regulators that interact with each other and with other substances in the cell to govern gene expression. There are different computational approaches for the reverse engineering of these networks. Most of them require all gene-gene evaluations using different mathematical methods such as Pearson/Spearman correlation, Mutual Information or topology patterns, among others. The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) is one of the most effective and widely used tools to reconstruct gene regulatory networks. However, the high computational cost of ARACNe prevents its use on large biological datasets. In this work, we present a hybrid MPI/OpenMP parallel implementation of ARACNe to accelerate its execution on multi-core clusters, obtaining a speedup of 430.46 on 32 nodes (each with 24 cores) for an input dataset with 41,100 genes and 108 samples.
    Funding: Ministerio de Economía y Competitividad, grant TIN2016-75845-P; Xunta de Galicia, grant ED431G/01; Xunta de Galicia, grant ED431C 2017/0
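
To put the reported ARACNe speedup in context, the sketch below converts it into a parallel efficiency. It is an illustration under stated assumptions: the abstract does not say which baseline the speedup was measured against, so the code assumes a sequential baseline and that all 32 x 24 = 768 cores were in use; the function name is hypothetical.

```python
def parallel_efficiency(speedup: float, nodes: int, cores_per_node: int) -> float:
    """Fraction of ideal linear speedup achieved across all cores."""
    total_cores = nodes * cores_per_node
    if total_cores <= 0:
        raise ValueError("the total core count must be positive")
    return speedup / total_cores


# Figures quoted in the abstract: speedup 430.46 on 32 nodes of 24 cores.
efficiency = parallel_efficiency(430.46, nodes=32, cores_per_node=24)
print(f"efficiency on 768 cores: {efficiency:.1%}")
```

Under these assumptions the efficiency comes out at roughly 56% of ideal linear scaling.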

    High Performance Air Quality Simulation in the European CrossGrid Project

    This paper focuses on one of the applications involved in the CrossGrid project: the STEM-II air pollution model, used to simulate the environment of As Pontes Power Plant in A Coruña (Spain). The CrossGrid project offers a Grid environment oriented towards computation- and data-intensive applications that need interaction with an external user. The air pollution model needs the interaction of an expert in order to make decisions about modifications in the industrial process to fulfil the European standard on emissions and air quality. This paper shows the benefits of using different CrossGrid components for running the application on a Grid infrastructure and presents some preliminary results on the CrossGrid testbed.

    Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

    [Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks, showing that most of the applications present small checkpoint overheads.
    Funding: Ministerio de Economía y Competitividad, grant TIN2013-42148-

    PyToxo: a Python tool for calculating penetrance tables of high-order epistasis models

    [Abstract] Background: Epistasis is the interaction between different genes when expressing a certain phenotype. If epistasis involves more than two loci, it is called high-order epistasis. High-order epistasis is an area under active research because it could be the cause of many complex traits. The most common way to specify an epistasis interaction is through a penetrance table. Results: This paper presents PyToxo, a Python tool for generating penetrance tables from any-order epistasis models. Unlike other tools available in the literature, PyToxo is able to work with high-order models and realistic penetrance and heritability values, achieving high-precision results in a short time. In addition, PyToxo is distributed as open-source software and includes several interfaces to ease its use. Conclusions: PyToxo provides the scientific community with a useful tool to evaluate algorithms and methods that can detect high-order epistasis, to continue advancing in the discovery of the causes behind complex diseases.
    Funding: This study and publication costs were funded by the Ministry of Science and Innovation of Spain (grant PID2019-104184RB-I00/AEI/10.13039/501100011033) and by Xunta de Galicia and FEDER funds of the EU (CITIC-Centro de Investigación de Galicia accreditation, grant ED431G 2019/01; Consolidation Program of Competitive Reference Groups, grant ED431C 2021/30). CP was funded by the Ministry of Education of Spain (grant FPU16/01333). The funders did not play any role in the design of the study, the collection, analysis, and interpretation of data, or in writing of the manuscript. Xunta de Galicia, grant ED431G 2019/01; Xunta de Galicia, grant ED431C 2021/3
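
As an illustration of what a penetrance table encodes, the sketch below computes the prevalence and heritability implied by a small table. It does not use PyToxo or reproduce its API: the function names, the example table and the minor allele frequencies are all hypothetical, genotype frequencies are assumed to follow Hardy-Weinberg equilibrium with independent loci, and heritability is taken as the frequency-weighted variance of the penetrances divided by K(1 - K), where K is the prevalence, a definition commonly used for epistasis models.

```python
from itertools import product


def genotype_frequencies(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def prevalence_and_heritability(penetrance, mafs):
    """Return (prevalence, heritability) of a penetrance table.

    penetrance maps a genotype tuple (minor-allele count per locus) to
    P(phenotype | genotype); mafs holds one minor allele frequency per locus.
    """
    per_locus = [genotype_frequencies(maf) for maf in mafs]
    weighted = []
    for genotype in product(range(3), repeat=len(mafs)):
        freq = 1.0
        for locus, copies in enumerate(genotype):
            freq *= per_locus[locus][copies]  # loci assumed independent
        weighted.append((freq, penetrance[genotype]))
    prevalence = sum(f * p for f, p in weighted)
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    variance = sum(f * (p - prevalence) ** 2 for f, p in weighted)
    return prevalence, variance / (prevalence * (1.0 - prevalence))


# Hypothetical two-locus table: risk is raised only when both loci carry
# at least one minor allele (values chosen purely for illustration).
table = {g: (0.3 if min(g) >= 1 else 0.05) for g in product(range(3), repeat=2)}
k, h2 = prevalence_and_heritability(table, mafs=[0.2, 0.2])
print(f"prevalence={k:.4f} heritability={h2:.4f}")
```

A table with the same penetrance for every genotype yields zero heritability, since genotype then carries no information about the phenotype.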