1,215 research outputs found

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks

    Session-Based Programming for Parallel Algorithms: Expressiveness and Performance

    Full text link
    This paper investigates session programming and typing of benchmark examples to compare productivity, safety and performance with other communications programming languages. Parallel algorithms are used to examine the above aspects due to their extensive use of message passing for interaction, and their increasing prominence in algorithmic research with the rising availability of hardware resources such as multicore machines and clusters. We contribute new benchmark results for SJ, an extension of Java for type-safe, binary session programming, against MPJ Express, a Java messaging system based on the MPI standard. In conclusion, we observe that (1) despite rich libraries and functionality, MPI remains a low-level API, and can suffer from commonly perceived disadvantages of explicit message passing such as deadlocks and unexpected message types, and (2) the benefits of high-level session abstraction, which has significant impact on program structure to improve readability and reliability, and session type-safety can greatly facilitate the task of communications programming whilst retaining competitive performance

    Soft-fault recovery in MPI applications

    Get PDF
    [Abstract] Current high-performance computing (HPC) systems are comprised of thousands of CPU cores, and this number is expected to grow into the millions in the near future. With such an elevated number of processors, the mean time between failures (MTBF) can become so small that most scientific applications will not have time to complete their execution before a failure occurs. It is therefore critical to develop fault tolerance and resilience mechanisms in order to guarantee the completion and integrity of massively parallel applications. University of A Coruña’s Computer Architecture Group (GAC) proposed a solution (Controller/comPiler for Portable Checkpointing - CPPC) in order to transparently convert generic MPI applications into fault tolerant applications, based on a checkpoint-restart scheme. CPPC was extended by Nuria Losada into CPPC-resilience in order to make resilient MPI applications, that is, those that are capable of detecting and reacting to failures without aborting the application, such that survivor processes don’t have to be restarted. This was accomplished by means of a logging protocol and the usage of a proposed fault tolerance interface addition to the MPI standard (User Level Failure Mitigation). However, this system cannot handle soft errors efficiently, since it kills and respawns the failed processes entirely when it is not necessary, as these errors are transient in nature. The object of this project is to extend and adapt CPPCresilience in order to handle soft errors in a more efficient manner, without having to respawn the failed processes. This proposal has been evaluated using 3 MPI applications with different characteristics, achieving a decrease in recovery times after a soft error ranging from 2 to 44 percent, depending on the total number of processes involved.[Resumo] Os sistemas actuais de computación de altas prestacións (HPC) están formados por miles de núcleos de procesadores, e espérase que este número aumente ata os millóns nun futuro cercano. Cun número tan elevado de procesadores, o tempo medio entre fallos (MTBF) pode chegar a reducirse tanto que a maioría de computacións científicas non terían tempo de completar a súa execución antes de que ocorrese un fallo. Polo tanto, é crítico o desenvolvemento de sistemas tolerantes e resilientes a fallos, para garantizar a finalización e integridade das aplicacións masivamente paralelas. O Grupo de Arquitectura de Computadores (GAC) da UDC propuxo unha solución (Controller/comPiler for Portable Checkpointing - CPPC) para convertir de xeito transparente aplicatcións MPI xenéricas en aplicacións tolerantes a fallos, basándose nun esquema de checkpointing e reinicio. CPPC foi posteriormente extendido por Nuria Losada en CPPC-resilience coa finalidade de crear aplicacións MPI resilientes, é dicir, aquelas que son capaces de detectar e reaccionar a fallos sen abortar a aplicación, de xeito que os procesos superviventes non necesitan ser reiniciados. Isto logrouse mediante un protocolo de logging de mensaxes e o uso dunha interfaz de tolerancia a fallos, ULFM (User Level Failure Mitigation), proposta para adición ao estándar MPI. Sen embargo, este sistema non xestiona os errores soft de maneira eficiente, xa que mata e reinicia os procesos fallados por completo cando non é necesario, xa que este tipo de errores teñen natureza transitoria. A meta deste TFG é extender e adaptar CPPC-resilience para poder manexar os errores soft eficientemente, sen ter que reiniciar os procesos fallados. Esta proposta foi evaluada utilizando 3 aplicacións MPI con diferentes características, conseguindo unha redución nos tempos de recuperación tras un erro soft de entre un 2 e un 44 por cento, dependendo do número total de procesos involucrados

    MPI Thread-Level Checking for MPI+OpenMP Applications

    Get PDF
    International audienceMPI is the most widely used parallel programming model. But the reducing amount of memory per compute core tends to push MPI to be mixed with shared-memory approaches like OpenMP. In such cases, the interoperability of those two models is challenging. The MPI 2.0 standard defines the so-called thread level to indicate how MPI will interact with threads. But even if hybrid programs are more common, there is still a lack in debugging tools and more precisely in thread level compliance. To fill this gap, we propose a static analysis to verify the thread-level required by an application. This work extends PARCOACH, a GCC plugin focused on the detection of MPI collective errors in MPI and MPI+OpenMP programs. We validated our analysis on computational benchmarks and applications and measured a low overhead

    An analysis of key generation efficiency of RSA cryptosystem in distributed environments

    Get PDF
    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2005Includes bibliographical references (leaves: 68)Text in English Abstract: Turkish and Englishix, 74 leavesAs the size of the communication through networks and especially through Internet grew, there became a huge need for securing these connections. The symmetric and asymmetric cryptosystems formed a good complementary approach for providing this security. While the asymmetric cryptosystems were a perfect solution for the distribution of the keys used by the communicating parties, they were very slow for the actual encryption and decryption of the data flowing between them. Therefore, the symmetric cryptosystems perfectly filled this space and were used for the encryption and decryption process once the session keys had been exchanged securely. Parallelism is a hot research topic area in many different fields and being used to deal with problems whose solutions take a considerable amount of time. Cryptography is no exception and, computer scientists have discovered that parallelism could certainly be used for making the algorithms for asymmetric cryptosystems go faster and the experimental results have shown a good promise so far. This thesis is based on the parallelization of a famous public-key algorithm, namely RSA

    Performance evaluation and enhancement of Dendro

    Get PDF
    DENDRO is a collection of tools for solving Finite Element problems in parallel. This package is written in C++ using the standard template library (STL) and uses the Message Passing (MPI). Dendro uses an octree data-structure to solve image-registration problems using finite element techniques. For analyzing the behavior of the package in terms of speed-up and scalability, it is important to know which part of the package is consuming most of the execution-time. The single node performance and the overall performance of the package is dependent on the code-organization and class-hierarchy. We used the PETSC profiler to collect the performance statistics and instrument the code to know which part of the code takes most of the time. Along with the function-specific execution timings, PETSC profiler also provides the information regarding how many floating point operations is being performed in total and on average (FLOP/second). PETSC also provides information related to memory usage and number of MPI messages and reductions being performed to execute that particular function. We have analyzed these performance-statistics to provide some guidelines to how we can make Dendro more efficient by optimizing certain functions. We obtained around 12X speedup over the performance of (default) Dendro by using compiler-provided optimizations and achieved more than 65% speedup over compiler optimized performance (20X over the naive Dendro performance) by manually tuning some-block of code along with the compiler-optimizations

    PARCOACH: Combining static and dynamic validation of MPI collective communications

    Get PDF
    International audienceNowadays most scientific applications are parallelized based on MPI communications. Collective MPI communications have to be executed in the same order by all processes in their communicator and the same number of times, otherwise it is not conforming to the standard and a deadlock or other undefined behavior can occur. As soon as the control-flow involving these collective operations becomes more complex, in particular including conditionals on process ranks, ensuring the correction of such code is error-prone. We propose in this paper a static analysis to detect when such situation occurs, combined with a code transformation that prevents from deadlocking. We focus on blocking MPI collective operations in SPMD applications, assuming MPI calls are not nested in multithreaded regions. We show on several benchmarks the small impact on performance and the ease of integration of our techniques in the development process

    MPI WITHIN A GPU

    Get PDF
    GPUs offer high-performance floating-point computation at commodity prices, but their usage is hindered by programming models which expose the user to irregularities in the current shared-memory environments and require learning new interfaces and semantics. This thesis will demonstrate that the message-passing paradigm can be conceptually cleaner than the current data-parallel models for programming GPUs because it can hide the quirks of current GPU shared-memory environments, as well as GPU-specific features, behind a well-established and well-understood interface. This will be shown by demonstrating a proof-of-concept MPI implementation which provides cleaner, simpler code with a reasonable performance cost. This thesis will also demonstrate that, although there is a virtualization constraint imposed by MPI, this constraint is harmless as long as the virtualization was already chosen to be optimal in terms of a strong execution model and nearly-optimal execution time. This will be demonstrated by examining execution times with varying virtualization using a computationally-expensive micro-kernel
    • …
    corecore