20 research outputs found

    Compiler assisted chekpointing of message-passing applications in heterogeneous environments

    Get PDF
    [Resumen] With the evolution of high performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities, Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad-hoc representations renders checkpointing tools incapable of working on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data needs to be stored, where to store it, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency, while preserving the scalability of the executed applications. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler achieves transparency by relieving the user from time-consuming tasks, such as performing code analyses and adding instrumentation code

    A checkpointing mechanism for GPU intensive HPC applications

    Get PDF
    Please refer to pdf.James Watt ScholarshipEngineering and Physical Sciences Research Council (EPSRC) grants EP/N028201/1 and EP/L00058X/

    Application-level Fault Tolerance and Resilience in HPC Applications

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] As necesidades computacionais das distintas ramas da ciencia medraron enormemente nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado polos supercomputadores. Cada vez constrúense sistemas de computación de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos, o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires que os programas científicos poidan completar a súa execución, evitando ademais que se dispare o consumo de enerxía. O checkpoint/restart é unha das técnicas máis populares. Sen embargo, a maioría da investigación levada a cabo nas últimas décadas céntrase en estratexias stop-and-restart para aplicacións de memoria distribuída tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart a nivel de aplicación para os modelos de programación paralela roáis populares en supercomputación. Implementáronse protocolos de checkpointing para aplicacións híbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos dous casos prestando especial coidado á portabilidade e maleabilidade da solución. En canto a aplicacións de memoria distribuída, proponse unha solución de resiliencia que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global. A maiores, esta solución de resiliencia optimizouse implementando unha volta atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de almacenaxe de mensaxes para garantires a consistencia e o progreso da execución. Por último, propónse a extensión dunha librería de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria. En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen] El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores. Cada vez se construyen sistemas de computación de altas prestaciones mayores, con más recursos hardware de distintos tipos, lo que hace que las tasas de fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos eficientes resulta indispensable para garantizar que los programas científicos puedan completar su ejecución, evitando además que se dispare el consumo de energía. La técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte de la investigación en este campo se ha centrado en estrategias stop-and-restart para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de programación paralela más populares en supercomputación. Se han implementado protocolos de checkpointing para aplicaciones híbridas MPI-OpenMP y aplicaciones heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria distribuida, se propone una solución de resiliencia que puede ser usada de forma genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje de mensajes para garantizar la consistencia y el progreso de la ejecución. Por último, se propone una extensión de una librería de checkpointing para facilitar la implementación de estrategias de recuperación ad hoc ante corrupciones de memoria. Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract] The rapid increase in the computational demands of science has lead to a pronounced growth in the performance offered by supercomputers. As High Performance Computing (HPC) systems grow larger, including more hardware components of different types, the system's failure rate becomes higher. Efficient fault tolerance techniques are essential not only to ensure the execution completion but also to save energy. Checkpoint/restart is one of the most popular fault tolerance techniques. However, most of the research in this field is focused on stop-and-restart strategies for distributed-memory applications in the event of fail-stop failures. Thís thesis focuses on the implementation of application-level checkpoint/restart solutions for the most popular parallel programming models used in HPC. Hence, we have implemented checkpointing solutions to cope with fail-stop failures in hybrid MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize the restart portability and malleability, ie., the recovery can take place on machines with different CPU / accelerator architectures, and/ or operating systems, and can be adapted to the available resources (number of cores/accelerators). Regarding distributed-memory applications, we propose a resilience solution that can be generally applied to SPMD MPI programs. Resilient applications can detect and react to failures without aborting their execution upon fail-stop failures. Instead, failed processes are re-spawned, and the application state is recovered through a global rollback. Moreover, we have optimized this resilience proposal by implementing a local rollback protocol, in which only failed processes rollback to a previous state, while message logging enables global consistency and further progress of the computation. Finally, we have extended a checkpointing library to facilitate the implementation of ad hoc recovery strategies in the event of soft errors) caused by memory corruptions. Many times, these errors can be handled at the software-Ievel, tIms, avoiding fail-stop failures and enabling a more efficient recovery

    Creating science-driven computer architecture: A new path to scientific leadership

    Full text link

    ClusterRAID: Architecture and Prototype of a Distributed Fault-Tolerant Mass Storage System for Clusters

    Get PDF
    During the past few years clusters built from commodity off-the-shelf (COTS) components have emerged as the predominant supercomputer architecture. Typically comprising a collection of standard PCs or workstations and an interconnection network, they have replaced the traditionally used integrated systems due to their better price/performance ratio. As paradigms shift from mere computing intensive to I/O intensive applications, mass storage solutions for cluster installations become a more and more crucial aspect of these systems. The inherent unreliability of the underlying components is one of the reasons why no system has been established as a standard storage solution for clusters yet. This thesis sets out the architecture and prototype implementation of a novel distributed mass storage system for commodity off-the-shelf clusters and addresses the issue of the unreliable constituent components. The key concept of the presented system is the conversion of the local hard disk drive of a cluster node into a reliable device while preserving the block device interface. By the deployment of sophisticated erasure-correcting codes, the system allows the adjustment of the number of tolerable failures and thus the overall reliability. In addition, the applied data layout considers the access behaviour of a broad range of applications and minimizes the number of required network transactions. Extensive measurements and functionality tests of the prototype, both stand-alone and in conjunction with local or distributed file systems, show the validity of the concept

    Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing

    Get PDF
    Two major trends in large-scale computing are the rapid growth in HPC with in particular an international exascale initiative, and the dramatic expansion of Cloud infrastructures accompanied by the Big Data passion. To satisfy the continuous demands for increasing computing capacity, future extreme-scale systems will embrace a multi-fold increase in the number of computing, storage, and communication components, in order to support an unprecedented level of parallelism. Despite the capacity and economies benefits, making the upward transformation to extreme-scale poses numerous scientific and technological challenges, two of which are the power consumption and fault tolerance. With the increase in system scale, failure would become a norm rather than an exception, driving the system to significantly lower efficiency with unforeseen power consumption. This thesis aims at simultaneously addressing the above two challenges by introducing a novel fault-tolerant computational model, referred to as \textit{Leaping Shadows}. Based on Shadow Replication, Leaping Shadows associates with each main process a suite of coordinated shadow processes, which execute in parallel but at differential rates, to deal with failures and meet the QoS requirements of the underlying application under strict power/energy constraints. In failure-prone extreme-scale computing environments, this new model addresses the limitations of the basic Shadow Replication model, and achieves adaptive and power-aware fault tolerance that is more time and energy efficient than existing techniques. In this thesis, we first present an analytical model based optimization framework that demonstrates Shadow Replication's adaptivity and flexibility in achieving multi-dimensional QoS requirements. Then, we introduce Leaping Shadows as a novel power-aware fault tolerance model, which tolerates multiple types of failures, guarantees forward progress, and maintains a consistent level of resilience. Lastly, the details of a Leaping Shadows implementation in MPI is discussed, along with extensive performance evaluation that includes comparison to checkpoint/restart. Collectively, these efforts advocate an adaptive and power-aware fault tolerance alternative for future extreme-scale computing

    Applications Development for the Computational Grid

    Get PDF

    GRID superscalar: a programming model for the Grid

    Get PDF
    Durant els darrers anys el Grid ha sorgit com una nova plataforma per la computació distribuïda. La tecnologia Gris permet unir diferents recursos de diferents dominis administratius i formar un superordinador virtual amb tots ells. Molts grups de recerca han dedicat els seus esforços a desenvolupar un conjunt de serveis bàsics per oferir un middleware de Grid: una capa que permet l'ús del Grid. De tota manera, utilitzar aquests serveis no és una tasca fácil per molts usuaris finals, cosa que empitjora si l'expertesa d'aquests usuaris no està relacionada amb la informàtica.Això té una influència negativa a l'hora de que la comunitat científica adopti la tecnologia Grid. Es veu com una tecnologia potent però molt difícil de fer servir. Per facilitar l'ús del Grid és necessària una capa extra que amagui la complexitat d'aquest i permeti als usuaris programar o portar les seves aplicacions de manera senzilla.Existeixen moltes propostes d'eines de programació pel Grid. En aquesta tesi fem un resum d'algunes d'elles, i podem veure que existeixen eines conscients i no-conscients del Grid (es programen especificant o no els detalls del Grid, respectivament). A més, molt poques d'aquestes eines poden explotar el paral·lelisme implícit de l'aplicació, i en la majoria d'elles, l'usuari ha de definir aquest paral·lelisme de manera explícita. Una altra característica que considerem important és si es basen en llenguatges de programació molt populars (com C++ o Java), cosa que facilita l'adopció per part dels usuaris finals.En aquesta tesi, el nostre objectiu principal ha estat crear un model de programació pel Grid basat en la programació seqüencial i els llenguatges més coneguts de la programació imperativa, capaç d'explotar el paral·lelisme implícit de les aplicacions i d'accelerar-les fent servir els recursos del Grid de manera concurrent. A més, com el Grid és de naturalesa distribuïda, heterogènia i dinàmica i degut també a que el nombre de recursos que pot formar un Grid pot ser molt gran, la probabilitat de que es produeixi una errada durant l'execució d'una aplicació és elevada. Per tant, un altre dels nostres objectius ha estat tractar qualsevol tipus d'error que pugui sorgir durant l'execució d'una aplicació de manera automàtica (ja siguin errors relacionats amb l'aplicació o amb el Grid). GRID superscalar (GRIDSs), la principal contribució d'aquesta tesi, és un model de programació que assoleix elsobjectius mencionats proporcionant una interfície molt petita i simple i un entorn d'execució que és capaç d'executar en paral·lel el codi proporcionat fent servir el Grid. La nostra interfície de programació permet a un usuari programar una aplicació no-conscient del Grid, amb llenguatges imperatius coneguts i populars (com C/C++, Java, Perl o Shell script) i de manera seqüencial, per tant dóna un pas important per ajudar als usuaris a adoptar la tecnologia Grid.Hem aplicat el nostre coneixement de l'arquitectura de computadors i el disseny de microprocessadors a l'entorn d'execució de GRIDSs. Tal com es fa a un processador superescalar, l'entorn d'execució de GRIDSs és capaç de realitzar un anàlisi de dependències entre les tasques que formen l'aplicació, i d'aplicar tècniques de renombrament per incrementar el seu paral·lelisme. GRIDSs genera automàticament a partir del codi principal de l'usuari un graf que descriu les dependències de dades en l'aplicació. També presentem casos d'ús reals del model de programació en els camps de la química computacional i la bioinformàtica, que demostren que els nostres objectius han estat assolits.Finalment, hem estudiat l'aplicació de diferents tècniques per detectar i tractar fallades: checkpoint, reintent i replicació de tasques. La nostra proposta és proporcionar un entorn capaç de tractar qualsevol tipus d'errors, de manera transparent a l'usuari sempre que sigui possible. El principal avantatge d'implementar aquests mecanismos al nivell del model de programació és que el coneixement a nivell de l'aplicació pot ser explotat per crear dinàmicament una estratègia de tolerància a fallades per cada aplicació, i evitar introduir sobrecàrrega en entorns lliures d'errors.During last years, the Grid has emerged as a new platform for distributed computing. The Grid technology allows joining different resources from different administrative domains and forming a virtual supercomputer with all of them.Many research groups have dedicated their efforts to develop a set of basic services to offer a Grid middleware: a layer that enables the use of the Grid. Anyway, using these services is not an easy task for many end users, even more if their expertise is not related to computer science. This has a negative influence in the adoption of the Grid technology by the scientific community. They see it as a powerful technology but very difficult to exploit. In order to ease the way the Grid must be used, there is a need for an extra layer which hides all the complexity of the Grid, and allows users to program or port their applications in an easy way.There has been many proposals of programming tools for the Grid. In this thesis we give an overview on some of them, and we can see that there exist both Grid-aware and Grid-unaware environments (programmed with or without specifying details of the Grid respectively). Besides, very few existing tools can exploit the implicit parallelism of the application and in the majority of them, the user must define the parallelism explicitly. Another important feature we consider is if they are based in widely used programming languages (as C++ or Java), so the adoption is easier for end users.In this thesis, our main objective has been to create a programming model for the Grid based on sequential programming and well-known imperative programming languages, able to exploit the implicit parallelism of applications and to speed them up by using the Grid resources concurrently. Moreover, because the Grid has a distributed, heterogeneous and dynamic nature and also because the number of resources that form a Grid can be very big, the probability that an error arises during an application's execution is big. Thus, another of our objectives has been to automatically deal with any type of errors which may arise during the execution of the application (application related or Grid related).GRID superscalar (GRIDSs), the main contribution of this thesis, is a programming model that achieves these mentioned objectives by providing a very small and simple interface and a runtime that is able to execute in parallel the code provided using the Grid. Our programming interface allows a user to program a Grid-unaware application with already known and popular imperative languages (such as C/C++, Java, Perl or Shell script) and in a sequential fashion, therefore giving an important step to assist end users in the adoption of the Grid technology.We have applied our knowledge from computer architecture and microprocessor design to the GRIDSs runtime. As it is done in a superscalar processor, the GRIDSs runtime system is able to perform a data dependence analysis between the tasks that form an application, and to apply renaming techniques in order to increase its parallelism. GRIDSs generates automatically from user's main code a graph describing the data dependencies in the application.We present real use cases of the programming model in the fields of computational chemistry and bioinformatics, which demonstrate that our objectives have been achieved.Finally, we have studied the application of several fault detection and treatment techniques: checkpointing, task retry and task replication. Our proposal is to provide an environment able to deal with all types of failures, transparently for the user whenever possible. The main advantage in implementing these mechanisms at the programming model level is that application-level knowledge can be exploited in order to dynamically create a fault tolerance strategy for each application, and avoiding to introduce overhead in error-free environments

    Proceedings of the Workshop on Parallel/High-Performance Object-Oriented Scientific Computing (POOSC '03)

    Get PDF
    corecore