39 research outputs found

    Application-level Fault Tolerance and Resilience in HPC Applications

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] As necesidades computacionais das distintas ramas da ciencia medraron enormemente nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado polos supercomputadores. Cada vez constrúense sistemas de computación de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos, o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires que os programas científicos poidan completar a súa execución, evitando ademais que se dispare o consumo de enerxía. O checkpoint/restart é unha das técnicas máis populares. Sen embargo, a maioría da investigación levada a cabo nas últimas décadas céntrase en estratexias stop-and-restart para aplicacións de memoria distribuída tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart a nivel de aplicación para os modelos de programación paralela roáis populares en supercomputación. Implementáronse protocolos de checkpointing para aplicacións híbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos dous casos prestando especial coidado á portabilidade e maleabilidade da solución. En canto a aplicacións de memoria distribuída, proponse unha solución de resiliencia que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global. A maiores, esta solución de resiliencia optimizouse implementando unha volta atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de almacenaxe de mensaxes para garantires a consistencia e o progreso da execución. Por último, propónse a extensión dunha librería de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria. En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen] El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores. Cada vez se construyen sistemas de computación de altas prestaciones mayores, con más recursos hardware de distintos tipos, lo que hace que las tasas de fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos eficientes resulta indispensable para garantizar que los programas científicos puedan completar su ejecución, evitando además que se dispare el consumo de energía. La técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte de la investigación en este campo se ha centrado en estrategias stop-and-restart para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de programación paralela más populares en supercomputación. Se han implementado protocolos de checkpointing para aplicaciones híbridas MPI-OpenMP y aplicaciones heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria distribuida, se propone una solución de resiliencia que puede ser usada de forma genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje de mensajes para garantizar la consistencia y el progreso de la ejecución. Por último, se propone una extensión de una librería de checkpointing para facilitar la implementación de estrategias de recuperación ad hoc ante corrupciones de memoria. Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract] The rapid increase in the computational demands of science has lead to a pronounced growth in the performance offered by supercomputers. As High Performance Computing (HPC) systems grow larger, including more hardware components of different types, the system's failure rate becomes higher. Efficient fault tolerance techniques are essential not only to ensure the execution completion but also to save energy. Checkpoint/restart is one of the most popular fault tolerance techniques. However, most of the research in this field is focused on stop-and-restart strategies for distributed-memory applications in the event of fail-stop failures. Thís thesis focuses on the implementation of application-level checkpoint/restart solutions for the most popular parallel programming models used in HPC. Hence, we have implemented checkpointing solutions to cope with fail-stop failures in hybrid MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize the restart portability and malleability, ie., the recovery can take place on machines with different CPU / accelerator architectures, and/ or operating systems, and can be adapted to the available resources (number of cores/accelerators). Regarding distributed-memory applications, we propose a resilience solution that can be generally applied to SPMD MPI programs. Resilient applications can detect and react to failures without aborting their execution upon fail-stop failures. Instead, failed processes are re-spawned, and the application state is recovered through a global rollback. Moreover, we have optimized this resilience proposal by implementing a local rollback protocol, in which only failed processes rollback to a previous state, while message logging enables global consistency and further progress of the computation. Finally, we have extended a checkpointing library to facilitate the implementation of ad hoc recovery strategies in the event of soft errors) caused by memory corruptions. Many times, these errors can be handled at the software-Ievel, tIms, avoiding fail-stop failures and enabling a more efficient recovery

    On improving the performance of optimistic distributed simulations

    No full text
    This report investigates means of improving the performance of optimistic distributed simulations without affecting the simulation accuracy. We argue that existing clustering algorithms are not adequate for application in distributed simulations, and outline some characteristics of an ideal algorithm that could be applied in this field. This report is structured as follows. We start by introducing the area of distributed simulation. Following a comparison of the dominant protocols used in distributed simulation, we elaborate on the current approaches of improving the simulation performance, using computation efficient techniques, exploiting the hardware configuration of processors, optimizations that can be derived from the simulation scenario, etc. We introduce the core characteristics of clustering approaches and argue that these cannot be applied in real-life distributed simulation problems. We present a typical distributed simulation setting and elaborate on the reasons that existing clustering approaches are not expected to improve the performance of a distributed simulation. We introduce a prototype distributed simulation platform that has been developed in the scope of this research, focusing on the area of emergency response and specifically building evacuation. We continue by outlining our current work on this issue, and finally, we end this report by outlining next actions which could be made in this field

    Cooperative Data Backup for Mobile Devices

    Get PDF
    Les dispositifs informatiques mobiles tels que les ordinateurs portables, assistants personnels et téléphones portables sont de plus en plus utilisés. Cependant, bien qu'ils soient utilisés dans des contextes où ils sont sujets à des endommagements, à la perte, voire au vol, peu de mécanismes permettent d'éviter la perte des données qui y sont stockées. Dans cette thèse, nous proposons un service de sauvegarde de données coopératif pour répondre à ce problème. Cette approche tire parti de communications spontanées entre de tels dispositifs, chaque dispositif stockant une partie des données des dispositifs rencontrés. Une étude analytique des gains de cette approche en termes de sûreté de fonctionnement est proposée. Nous étudions également des mécanismes de stockage réparti adaptés. Les problèmes de coopération entre individus mutuellement suspicieux sont également abordés. Enfin, nous décrivons notre mise en oeuvre du service de sauvegarde coopérative. ABSTRACT : Mobile devices such as laptops, PDAs and cell phones are increasingly relied on but are used in contexts that put them at risk of physical damage, loss or theft. However, few mechanisms are available to reduce the risk of losing the data stored on these devices. In this dissertation, we try to address this concern by designing a cooperative backup service for mobile devices. The service leverages encounters and spontaneous interactions among participating devices, such that each device stores data on behalf of other devices. We first provide an analytical evaluation of the dependability gains of the proposed service. Distributed storage mechanisms are explored and evaluated. Security concerns arising from thecooperation among mutually suspicious principals are identified, and core mechanisms are proposed to allow them to be addressed. Finally, we present our prototype implementation of the cooperative backup servic

    Efficient Passive Clustering and Gateways selection MANETs

    Get PDF
    Passive clustering does not employ control packets to collect topological information in ad hoc networks. In our proposal, we avoid making frequent changes in cluster architecture due to repeated election and re-election of cluster heads and gateways. Our primary objective has been to make Passive Clustering more practical by employing optimal number of gateways and reduce the number of rebroadcast packets

    A disk-resident suffix tree index and generic framework for managing tunable indexes

    Get PDF
    This thesis introduces two related technologies. The first is a disk-resident index for biological sequence data, and the second is a framework and toolkit for the management of operational parameters for applications of which this index is typical. The Top-Compressed Suffix Tree is a novel data structure that can be used to provide a scalable, disk-resident index for large sequences. This data structure is based on the suffix tree, but has been designed to overcome the problems associated with using such structures on secondary memory. Top-Compressed Suffix Trees can be constructed incrementally, allowing indexes to be created that are larger than the amount of available main memory. Correspondingly, querying such an index only requires part of the data structure to be resident in main memory, thus allowing support for on-demand faulting and eviction of index sections during search. Such an index may be of great benefit to scientists requiring efficient access to vast repositories of genomic data. The Generic Index Development and Operation Framework (GIDOF) is a framework and toolkit that supports various tasks relating to the management of operational parameters. The performance of an index's implementation is typically influenced by several operational parameters parameters that must be tuned carefully if optimum performance is to be obtained. Indexes implemented using GIDOF can be structured in such a way that values of selected operational parameters can be adjusted; resulting in an index implementation that can be tuned to suit a given workload or system environment. This thesis presents a detailed description of the design of both the Top-Compressed Suffix Tree and the algorithms that operate over it. Extensive performance measurements are then presented and discussed, covering such aspects of index performance as construction time, average query performance and the size of the completed index. An overview of the GIDOF parameter model and toolkit is then given together with examples of how this framework can be used to manage tunable indexes, such as the Top-Compressed Suffix Tree

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    Preserving the Quality of Architectural Tactics in Source Code

    Get PDF
    In any complex software system, strong interdependencies exist between requirements and software architecture. Requirements drive architectural choices while also being constrained by the existing architecture and by what is economically feasible. This makes it advisable to concurrently specify the requirements, to devise and compare alternative architectural design solutions, and ultimately to make a series of design decisions in order to satisfy each of the quality concerns. Unfortunately, anecdotal evidence has shown that architectural knowledge tends to be tacit in nature, stored in the heads of people, and lost over time. Therefore, developers often lack comprehensive knowledge of underlying architectural design decisions and inadvertently degrade the quality of the architecture while performing maintenance activities. In practice, this problem can be addressed through preserving the relationships between the requirements, architectural design decisions and their implementations in the source code, and then using this information to keep developers aware of critical architectural aspects of the code. This dissertation presents a novel approach that utilizes machine learning techniques to recover and preserve the relationships between architecturally significant requirements, architectural decisions and their realizations in the implemented code. Our approach for recovering architectural decisions includes the two primary stages of training and classification. In the first stage, the classifier is trained using code snippets of different architectural decisions collected from various software systems. During this phase, the classifier learns the terms that developers typically use to implement each architectural decision. These ``indicator terms\u27\u27 represent method names, variable names, comments, or the development APIs that developers inevitably use to implement various architectural decisions. A probabilistic weight is then computed for each potential indicator term with respect to each type of architectural decision. The weight estimates how strongly an indicator term represents a specific architectural tactics/decisions. For example, a term such as \emph{pulse} is highly representative of the heartbeat tactic but occurs infrequently in the authentication. After learning the indicator terms, the classifier can compute the likelihood that any given source file implements a specific architectural decision. The classifier was evaluated through several different experiments including classical cross-validation over code snippets of 50 open source projects and on the entire source code of a large scale software system. Results showed that classifier can reliably recognize a wide range of architectural decisions. The technique introduced in this dissertation is used to develop the Archie tool suite. Archie is a plug-in for Eclipse and is designed to detect wide range of architectural design decisions in the code and to protect them from potential degradation during maintenance activities. It has several features for performing change impact analysis of architectural concerns at both the code and design level and proactively keep developers informed of underlying architectural decisions during maintenance activities. Archie is at the stage of technology transfer at the US Department of Homeland Security where it is purely used to detect and monitor security choices. Furthermore, this outcome is integrated into the Department of Homeland Security\u27s Software Assurance Market Place (SWAMP) to advance research and development of secure software systems

    Optimisation massivement multi-tâche sur grappes de calcul hétérogènes – Application aux problèmes de permutation

    Get PDF
    Branch-and-Bound (B&B) is a frequently used tree-search exploratory method for the exact resolution of combinatorial optimization problems (COPs). However, in practice, only small problem instances can be solved on a sequential computer, as B&B generates often generates a huge amount of subproblems to be evaluated. In order to solve large COPs, we revisit the design and implementation of massively parallel B&B on top of large heterogeneous clusters, integrating multi-core CPUs, many-core processors and GPUs.For the efficient storage and management of subproblems an original data structure (IVM) dedicated to permutation problems is used. Because of the highly irregular and unpredictable shape of the B&B tree, dynamic load balancing between parallel exploration processes is one of the main issues addressed in this thesis. Based on a compact encoding of the search space in the form of intervals, work stealing strategies for multi-core and GPU are proposed, as well as hierarchical approaches for load balancing in distributed memory multi-CPU/multi-GPU systems. Three permutation problems, the Flowshop Scheduling Problem (FSP), the Quadratic Assignment Problem (QAP) and the n-Queens puzzle problem are used as test-cases.The resolution, in 9 hours, of a FSP instance with an estimated sequential execution time of 22 years demonstrates the scalability of the proposed algorithms on a cluster composed of 36 GPUs.L'algorithme Branch-and-Bound (B&B) est une méthode de recherche arborescente fréquemment utilisé pour la résolution exacte de problèmes d'optimisation combinatoire (POC). Néanmoins, seules des petites instances peuvent être effectivement résolues sur une machine séquentielle, le nombre de sous-problèmes à évaluer étant souvent très grand. Visant la resolution de POC de grande taille, nous réexaminons la conception et l'implémentation d'algorithmes B&B massivement parallèles sur de larges plateformes hétérogènes de calcul, intégrant des processeurs multi-coeurs, many-cores et et processeurs graphiques (GPUs). Pour une représentation compacte en mémoire des sous-problèmes une structure de données originale (IVM), dédiée aux problèmes de permutation est utilisée. En raison de la forte irrégularité de l'arbre de recherche, l'équilibrage de charge dynamique entre processus d'exploration parallèles occupe une place centrale dans cette thèse. Basés sur un encodage compact de l'espace de recherche sous forme d'intervalles, des stratégies de vol de tâches sont proposées pour processeurs multi-core et GPU, ainsi une approche hiérarchique pour l'équilibrage de charge dans les systèmes multi-GPU et multi-CPU à mémoire distribuée. Trois problèmes d'optimisation définis sur l'ensemble des permutations, le problème d'ordonnancement Flow-Shop (FSP), d'affectation quadratique (QAP) et le problème des n-dames sont utilisés comme cas d'étude. La resolution en 9 heures d'une instance du FSP dont le temps de résolution séquentiel est estimé à 22 ans demontre la capacité de passage à l'échelle des algorithmes proposés sur une grappe de calcul composé de 36 GPUs

    Timing Predictability in Future Multi-Core Avionics Systems

    Full text link

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume

    Get PDF
    LIPIcs, Volume 248, ISAAC 2022, Complete Volum
    corecore