52 research outputs found

    Implicit Actions and Non-blocking Failure Recovery with MPI

    Full text link
    Scientific applications have long embraced MPI as the environment of choice for execution on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate a gap between current practice and the ideal, more asynchronous, recovery model in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints). Accepted at FTXS'22: https://sites.google.com/view/ftxs202
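
    As a point of reference, the sketch below shows the classic blocking shrink-and-agree recovery pattern that this work refines, assuming a ULFM-capable MPI implementation (e.g., Open MPI built with fault tolerance; the MPIX_* calls come from <mpi-ext.h>). The paper's contribution lies in making fault reporting consistent, recovery scoped, and eviction overlappable with application recovery, none of which this minimal sketch shows.

    /* Minimal blocking ULFM recovery loop: run a phase, agree on its
     * outcome, shrink away failed processes on error.  Sketch only. */
    #include <mpi.h>
    #include <mpi-ext.h>   /* MPIX_Comm_agree, MPIX_Comm_shrink (ULFM) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        MPI_Comm comm = MPI_COMM_WORLD;
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        for (int phase = 0; phase < 10; phase++) {
            int x = 1, sum = 0;
            /* Any communication here may fail with MPIX_ERR_PROC_FAILED. */
            int err = MPI_Allreduce(&x, &sum, 1, MPI_INT, MPI_SUM, comm);

            /* Consistent fault reporting: ok stays 1 only if every
             * surviving process saw the phase succeed. */
            int ok = (err == MPI_SUCCESS);
            int rc = MPIX_Comm_agree(comm, &ok);
            if (ok && rc == MPI_SUCCESS) continue;

            /* Evict failed processes and continue with the survivors;
             * the paper overlaps this step with application recovery
             * (e.g., checkpoint reload) instead of blocking on it. */
            MPI_Comm newcomm;
            MPIX_Comm_shrink(comm, &newcomm);
            if (comm != MPI_COMM_WORLD) MPI_Comm_free(&comm);
            comm = newcomm;
            /* ...restore application state here... */
        }
        if (comm != MPI_COMM_WORLD) MPI_Comm_free(&comm);
        MPI_Finalize();
        return 0;
    }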

    Epidemic failure detection and consensus for extreme parallelism

    Get PDF
    Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. The MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms based on gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; detected failures are then disseminated to all fault-free processes in the system, and consensus on the failures is reached using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that stochastic pinging detects all the failures in the system. In all the algorithms, the number of gossip cycles needed to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm that provides consistency guarantees even in very large and extreme-scale systems while remaining memory and bandwidth efficient.
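
    The logarithmic scaling of gossip dissemination is easy to reproduce in a toy simulation. The sketch below (a single-process simulation, not the paper's Extreme-scale Simulator experiments) lets one node detect two failures, standing in for stochastic pinging, and then push-pulls fault bitmaps between random live peers until every live node holds the complete fault set. The node count, seed, and merge rule are illustrative.

    /* Toy simulation of epidemic failure dissemination and consensus. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 64

    /* Consensus here simply means every live node's view matches the
     * true fault set exactly. */
    static int consensus(unsigned char v[N][N], const unsigned char *dead) {
        for (int i = 0; i < N; i++) {
            if (dead[i]) continue;
            for (int j = 0; j < N; j++)
                if (v[i][j] != dead[j]) return 0;   /* view still incomplete */
        }
        return 1;
    }

    int main(void) {
        static unsigned char view[N][N];   /* view[i][j]=1: i believes j failed */
        unsigned char dead[N] = {0};
        srand(1);
        dead[3] = dead[17] = 1;            /* inject two failures */
        view[0][3] = view[0][17] = 1;      /* node 0 detected both (pinging) */

        int cycle = 0;
        while (!consensus(view, dead)) {
            cycle++;
            for (int i = 0; i < N; i++) {
                if (dead[i]) continue;
                int p;
                do { p = rand() % N; } while (p == i || dead[p]);
                for (int j = 0; j < N; j++) {      /* push-pull bitmap merge */
                    unsigned char m = view[i][j] | view[p][j];
                    view[i][j] = view[p][j] = m;
                }
            }
        }
        printf("global consensus after %d gossip cycles (N=%d)\n", cycle, N);
        return 0;
    }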

    Project Final Report: HPC-Colony II

    Full text link

    Hydrodynamics-Biology Coupling for Algae Culture and Biofuel Production

    Get PDF
    Biofuel production from microalgae represents an acute optimization problem for industry. A wide range of parameters must be taken into account in the development of this technology, and here mathematical modelling has a vital role to play. The potential of microalgae as a source of biofuel and as a technological solution for CO2 fixation is the subject of intense academic and industrial research. Large-scale production of microalgae has potential for biofuel applications owing to the high productivity that can be attained in high-rate raceway ponds. We show, through 3D numerical simulations, that our approach is capable of discriminating between situations where the paddle wheel is rapidly moving the water and situations where it is only slowly agitating it. Moreover, the simulated velocity fields can provide Lagrangian trajectories of the algae. The resulting light pattern to which each cell is subjected when travelling from light (surface) to dark (bottom) can then be derived, as sketched below, and reproduced in lab experiments to study photosynthesis under realistic light patterns.
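
    A common way to turn such a trajectory into a light signal is depth-dependent attenuation. The sketch below assumes a simple Beer-Lambert law, I(z) = I0 * exp(-k*z), and a synthetic oscillating trajectory; the paper instead obtains z(t) from its 3D raceway-pond simulation, and the irradiance and attenuation values here are illustrative.

    /* Light signal experienced by a cell moving along a depth
     * trajectory z(t), under assumed Beer-Lambert attenuation. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double PI = 3.141592653589793;
        const double I0 = 2000.0;  /* surface irradiance, assumed value  */
        const double k  = 10.0;    /* attenuation coefficient (1/m), assumed */

        for (int t = 0; t <= 60; t += 5) {
            /* synthetic trajectory: cell cycles between surface and 0.3 m */
            double z = 0.15 * (1.0 - cos(2.0 * PI * t / 60.0));
            printf("t=%2d s  z=%.3f m  I=%7.1f\n", t, z, I0 * exp(-k * z));
        }
        return 0;
    }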

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime

    Full text link
    As HPC systems and applications get bigger and more complex, we are approaching an era in which resiliency and run-time elasticity concerns become paramount. We offer a building block for an alternative resiliency approach in which computations will be able to make progress while components fail, in addition to enabling a dynamic set of nodes throughout a computation's lifetime. The core of our solution is a hierarchical scalable membership service providing eventual consistency semantics. An attribute replication service is used for hierarchy organization and is exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra-large scales. The resulting middleware is general-purpose while exploiting features unique to HPC platforms and their architecture. We have implemented and tested this system on BlueGene/P with Linux and, using worst-case analysis, showed that the service scales effectively to up to 1M nodes.
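
    As a generic illustration of eventually consistent membership state (a textbook merge rule, not the paper's hierarchical attribute-replication design), each node can keep a (heartbeat, status) entry per peer and resolve conflicting views by letting the higher heartbeat win; repeated pairwise merges then drive all replicas toward the same view.

    /* Eventually consistent membership views: higher heartbeat wins. */
    #include <stdio.h>

    #define N 8

    typedef struct { long heartbeat; int alive; } Entry;

    /* Merge a remote view into the local one, entry by entry. */
    static void merge(Entry *local, const Entry *remote) {
        for (int i = 0; i < N; i++)
            if (remote[i].heartbeat > local[i].heartbeat)
                local[i] = remote[i];
    }

    int main(void) {
        Entry a[N] = {0}, b[N] = {0};
        a[0] = (Entry){ .heartbeat = 5, .alive = 1 };
        b[0] = (Entry){ .heartbeat = 9, .alive = 0 };  /* b saw node 0 die later */
        merge(a, b);                                   /* a adopts the newer fact */
        printf("node 0: heartbeat=%ld alive=%d\n", a[0].heartbeat, a[0].alive);
        return 0;
    }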

    Connectivity recovery in epidemic membership protocols

    Get PDF
    Epidemic protocols are a bio-inspired communication and computation paradigm for extreme-scale network systems based on randomized communication. The protocols rely on a membership service to build decentralized and random overlay topologies. In a weakly connected overlay topology, a naive membership mechanism can break connectivity, thus impairing the accuracy of the application. This work investigates the factors in membership protocols that cause the loss of global connectivity and introduces the first topology connectivity recovery mechanism. The mechanism is integrated into the Expander Membership Protocol, which is then evaluated against other membership protocols. The analysis shows that the proposed connectivity recovery mechanism is effective in preserving topology connectivity and also helps to improve application performance in terms of convergence speed.
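
    The failure mode the recovery mechanism targets can be demonstrated with a toy experiment: give each node a small partial view of random peers, kill a fraction of the nodes, and check by BFS whether the surviving overlay is still one component. The sizes, churn rate, and uniform-random views below are illustrative, not the paper's protocol.

    /* Does a partial-view overlay stay connected under churn? */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 200
    #define V 4            /* partial view size (out-degree) */

    static int view[N][V];
    static unsigned char dead[N], seen[N];
    static int queue[N];

    int main(void) {
        srand(7);
        for (int i = 0; i < N; i++)
            for (int k = 0; k < V; k++) {
                int p; do { p = rand() % N; } while (p == i);
                view[i][k] = p;
            }
        for (int i = 0; i < N; i++) dead[i] = (rand() % 100) < 70;  /* 70% churn */

        int start = -1;
        for (int i = 0; i < N && start < 0; i++) if (!dead[i]) start = i;

        /* BFS over the live overlay, treating links as undirected. */
        int head = 0, tail = 0, live = 0;
        seen[start] = 1; queue[tail++] = start;
        while (head < tail) {
            int u = queue[head++];
            for (int k = 0; k < V; k++) {          /* u's outgoing links */
                int w = view[u][k];
                if (!dead[w] && !seen[w]) { seen[w] = 1; queue[tail++] = w; }
            }
            for (int v = 0; v < N; v++)            /* links pointing at u */
                if (!dead[v] && !seen[v])
                    for (int k = 0; k < V; k++)
                        if (view[v][k] == u) { seen[v] = 1; queue[tail++] = v; break; }
        }
        for (int i = 0; i < N; i++) if (!dead[i]) live++;
        printf("live=%d reachable=%d -> %s\n",
               live, tail, tail == live ? "connected" : "PARTITIONED");
        return 0;
    }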

    Robust and efficient membership management in large-scale dynamic networks

    Get PDF
    Epidemic protocols are a bio-inspired communication and computation paradigm for large-scale networked systems based on randomised communication. These protocols rely on a membership service to build decentralised and random overlay topologies. In large-scale, dynamic network environments, node churn and failures may have a detrimental effect on the structure of the overlay topologies, with a negative impact on the efficiency and accuracy of applications. Most importantly, there exists the risk of a permanent loss of global connectivity that would prevent the correct convergence of applications. This work investigates to what extent a dynamic network environment may negatively affect the performance of epidemic membership protocols. A novel Enhanced Expander Membership Protocol (EMP+) based on the expansion properties of graphs is presented. The proposed protocol is evaluated against other membership protocols, and the comparative analysis shows that EMP+ supports faster application convergence and is the first membership protocol to provide robustness against global network connectivity problems.
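
    The "expansion properties" in question can be made concrete with the standard edge-expansion measure: for a subset S of nodes, h(S) = |edges leaving S| / |S|, with good expanders keeping h(S) bounded away from zero for all small S. The sketch below estimates this by sampling random subsets of a random overlay; the graph construction and sample counts are illustrative, not EMP+ itself.

    /* Monte-Carlo estimate of the edge expansion of a random overlay. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 128
    #define K 6            /* overlay out-degree */

    static int nbr[N][K];

    int main(void) {
        srand(3);
        for (int i = 0; i < N; i++)
            for (int k = 0; k < K; k++) {
                int p; do { p = rand() % N; } while (p == i);
                nbr[i][k] = p;
            }

        double worst = 1e9;
        for (int trial = 0; trial < 1000; trial++) {
            unsigned char inS[N] = {0};
            int size = 1 + rand() % (N / 2);       /* sample |S| <= N/2 */
            for (int c = 0; c < size; ) {
                int v = rand() % N;
                if (!inS[v]) { inS[v] = 1; c++; }
            }
            int boundary = 0;                      /* edges from S to outside */
            for (int i = 0; i < N; i++)
                if (inS[i])
                    for (int k = 0; k < K; k++)
                        if (!inS[nbr[i][k]]) boundary++;
            double h = (double)boundary / size;
            if (h < worst) worst = h;
        }
        printf("estimated edge expansion (min over samples): %.3f\n", worst);
        return 0;
    }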

    Scalability in the Presence of Variability

    Get PDF
    Supercomputers are used to solve some of the world's most computationally demanding problems. Exascale systems, comprising over one million cores and capable of 10^18 floating-point operations per second, will probably exist by the early 2020s and will provide unprecedented computational power for parallel computing workloads. Unfortunately, while these machines hold tremendous promise and opportunity for applications in High Performance Computing (HPC), graph processing, and machine learning, it will be a major challenge to fully realize their potential, because doing so requires balanced execution across the entire system and its millions of processing elements. When different processors take different amounts of time to perform the same amount of work, performance imbalance arises, large portions of the system sit idle, and time and energy are wasted. Larger systems incorporate more processors and thus greater opportunity for imbalance to arise, as well as larger performance/energy penalties when it does. This phenomenon is referred to as performance variability and is the focus of this dissertation. In this dissertation, we explain how to design system software to mitigate variability on large-scale parallel machines. Our approaches span (1) the design, implementation, and evaluation of a new high-performance operating system that reduces some classes of performance variability, (2) a new performance evaluation framework to holistically characterize key features of variability on new and emerging architectures, and (3) a distributed modeling framework that derives predictions of how and where imbalance is manifesting in order to drive reactive operations such as load balancing and speed scaling. Collectively, these efforts provide a holistic set of tools to promote scalability through the mitigation of variability.
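
    One simple signal a reactive runtime can act on is the standard imbalance metric lambda = t_max / t_mean - 1, computed over per-rank times for the same phase: lambda = 0 means perfect balance, and lambda roughly equals the fraction of time the average rank spends waiting. The sketch below is illustrative (the sample times and trigger threshold are assumptions, not the dissertation's model).

    /* Quantify per-phase imbalance across ranks and decide whether to react. */
    #include <stdio.h>

    int main(void) {
        double t[8] = {1.02, 0.99, 1.01, 1.00, 1.35, 0.98, 1.03, 1.00};
        int n = 8;
        double max = t[0], sum = 0.0;
        for (int i = 0; i < n; i++) { sum += t[i]; if (t[i] > max) max = t[i]; }

        double lambda = max / (sum / n) - 1.0;  /* time wasted waiting, as a fraction */
        printf("imbalance lambda = %.3f\n", lambda);
        if (lambda > 0.10)                      /* illustrative trigger threshold */
            printf("-> trigger load balancing / speed scaling\n");
        return 0;
    }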

    Fault tolerant decentralized deep neural networks

    Get PDF
    Master's dissertation in Informatics Engineering. Machine Learning is trending in computer science, especially Deep Learning. Training algorithms that follow this approach to Machine Learning routinely deal with vast amounts of data. Processing these enormous quantities of data requires complex computation tasks that can take a long time to produce results. Distributing the computation effort across multiple machines makes sense in this context, as it allows conclusive results to be available in a shorter time frame. Distributing the training of a Deep Neural Network is not a trivial procedure. Various architectures have been proposed, following two different paradigms. The most common one follows a centralized approach, where a centralized entity, broadly named the parameter server, synchronizes and coordinates the updates generated by a number of workers. The alternative discards the centralized unit, assuming a decentralized architecture; synchronization between the multiple workers is assured by communication techniques that average gradients between a node and its peers. High-end clusters are the ideal environment to deploy Deep Learning systems: low latency between nodes assures low idle times for workers, increasing overall system performance. These setups, however, are expensive and available only to a limited number of entities. At the other end, there is a continuous growth of edge devices with potentially vast amounts of available computational resources. In this dissertation, we aim to implement a fault-tolerant decentralized Deep Neural Network training framework, capable of handling the high latency and unreliability characteristic of edge networks. To manage communication between nodes, we employ decentralized aggregation algorithms capable of estimating parameters globally.
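
    The aggregation idea behind such decentralized schemes can be shown in miniature: workers arranged on a ring repeatedly average their local parameter with their two neighbours, converging toward the global mean without any parameter server. Real decentralized SGD averages full gradient or weight tensors and must tolerate stale or unreachable peers; the sketch below only illustrates the mixing step, with illustrative sizes.

    /* Gossip-style parameter averaging on a ring of workers. */
    #include <stdio.h>

    #define W 8    /* number of workers */

    int main(void) {
        double p[W], next[W];
        for (int i = 0; i < W; i++) p[i] = (double)i;   /* divergent local params */

        for (int round = 0; round < 50; round++) {
            for (int i = 0; i < W; i++) {
                int l = (i + W - 1) % W, r = (i + 1) % W;
                next[i] = (p[l] + p[i] + p[r]) / 3.0;   /* neighbourhood average */
            }
            for (int i = 0; i < W; i++) p[i] = next[i];
        }
        printf("worker 0 parameter after mixing: %.4f (global mean %.4f)\n",
               p[0], (W - 1) / 2.0);
        return 0;
    }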