1,431 research outputs found

    Fault tolerance of MPI applications in exascale systems: The ULFM solution

    Get PDF
    [Abstract] The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.Ministerio de Economía y Competitividad and FEDER; TIN2016-75845-PXunta de Galicia; ED431C 2017/04National Science Foundation of the United States; NSF-SI2 #1664142Exascale Computing Project; 17-SC-20-SCHoneywell International, Inc.; DE-NA000352

    Resilient application co-scheduling with processor redistribution

    Get PDF
    Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted to frequent failures, and resilience techniques must be employed to ensure the completion of large applications. Indeed, failures may create severe imbalance between applications, and significantly degrade performance. In this paper, we propose to redistribute the resources assigned to each application upon the striking of failures, in order to minimize the expected completion time of a set of co-scheduled applications. First we introduce a formal model and establish complexity results. When no redistribution is allowed, we can minimize the expected completion time in polynomial time, while the problem becomes NP-complete with redistributions, even in a fault-free context. Therefore, we design polynomial-time heuristics that perform redistributions and account for processor failures. A fault simulator is used to perform extensive simulations that demonstrate the usefulness of redistribution and the performance of the proposed heuristics

    Innovator resilience potential: A process perspective of individual resilience as influenced by innovation project termination

    Get PDF
    Innovation projects fail at an astonishing rate. Yet, the negative effects of innovation project failures on the team members of these projects have been largely neglected in research streams that deal with innovation project failures. After such setbacks, it is vital to maintain or even strengthen project members’ innovative capabilities for subsequent innovation projects. For this, the concept of resilience, i.e. project members’ potential to positively adjust (or even grow) after a setback such as an innovation project failure, is fundamental. We develop the second-order construct of innovator resilience potential, which consists of six components – self-efficacy, outcome expectancy, optimism, hope, self-esteem, and risk propensity – that are important for project members’ potential of innovative functioning in innovation projects subsequent to a failure. We illustrate our theoretical findings by means of a qualitative study of a terminated large-scale innovation project, and derive implications for research and management

    Combining malleability and I/O control mechanisms to enhance the execution of multiple applications

    Get PDF
    This work presents a common framework that integrates CLARISSE, a cross-layer runtime for the I/O software stack, and FlexMPI, a runtime that provides dynamic load balancing and malleability capabilities for MPI applications. This integration is performed both at application level, as libraries executed within the application, as well as at central-controller level, as external components that manage the execution of different applications. We show that a cooperation between both runtimes provides important benefits for overall system performance: first, by means of monitoring, the CPU, communication and I/O performances of all executing applications are collected, providing a holistic view of the complete platform utilization. Secondly, we introduce a coordinated way of using CLARISSE and FlexMPI control mechanisms, based on two different optimization strategies, with the aim of improving both the application I/O and overall system performance. Finally, we present a detailed description of this proposal, as well as an empirical evaluation of the framework on a cluster showing significant performance improvements at both application and wide-platform levels. We demonstrate that with this proposal the overall I/O time of an application can be reduced by up to 49% and the aggregated FLOPS of all running applications can be increased by 10% with respect to the baseline case. (C) 2018 Elsevier Inc. All rights reserved.The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been partially supported by the Spanish “Ministerio de Economia y Competitividad” under the project grant TIN2016-79637-P “Towards Unification of HPC and Big Data paradigms” and EU under the COST Program Action IC1305, Network for Sustainable Ultrascale Computing (NESUS)

    Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques

    Get PDF
    In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that, being standard compliant, an MPI stack must support to provide the necessary fault tolerance guarantees, based on MPI's dynamic process management. Our results, including synthetic benchmarks and applications, reveal low runtime overhead and efficient recovery, demonstrating that the existing MPI standard provided us with sufficient mechanisms to implement an effective and efficient fault-tolerant solution.This research received funding from the European Community’s 7th Framework Programme via the DEEP-ER project under Grant Agreement no. 610476. This work has also been supported by the Spanish Ministry of Science and Innovation (contract TIN2012-34557) and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. The authors thank Jorge Bell´on, from BSC, for his technical support with the Nanos++ internals.Peer ReviewedPostprint (author's final draft

    Application-level Fault Tolerance and Resilience in HPC Applications

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] As necesidades computacionais das distintas ramas da ciencia medraron enormemente nos últimos anos, o que provocou un gran crecemento no rendemento proporcionado polos supercomputadores. Cada vez constrúense sistemas de computación de altas prestacións de maior tamaño, con máis recursos hardware de distintos tipos, o que fai que as taxas de fallo destes sistemas tamén medren. Polo tanto, o estudo de técnicas de tolerancia a fallos eficientes é indispensábel para garantires que os programas científicos poidan completar a súa execución, evitando ademais que se dispare o consumo de enerxía. O checkpoint/restart é unha das técnicas máis populares. Sen embargo, a maioría da investigación levada a cabo nas últimas décadas céntrase en estratexias stop-and-restart para aplicacións de memoria distribuída tralo acontecemento dun fallo-parada. Esta tese propón técnicas checkpoint/restart a nivel de aplicación para os modelos de programación paralela roáis populares en supercomputación. Implementáronse protocolos de checkpointing para aplicacións híbridas MPI-OpenMP e aplicacións heteroxéneas baseadas en OpenCL, en ámbolos dous casos prestando especial coidado á portabilidade e maleabilidade da solución. En canto a aplicacións de memoria distribuída, proponse unha solución de resiliencia que pode ser empregada de forma xenérica en aplicacións MPI SPMD, permitindo detectar e reaccionar a fallos-parada sen abortar a execución. Neste caso, os procesos fallidos vólvense a lanzar e o estado da aplicación recupérase cunha volta atrás global. A maiores, esta solución de resiliencia optimizouse implementando unha volta atrás local, na que só os procesos fallidos volven atrás, empregando un protocolo de almacenaxe de mensaxes para garantires a consistencia e o progreso da execución. Por último, propónse a extensión dunha librería de checkpointing para facilitares a implementación de estratexias de recuperación ad hoc ante conupcións de memoria. En moitas ocasións, estos erros poden ser xestionados a nivel de aplicación, evitando desencadear un fallo-parada e permitindo unha recuperación máis eficiente.[Resumen] El rápido aumento de las necesidades de cómputo de distintas ramas de la ciencia ha provocado un gran crecimiento en el rendimiento ofrecido por los supercomputadores. Cada vez se construyen sistemas de computación de altas prestaciones mayores, con más recursos hardware de distintos tipos, lo que hace que las tasas de fallo del sistema aumenten. Por tanto, el estudio de técnicas de tolerancia a fallos eficientes resulta indispensable para garantizar que los programas científicos puedan completar su ejecución, evitando además que se dispare el consumo de energía. La técnica checkpoint/restart es una de las más populares. Sin embargo, la mayor parte de la investigación en este campo se ha centrado en estrategias stop-and-restart para aplicaciones de memoria distribuida tras la ocurrencia de fallos-parada. Esta tesis propone técnicas checkpoint/restart a nivel de aplicación para los modelos de programación paralela más populares en supercomputación. Se han implementado protocolos de checkpointing para aplicaciones híbridas MPI-OpenMP y aplicaciones heterogéneas basadas en OpenCL, prestando en ambos casos especial atención a la portabilidad y la maleabilidad de la solución. Con respecto a aplicaciones de memoria distribuida, se propone una solución de resiliencia que puede ser usada de forma genérica en aplicaciones MPI SPMD, permitiendo detectar y reaccionar a fallosparada sin abortar la ejecución. En su lugar, se vuelven a lanzar los procesos fallidos y se recupera el estado de la aplicación con una vuelta atrás global. A mayores, esta solución de resiliencia ha sido optimizada implementando una vuelta atrás local, en la que solo los procesos fallidos vuelven atrás, empleando un protocolo de almacenaje de mensajes para garantizar la consistencia y el progreso de la ejecución. Por último, se propone una extensión de una librería de checkpointing para facilitar la implementación de estrategias de recuperación ad hoc ante corrupciones de memoria. Muchas veces, este tipo de errores puede gestionarse a nivel de aplicación, evitando desencadenar un fallo-parada y permitiendo una recuperación más eficiente.[Abstract] The rapid increase in the computational demands of science has lead to a pronounced growth in the performance offered by supercomputers. As High Performance Computing (HPC) systems grow larger, including more hardware components of different types, the system's failure rate becomes higher. Efficient fault tolerance techniques are essential not only to ensure the execution completion but also to save energy. Checkpoint/restart is one of the most popular fault tolerance techniques. However, most of the research in this field is focused on stop-and-restart strategies for distributed-memory applications in the event of fail-stop failures. Thís thesis focuses on the implementation of application-level checkpoint/restart solutions for the most popular parallel programming models used in HPC. Hence, we have implemented checkpointing solutions to cope with fail-stop failures in hybrid MPI-OpenMP applications and OpenCL-based programs. Both strategies maximize the restart portability and malleability, ie., the recovery can take place on machines with different CPU / accelerator architectures, and/ or operating systems, and can be adapted to the available resources (number of cores/accelerators). Regarding distributed-memory applications, we propose a resilience solution that can be generally applied to SPMD MPI programs. Resilient applications can detect and react to failures without aborting their execution upon fail-stop failures. Instead, failed processes are re-spawned, and the application state is recovered through a global rollback. Moreover, we have optimized this resilience proposal by implementing a local rollback protocol, in which only failed processes rollback to a previous state, while message logging enables global consistency and further progress of the computation. Finally, we have extended a checkpointing library to facilitate the implementation of ad hoc recovery strategies in the event of soft errors) caused by memory corruptions. Many times, these errors can be handled at the software-Ievel, tIms, avoiding fail-stop failures and enabling a more efficient recovery

    Resource-aware Programming in a High-level Language - Improved performance with manageable effort on clustered MPSoCs

    Get PDF
    Bis 2001 bedeutete Moores und Dennards Gesetz eine Verdoppelung der Ausführungszeit alle 18 Monate durch verbesserte CPUs. Heute ist Nebenläufigkeit das dominante Mittel zur Beschleunigung von Supercomputern bis zu mobilen Geräten. Allerdings behindern neuere Phänomene wie "Dark Silicon" zunehmend eine weitere Beschleunigung durch Hardware. Um weitere Beschleunigung zu erreichen muss sich auch die Soft­ware mehr ihrer Hardware Resourcen gewahr werden. Verbunden mit diesem Phänomen ist eine immer heterogenere Hardware. Supercomputer integrieren Beschleuniger wie GPUs. Mobile SoCs (bspw. Smartphones) integrieren immer mehr Fähigkeiten. Spezialhardware auszunutzen ist eine bekannte Methode, um den Energieverbrauch zu senken, was ein weiterer wichtiger Aspekt ist, welcher mit der reinen Geschwindigkeit abgewogen werde muss. Zum Beispiel werden Supercomputer auch nach "Performance pro Watt" bewertet. Zur Zeit sind systemnahe low-level Programmierer es gewohnt über Hardware nachzudenken, während der gemeine high-level Programmierer es vorzieht von der Plattform möglichst zu abstrahieren (bspw. Cloud). "High-level" bedeutet nicht, dass Hardware irrelevant ist, sondern dass sie abstrahiert werden kann. Falls Sie eine Java-Anwendung für Android entwickeln, kann der Akku ein wichtiger Aspekt sein. Irgendwann müssen aber auch Hochsprachen resourcengewahr werden, um Geschwindigkeit oder Energieverbrauch zu verbessern. Innerhalb des Transregio "Invasive Computing" habe ich an diesen Problemen gearbeitet. In meiner Dissertation stelle ich ein Framework vor, mit dem man Hochsprachenanwendungen resourcengewahr machen kann, um so die Leistung zu verbessern. Das könnte beispielsweise erhöhte Effizienz oder schnellerer Ausführung für das System als Ganzes bringen. Ein Kerngedanke dabei ist, dass Anwendungen sich nicht selbst optimieren. Stattdessen geben sie alle Informationen an das Betriebssystem. Das Betriebssystem hat eine globale Sicht und trifft Entscheidungen über die Resourcen. Diesen Prozess nennen wir "Invasion". Die Aufgabe der Anwendung ist es, sich an diese Entscheidungen anzupassen, aber nicht selbst welche zu fällen. Die Herausforderung besteht darin eine Sprache zu definieren, mit der Anwendungen Resourcenbedingungen und Leistungsinformationen kommunizieren. So eine Sprache muss ausdrucksstark genug für komplexe Informationen, erweiterbar für neue Resourcentypen, und angenehm für den Programmierer sein. Die zentralen Beiträge dieser Dissertation sind: Ein theoretisches Modell der Resourcen-Verwaltung, um die Essenz des resourcengewahren Frameworks zu beschreiben, die Korrektheit der Entscheidungen des Betriebssystems bezüglich der Bedingungen einer Anwendung zu begründen und zum Beweis meiner Thesen von Effizienz und Beschleunigung in der Theorie. Ein Framework und eine Übersetzungspfad resourcengewahrer Programmierung für die Hochsprache X10. Zur Bewertung des Ansatzes haben wir Anwendungen aus dem High Performance Computing implementiert. Eine Beschleunigung von 5x konnte gemessen werden. Ein Speicherkonsistenzmodell für die X10 Programmiersprache, da dies ein notwendiger Schritt zu einer formalen Semantik ist, die das theoretische Modell und die konkrete Implementierung verknüpft. Zusammengefasst zeige ich, dass resourcengewahre Programmierung in Hoch\-sprachen auf zukünftigen Architekturen mit vielen Kernen mit vertretbarem Aufwand machbar ist und die Leistung verbessert

    Medium access control in wireless network-on-chip: a context analysis

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Wireless on-chip communication is a promising candidate to address the performance and efficiency issues that arise when scaling current NoC techniques to manycore processors. A WNoC can serve global and broadcast traffic with ultra-low latency even in thousand-core chips, thus acting as a natural complement to conventional and throughput-oriented wireline NoCs. However, the development of MAC strategies needed to efficiently share the wireless medium among the increasing number of cores remains a considerable challenge given the singularities of the environment and the novelty of the research area. In this position article, we present a context analysis describing the physical constraints, performance objectives, and traffic characteristics of the on-chip communication paradigm. We summarize the main differences with respect to traditional wireless scenarios, and then discuss their implications on the design of MAC protocols for manycore WNoC, with the ultimate goal of kickstarting this arguably unexplored research area.Peer ReviewedPostprint (author's final draft

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Get PDF
    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management, training, contributing to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience.European Cooperation in Science and Technology. COS

    A Control-Theoretic Design And Analysis Framework For Resilient Hard Real-Time Systems

    Get PDF
    We introduce a new design metric called system-resiliency which characterizes the maximum unpredictable external stresses that any hard-real-time performance mode can withstand. Our proposed systemresiliency framework addresses resiliency determination for real-time systems with physical and hardware limitations. Furthermore, our framework advises the system designer about the feasible trade-offs between external system resources for the system operating modes on a real-time system that operates in a multi-parametric resiliency environment. Modern multi-modal real-time systems degrade the system’s operational modes as a response to unpredictable external stimuli. During these mode transitions, real-time systems should demonstrate a reliable and graceful degradation of service. Many control-theoretic-based system design approaches exist. Although they permit real-time systems to operate under various physical constraints, none of them allows the system designer to predict the system-resiliency over multi-constrained operating environment. Our framework fills this gap; the proposed framework consists of two components: the design-phase and runtime control. With the design-phase analysis, the designer predicts the behavior of the real-time system for variable external conditions. Also, the runtime controller navigates the system to the best desired target using advanced control-theoretic techniques. Further, our framework addresses the system resiliency of both uniprocessor and multicore processor systems. As a proof of concept, we first introduce a design metric called thermal-resiliency, which characterizes the maximum external thermal stress that any hard-real-time performance mode can withstand. We verify the thermal-resiliency for the external thermal stresses on a uniprocessor system through a physical testbed. We show how to solve some of the issues and challenges of designing predictable real-time systems that guarantee hard deadlines even under transitions between modes in an unpredictable thermal environment where environmental temperature may dynamically change using our new metric. We extend the derivation of thermal-resiliency to multicore systems and determine the limitations of external thermal stress that any hard-real-time performance mode can withstand. Our control-theoretic framework allows the system designer to allocate asymmetric processing resources upon a multicore proiii cessor and still maintain thermal constraints. In addition, we develop real-time-scheduling sub-components that are necessary to fully implement our framework; toward this goal, we investigate the potential utility of parallelization for meeting real-time constraints and minimizing energy. Under malleable gang scheduling of implicit-deadline sporadic tasks upon multiprocessors, we show the non-necessity of dynamic voltage/frequency regarding optimality of our scheduling problem. We adapt the canonical schedule for DVFS multiprocessor platforms and propose a polynomial-time optimal processor/frequency-selection algorithm. Finally, we verify the correctness of our framework through multiple measurable physical and hardware constraints and complete our work on developing a generalized framework
    corecore