67 research outputs found

    Heterogeneity aware fault tolerance for extreme scale computing

    Upcoming Extreme Scale, or Exascale, Computing Systems are expected to deliver a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily through significant expansion in scale. A major concern for such large-scale systems, however, is how to deal with failures, because the impact of failures on system efficiency under existing fault tolerance techniques generally also increases with scale. Hence, current research effort in this area has been directed at optimizing various aspects of fault tolerance techniques to reduce their overhead at scale. One characteristic that has been overlooked so far, however, is heterogeneity: specifically, heterogeneity in the rate at which individual components of the underlying system fail and in the execution profile of a parallel application running on such a system. In this thesis, we investigate the implications of these types of heterogeneity for fault tolerance in large-scale high performance computing (HPC) systems. To that end, we 1) study how knowledge of heterogeneity in system failure likelihoods can be utilized to make current fault tolerance schemes more efficient, 2) assess the feasibility of utilizing application imbalance for improved fault tolerance at scale, and 3) propose and evaluate changes to system-level resource managers in order to achieve reliable job placement over resources with unequal failure likelihoods. The results in this thesis, taken together, demonstrate that heterogeneity in failure likelihoods significantly changes the landscape of fault tolerance for large-scale HPC systems.
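
    As a minimal illustration of the first point, the sketch below applies Young's first-order approximation for the checkpoint interval, tau ~ sqrt(2 * C * M), to a job whose nodes have unequal MTBFs. The node names, MTBF values, and checkpoint cost are hypothetical, and the thesis's actual schemes are more elaborate.

        import math

        def optimal_interval(checkpoint_cost_s, mtbf_s):
            # Young's first-order approximation of the optimal checkpoint interval.
            return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

        # Hypothetical per-node MTBFs in seconds: failure likelihoods differ across nodes.
        node_mtbf_s = {"node-a": 30 * 24 * 3600, "node-b": 7 * 24 * 3600, "node-c": 2 * 24 * 3600}
        checkpoint_cost_s = 120.0  # time to write one checkpoint

        def job_mtbf_s(nodes):
            # A job fails when any of its nodes fails, so the nodes' failure rates add up.
            return 1.0 / sum(1.0 / node_mtbf_s[n] for n in nodes)

        for nodes in (["node-a", "node-b"], ["node-b", "node-c"]):
            m = job_mtbf_s(nodes)
            tau = optimal_interval(checkpoint_cost_s, m)
            print(nodes, "job MTBF %.1f h," % (m / 3600), "interval %.1f min" % (tau / 60))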

    Reliability for exascale computing: system modelling and error mitigation for task-parallel HPC applications

    As high performance computing (HPC) systems continue to grow, their fault rate increases, and applications running on these systems have to deal with failures occurring on the order of every few hours or days. Furthermore, some studies for future Exascale systems predict failure intervals on the order of minutes. As a result, efficient fault tolerance solutions are needed to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient, and highly scalable. It should have low overhead in fault-free execution and provide fast restart, because long-running applications are expected to experience many faults during execution. Meanwhile, task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale; for instance, task-based dataflow parallelism has been adopted in OpenMP 4.0, the OmpSs PM, Argobots, and Intel Threading Building Blocks. In this thesis we propose fault tolerance solutions for task-parallel dataflow HPC applications. Specifically, we first design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Taken together, our schemes provide fairly high failure coverage in which both computation and memory are protected against errors.
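
    As an illustration of the selective task replication idea only (not the thesis's runtime implementation), the sketch below re-executes a task selected for replication and compares the two results; a mismatch is treated as a silent data corruption, and a third execution serves as a tie-breaker. All names are hypothetical.

        from typing import Any, Callable

        def run_replicated(task: Callable[..., Any], *args, replicate: bool = True):
            # Execute a task; if it was selected for replication, run it twice and compare.
            # A mismatch is interpreted as a silent data corruption (SDC); a third
            # execution acts as a majority-vote tie-breaker.
            first = task(*args)
            if not replicate:
                return first
            second = task(*args)
            if first == second:
                return first
            third = task(*args)  # recover by voting among three executions
            return first if first == third else second

        # Usage: e.g. only tasks whose outputs feed many successors might be selected.
        result = run_replicated(lambda x: sum(i * i for i in range(x)), 1000, replicate=True)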

    Resource management for extreme scale high performance computing systems in the presence of failures

    High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale, these systems experience an exponential growth in the number of failures. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of being forced to share resources; in particular, contention among multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) optimal scheduling of applications to HPC nodes and (b) optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for prediction. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures as well as application co-location on large-scale HPC systems. Our analysis of application and system behavior also investigates: the interrelated effects of application network usage and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems.
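
    A hedged sketch of the kind of first-order model used when studying checkpoint interval selection and its sensitivity to system parameters: the expected fraction of time lost to writing checkpoints and to re-computation after failures, as a function of the interval, checkpoint cost, restart cost, and system MTBF. The closed form below is the classical approximation, not the dissertation's full model, and the parameter values are hypothetical.

        def waste_fraction(interval_s, checkpoint_cost_s, mtbf_s, restart_cost_s=0.0):
            # First-order approximation of the wasted-time fraction for periodic checkpointing:
            # checkpoint overhead per interval plus expected rework and restart after a failure.
            return checkpoint_cost_s / interval_s + (interval_s / 2.0 + restart_cost_s) / mtbf_s

        # Sensitivity to system parameters: sweep the interval for a hypothetical system.
        mtbf, ckpt = 4 * 3600.0, 300.0  # 4 h system MTBF, 5 min checkpoint cost
        for tau in (900.0, 1800.0, 3600.0, 7200.0):
            print("interval %4.0f s -> waste %.1f %%" % (tau, 100 * waste_fraction(tau, ckpt, mtbf)))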

    Failure analysis and reliability-aware resource allocation of parallel applications in High Performance Computing systems

    The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large-scale parallel applications naturally becomes a major challenge, because a single node failure interrupts the entire application, and the reliability of job completion decreases as the number of nodes increases. Accurate reliability knowledge of an HPC system enables runtime systems, such as resource managers, and applications to minimize performance loss due to random failures while also providing better Quality of Service (QoS) for computational users. This dissertation makes three major contributions to reliability evaluation and resource management in HPC systems. First, we study the failure properties of HPC systems and observe that the Times To Failure (TTFs) of individual compute nodes follow a time-varying failure-rate distribution such as the Weibull distribution. We then propose a model for the TTF distribution of a system of k independent nodes when individual nodes exhibit time-varying failure rates. Based on the reliability of the proposed TTF model, we develop reliability-aware resource allocation algorithms and evaluate them on actual parallel workloads and failure data of an HPC system. Our observations indicate that applying a time-varying failure-rate-based reliability function combined with some heuristics reduces the performance loss due to unexpected failures by as much as 30 to 53 percent. Finally, we also study the effect of reliability with respect to the number of nodes and propose a reliability-aware optimal k-node allocation algorithm for large-scale parallel applications. Our simulation results for the optimal k-node algorithm indicate that choosing the number of nodes for large-scale parallel applications based on the reliability of compute nodes can reduce the overall completion time and wasted time, since the optimal k may be smaller than the total number of nodes in the system.
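
    An illustrative sketch of the model described above, with hypothetical parameters: each node's time to failure follows a Weibull distribution with its own shape and scale, the reliability of a k-node allocation over a job's runtime is the product of the individual node reliabilities, and a reliability-aware allocator can pick the k nodes with the highest reliability for that runtime.

        import math

        def weibull_reliability(t_s, shape, scale_s):
            # P(node survives beyond t) for a Weibull time-to-failure distribution.
            return math.exp(-((t_s / scale_s) ** shape))

        def allocation_reliability(t_s, nodes):
            # Reliability of a job on k independent nodes: product of node reliabilities.
            r = 1.0
            for shape, scale in nodes:
                r *= weibull_reliability(t_s, shape, scale)
            return r

        # Hypothetical node fleet: (shape, scale in seconds); shape < 1 models early failures.
        fleet = [(0.7, 90 * 24 * 3600), (0.9, 60 * 24 * 3600), (0.8, 30 * 24 * 3600), (1.1, 120 * 24 * 3600)]
        runtime = 24 * 3600.0  # 24 h job

        # Reliability-aware allocation: choose the k most reliable nodes for this runtime.
        k = 2
        best = sorted(fleet, key=lambda n: weibull_reliability(runtime, *n), reverse=True)[:k]
        print("allocation reliability: %.4f" % allocation_reliability(runtime, best))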

    Substituting Failure Avoidance for Redundancy in Storage Fault Tolerance

    The primary mechanism for overcoming faults in modern storage systems is to introduce redundancy in the form of replication and error correcting codes. The costs of such redundancy in hardware, system availability, and overall complexity can be substantial, depending on the number and pattern of faults that are handled. This dissertation describes and analyzes, via simulation, a system that seeks to use disk failure avoidance to reduce the need for costly redundancy through adaptive heuristics that anticipate such failures. While a number of predictive factors can be used, this research focuses on the three leading candidates: SMART errors, age, and vintage. This approach can predict where near-term disk failures are more likely to occur, enabling proactive movement or replication of at-risk data and thus maintaining data integrity and availability. This strategy can reduce the costs of redundant storage without compromising these important requirements.
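
    A minimal sketch of a failure-avoidance heuristic in the spirit described above; the weights, thresholds, and disk data are hypothetical and not taken from the dissertation. Each disk receives a near-term risk score from its SMART error count, age, and vintage failure rate, and data on disks above a threshold would be proactively moved or re-replicated.

        def risk_score(smart_errors, age_years, vintage_failure_rate):
            # Combine the three predictive factors into a crude near-term failure risk.
            # Weights and thresholds are illustrative only.
            score = 0.0
            score += min(smart_errors, 10) * 0.06      # reallocated/pending sector counts, etc.
            score += max(age_years - 3.0, 0.0) * 0.05  # wear-out beyond ~3 years
            score += vintage_failure_rate * 2.0        # annualized failure rate of this model/batch
            return min(score, 1.0)

        MIGRATION_THRESHOLD = 0.35

        disks = {"d0": (0, 1.5, 0.02), "d1": (6, 4.2, 0.08), "d2": (1, 2.0, 0.15)}
        at_risk = [d for d, args in disks.items() if risk_score(*args) >= MIGRATION_THRESHOLD]
        print("proactively migrate data from:", at_risk)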

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing a larger number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms also need to be energy-efficient and reliable, and to perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers, in terms of state-of-the-art contributions and upcoming trends.

    Dependable Embedded Systems

    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. It presents the most prominent reliability concerns from today's point of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level, such as the circuit level or the system level alone, this book addresses reliability challenges across different levels, starting from the physical level all the way to the system level (cross-layer approaches). It aims to demonstrate how new hardware/software co-design solutions can be proposed to effectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, and soft errors. The book provides readers with the latest insights into novel, cross-layer methods and models with respect to the dependability of embedded systems; describes cross-layer approaches that can leverage reliability through techniques that are proactively designed with respect to techniques at other layers; and explains run-time adaptation and concepts/means of self-organization in order to achieve error resiliency in complex, future many-core systems.

    Efficient resilience analysis and decision-making for complex engineering systems

    Modern societies around the world are increasingly dependent on the smooth functioning of progressively more complex systems, such as infrastructure systems, digital systems like the internet, and sophisticated machinery. They form the cornerstones of our technologically advanced world, and their efficiency is directly related to our well-being and the progress of society. However, these important systems are constantly exposed to a wide range of threats of natural, technological, and anthropogenic origin. The emergence of global crises such as the COVID-19 pandemic and the ongoing threat of climate change have starkly illustrated the vulnerability of these widely ramified and interdependent systems, as well as the impossibility of predicting threats entirely. The pandemic, with its widespread and unexpected impacts, demonstrated how an external shock can bring even the most advanced systems to a standstill, while ongoing climate change continues to produce unprecedented risks to system stability and performance. These global crises underscore the need for systems that can not only withstand disruptions but also recover from them efficiently and rapidly. The concept of resilience and related developments encompass these requirements: analyzing, balancing, and optimizing the reliability, robustness, redundancy, adaptability, and recoverability of systems, from both technical and economic perspectives. This cumulative dissertation therefore focuses on developing comprehensive and efficient tools for resilience-based analysis and decision-making for complex engineering systems. The newly developed resilience decision-making procedure is at the core of these developments. It is based on an adapted systemic risk measure, a time-dependent probabilistic resilience metric, and a grid search algorithm, and it represents a significant innovation in that it enables decision-makers to identify an optimal balance between different types of resilience-enhancing measures while taking monetary aspects into account. Increasingly, system components have significant inherent complexity, requiring them to be modeled as systems themselves, which leads to systems-of-systems with a high degree of complexity. To address this challenge, a novel methodology is derived by extending the previously introduced resilience framework to multidimensional use cases and synergistically merging it with an established concept from reliability theory, the survival signature. The new approach combines the advantages of both original components: a direct comparison of different resilience-enhancing measures from a multidimensional search space, leading to an optimal trade-off in terms of system resilience, and a significant reduction in computational effort due to the separation property of the survival signature. This means that once a subsystem structure has been computed (a typically computationally expensive process), any characterization of the probabilistic failure behavior of components can be validated without having to recompute the structure. In reality, measurements, expert knowledge, and other sources of information are subject to multiple uncertainties. For this purpose, an efficient method based on the combination of the survival signature, fuzzy probability theory, and non-intrusive stochastic simulation (NISS) is proposed. This results in an efficient approach to quantifying the reliability of complex systems while taking the entire uncertainty spectrum into account.
    The new approach, which synergizes the advantageous properties of its original components, achieves a significant decrease in computational effort due to the separation property of the survival signature. In addition, it attains a dramatic reduction in sample size thanks to the adapted NISS method: only a single stochastic simulation is required to account for the uncertainties. The novel methodology not only represents an innovation in the field of reliability analysis but can also be integrated into the resilience framework. For a resilience analysis of existing systems, the consideration of continuous component functionality is essential. This is addressed in a further new development. By introducing the continuous survival function and the concept of the Diagonal Approximated Signature as a corresponding surrogate model, the existing resilience framework can be usefully extended without compromising its fundamental advantages. In the context of the regeneration of complex capital goods, a comprehensive analytical framework is presented to demonstrate the transferability and applicability of all developed methods to complex systems of any type. The framework integrates the previously developed resilience, reliability, and uncertainty analysis methods. It provides decision-makers with a basis for identifying resilient regeneration paths in two respects: first, regeneration paths with inherent resilience, and second, regeneration paths that lead to maximum system resilience, taking into account technical and monetary factors affecting the complex capital good under analysis. In summary, this dissertation offers innovative contributions to efficient resilience analysis and decision-making for complex engineering systems. It presents universally applicable methods and frameworks that are flexible enough to accommodate system types and performance measures of any kind. This is demonstrated in numerous case studies, ranging from arbitrary flow networks and functional models of axial compressors to substructured infrastructure systems with several thousand individual components.
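
    To illustrate the separation property mentioned above, the sketch below evaluates system survival from a precomputed survival signature Phi(l) for m exchangeable components of one type, P(T_sys > t) = sum_l Phi(l) * C(m, l) * R(t)^l * (1 - R(t))^(m - l): the system structure enters only through Phi, so different component failure models can be plugged in without recomputing it. The signature values and failure models are hypothetical.

        from math import comb, exp

        def system_survival(t, phi, component_reliability):
            # Survival signature evaluation for m exchangeable components of one type.
            # phi[l] = probability the system works given exactly l components work;
            # the component failure model enters only through component_reliability(t),
            # so the (expensive) structure phi never needs to be recomputed.
            m = len(phi) - 1
            r = component_reliability(t)
            return sum(phi[l] * comb(m, l) * r**l * (1 - r)**(m - l) for l in range(m + 1))

        # Hypothetical signature of a small 4-component system.
        phi = [0.0, 0.0, 0.5, 1.0, 1.0]

        # Two candidate component failure models, evaluated against the same signature.
        exponential = lambda t: exp(-t / 1000.0)
        weibull = lambda t: exp(-((t / 1200.0) ** 1.5))
        print(system_survival(500.0, phi, exponential), system_survival(500.0, phi, weibull))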