    A Pattern Language for High-Performance Computing Resilience

    High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They achieve extreme performance, on the order of a quadrillion floating-point operations per second, by aggregating the power of millions of compute, memory, networking, and storage components. With the rapidly growing scale and complexity of HPC systems in pursuit of even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems, and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment, make the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors, and failures that have been devised, applied, and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.
    Comment: Proceedings of the 22nd European Conference on Pattern Languages of Programs
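
    To make the idea of a codified resilience pattern concrete, here is a minimal sketch, not taken from the paper, of one of the best-known techniques in this space, checkpoint-and-rollback, expressed as a reusable Python wrapper. All names (Checkpointing, run_resilient, the RuntimeError standing in for a detected error) are hypothetical illustration.

        # Illustrative sketch only: the checkpoint-and-rollback resilience
        # pattern as a reusable wrapper around an iterative computation.
        import copy

        class Checkpointing:
            """Periodically snapshot application state; roll back on error."""

            def __init__(self, interval):
                self.interval = interval        # iterations between checkpoints
                self.snapshot = None

            def save(self, state):
                self.snapshot = copy.deepcopy(state)

            def restore(self):
                return copy.deepcopy(self.snapshot)

        def run_resilient(step, state, iterations, pattern, max_rollbacks=10):
            """Drive an iterative computation under the checkpointing pattern."""
            pattern.save(state)
            i, rollbacks = 0, 0
            while i < iterations:
                try:
                    state = step(state, i)
                    if (i + 1) % pattern.interval == 0:
                        pattern.save(state)     # commit progress
                    i += 1
                except RuntimeError:            # stand-in for a detected soft error
                    rollbacks += 1
                    if rollbacks > max_rollbacks:
                        raise                   # escalate: pattern cannot recover
                    state = pattern.restore()   # roll back to the last checkpoint
                    i = (i // pattern.interval) * pattern.interval
            return state

        # Usage: sum integers, with the pattern absorbing injected failures.
        # result = run_resilient(lambda s, i: s + i, 0, 100, Checkpointing(10))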

    Replicode: A Constructivist Programming Paradigm and Language

    Replicode is a language designed to encode short parallel programs and executable models, and is centered on the notions of extensive pattern matching and dynamic code production. The language is domain-independent and has been designed to build systems that are model-based and model-driven, as production systems that can modify their own code. Moreover, Replicode supports the distribution of knowledge and computation across clusters of computing nodes. This document describes Replicode and its executive, i.e., the system that executes Replicode constructions. The Replicode executive is meant to run on 64-bit Linux and 32/64-bit Windows 7 platforms and to interoperate with custom C++ code. The motivations for the Replicode language, the constructivist paradigm it rests on, and the higher-level AI goals targeted by its construction are described by Thórisson (2012), Nivel and Thórisson (2009), and Thórisson and Nivel (2009a, 2009b). An overview presents the main concepts of the language. Section 3 describes the general structure of Replicode objects and describes pattern matching. Section 4 describes the execution model of Replicode, and section 5 describes how computation and knowledge are structured and controlled. Section 6 describes the high-level reasoning facilities offered by the system. Finally, section 7 describes how computation is distributed over a cluster of computing nodes. Consult Annex 1 for a formal definition of Replicode, Annex 2 for a specification of the executive, Annex 3 for the specification of the executable code format (r-code) and its C++ API, and Annex 4 for the definition of the Replicode Extension C++ API.
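
    The two notions the abstract centers on, extensive pattern matching and dynamic code production, can be illustrated with a toy forward-chaining production system. This is a Python sketch of the general mechanism only; it is not Replicode syntax, and every name in it is invented for the example.

        # Illustrative sketch only: a toy production system in which firing
        # a rule can install further rules (dynamic code production).

        def match(pattern, fact):
            """A wildcard ('?') in the pattern matches any field of the fact."""
            return len(pattern) == len(fact) and all(
                p == "?" or p == f for p, f in zip(pattern, fact))

        def run(facts, rules, steps=10):
            for _ in range(steps):
                new_facts, new_rules = set(), []
                for pattern, action in rules:
                    for fact in list(facts):
                        if match(pattern, fact):
                            produced_facts, produced_rules = action(fact)
                            new_facts |= set(produced_facts)
                            new_rules += produced_rules
                known = {p for p, _ in rules}
                fresh_rules = [r for r in new_rules if r[0] not in known]
                if new_facts <= facts and not fresh_rules:
                    break                       # quiescence reached
                facts |= new_facts
                rules += fresh_rules            # the system extends itself
            return facts

        # A rule whose firing installs a further rule.
        def on_sensed(fact):
            follow_up = (("goal", fact[1]),
                         lambda f: ([("act", f[1])], []))
            return [("goal", fact[1])], [follow_up]

        facts = {("sensed", "obstacle")}
        rules = [(("sensed", "?"), on_sensed)]
        print(run(facts, rules))   # facts now include ('goal', ...) and ('act', ...)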

    Robustness-Driven Resilience Evaluation of Self-Adaptive Software Systems

    An increasingly important requirement for certain classes of software-intensive systems is the ability to self-adapt their structure and behavior at run-time when reacting to changes that may occur to the system, its environment, or its goals. A major challenge related to self-adaptive software systems is the ability to provide assurances of their resilience when facing changes. Since, in these systems, the components that act as controllers of a target system incorporate highly complex software, there is a need to analyze the impact that controller failures might have on the services delivered by the system. In this paper, we present a novel approach for evaluating the resilience of self-adaptive software systems by applying robustness testing techniques to the controller to uncover failures that can affect system resilience. The approach, which is based on probabilistic model checking, quantifies the probability of satisfaction of system properties when the target system is subject to controller failures. The feasibility of the proposed approach is evaluated in the context of an industrial middleware system used to monitor and manage highly populated networks of devices, which was implemented using the Rainbow framework for architecture-based self-adaptation.
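
    The quantity at the heart of the approach, the probability that a system property still holds when controller failures are injected, is computed in the paper with probabilistic model checking; the following toy Monte Carlo simulation estimates the same kind of figure for an invented system, purely to make the metric concrete. All parameters and the latency model are hypothetical.

        # Illustrative sketch only: estimate P(property holds) under injected
        # controller failures by simulation rather than model checking.
        import random

        def simulate(run_length, failure_prob, repair_steps):
            """Return True iff latency never exceeds the bound of 100.

            While the controller is down, the target system drifts and
            latency grows; once repaired, the controller adapts it back.
            """
            latency, down_for = 10.0, 0
            for _ in range(run_length):
                if down_for == 0 and random.random() < failure_prob:
                    down_for = repair_steps             # inject a controller failure
                if down_for > 0:
                    latency *= 1.2                      # unmanaged drift
                    down_for -= 1
                else:
                    latency = max(10.0, latency * 0.7)  # controller adapts back
                if latency > 100.0:
                    return False                        # property violated
            return True

        random.seed(0)
        trials = 10_000
        ok = sum(simulate(200, 0.02, 5) for _ in range(trials))
        print(f"P(latency bound holds) ~ {ok / trials:.3f}")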

    Engineering Resilient Collective Adaptive Systems by Self-Stabilisation

    Collective adaptive systems are an emerging class of networked computational systems particularly suited to application domains such as smart cities, complex sensor networks, and the Internet of Things. These systems tend to feature large scale, heterogeneity of communication model (including opportunistic peer-to-peer wireless interaction), and inherent requirements for self-adaptiveness to address unforeseen changes in operating conditions. In this context, it is extremely difficult (if not seemingly intractable) to engineer reusable pieces of distributed behaviour so as to make them provably correct and smoothly composable. Building on the field calculus, a computational model (and associated toolchain) capturing the notion of aggregate network-level computation, we address this problem with an engineering methodology coupling formal theory and computer simulation. On the one hand, functional properties are addressed by identifying the largest-to-date field calculus fragment generating self-stabilising behaviour, guaranteed to eventually attain a correct and stable final state despite any transient perturbation in state or topology, and including highly reusable building blocks for information spreading, aggregation, and time evolution. On the other hand, dynamical properties are addressed by simulation, empirically evaluating the different performances that can be obtained by switching between implementations of building blocks with provably equivalent functional properties. Overall, our methodology sheds light on how to identify core building blocks of collective behaviour, and how to select implementations that improve system performance while leaving overall system function and resilience properties unchanged.
    Comment: To appear in ACM Transactions on Modeling and Computer Simulation
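
    One of the building blocks the abstract names, information spreading, is classically realized in the field calculus as a gradient: every node repeatedly recomputes its distance estimate from its neighbours' values, so any transient perturbation of state is eventually repaired. The following sketch simulates that behaviour on an invented five-node network; it illustrates the self-stabilisation property only and does not use the field calculus toolchain.

        # Illustrative sketch only: a self-stabilising gradient (hop-count
        # distance field) computed by repeated local re-evaluation.
        INF = float("inf")

        def gradient_round(dist, neighbours, sources):
            """One synchronous round: sources output 0, others 1 + min neighbour."""
            return {
                n: 0.0 if n in sources
                else min((dist[m] + 1.0 for m in neighbours[n]), default=INF)
                for n in dist
            }

        def stabilise(neighbours, sources, dist=None, rounds=20):
            dist = dist or {n: INF for n in neighbours}
            for _ in range(rounds):
                dist = gradient_round(dist, neighbours, sources)
            return dist

        # A 5-node line: 0 - 1 - 2 - 3 - 4, with node 0 as the source.
        line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
        dist = stabilise(line, {0})
        print(dist)                        # {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}

        # Perturb the state arbitrarily; the field returns to the same values.
        dist[2] = 99.0
        print(stabilise(line, {0}, dist))  # same fixed point as above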

    Lurching Toward Chernobyl: Dysfunctions of Real-Time Computation

    Cognitive biological structures, social organizations, and computing machines operating in real time are subject to Rate Distortion Theorem constraints driven by the homology between information source uncertainty and free energy density. This exposes the unitary structure/environment system to a relentless entropic torrent compounded by sudden large deviations, causing increased distortion between intent and impact, particularly as demands escalate. The phase transitions characteristic of information phenomena suggest that, rather than decaying gracefully under increasing load, these structures will undergo punctuated degradation akin to spontaneous symmetry breaking in physical systems. Rate distortion problems, which also affect internal structural dynamics, can become synergistic with limitations equivalent to the inattentional blindness of natural cognitive process. These mechanisms, and their interactions, are unlikely to scale well, so that, depending on architecture, enlarging the structure or its duties may lead to a crossover point at which added resources must be almost entirely devoted to ensuring system stability -- a form of allometric scaling familiar from biological examples. This suggests a critical need to tune architecture to problem type and system demand. A real-time computational structure and its environment are a unitary phenomenon, and environments are usually idiosyncratic. Thus the resulting path dependence in the development of pathology could often require an individualized approach to remediation, more akin to an arduous psychiatric intervention than to the traditional engineering or medical quick fix. Failure to recognize the depth of these problems seems likely to produce a relentless chain of the Chernobyl-like failures that are necessary, but often insufficient, for remediation under our system.
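
    For reference, the two textbook objects this argument leans on, stated in their standard forms (standard information theory and statistical mechanics, not taken from the paper): the rate-distortion function, and the parallel limit structure of source uncertainty and free energy density (up to physical constants) that underlies the claimed homology.

        % Rate-distortion function: the minimum achievable information rate
        % at average distortion at most D (standard form).
        R(D) = \min_{p(\hat{x}\mid x)\,:\ \mathbb{E}[d(X,\hat{X})]\le D} I(X;\hat{X})

        % Formal parallel between source uncertainty (N(n) typical sequences
        % of length n) and free energy density (partition function Z over
        % volume V), stated up to constants.
        H = \lim_{n\to\infty} \frac{\log N(n)}{n},
        \qquad
        F = \lim_{V\to\infty} -\frac{\log Z(V)}{V}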