7,034 research outputs found

    DeSyRe: on-Demand System Reliability

    No full text
    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

    Autonomous Fault Detection in Self-Healing Systems using Restricted Boltzmann Machines

    Get PDF
    Autonomously detecting and recovering from faults is one approach for reducing the operational complexity and costs associated with managing computing environments. We present a novel methodology for autonomously generating investigation leads that help identify systems faults, and extends our previous work in this area by leveraging Restricted Boltzmann Machines (RBMs) and contrastive divergence learning to analyse changes in historical feature data. This allows us to heuristically identify the root cause of a fault, and demonstrate an improvement to the state of the art by showing feature data can be predicted heuristically beyond a single instance to include entire sequences of information.Comment: Published and presented in the 11th IEEE International Conference and Workshops on Engineering of Autonomic and Autonomous Systems (EASe 2014

    Policy Enforcement with Proactive Libraries

    Full text link
    Software libraries implement APIs that deliver reusable functionalities. To correctly use these functionalities, software applications must satisfy certain correctness policies, for instance policies about the order some API methods can be invoked and about the values that can be used for the parameters. If these policies are violated, applications may produce misbehaviors and failures at runtime. Although this problem is general, applications that incorrectly use API methods are more frequent in certain contexts. For instance, Android provides a rich and rapidly evolving set of APIs that might be used incorrectly by app developers who often implement and publish faulty apps in the marketplaces. To mitigate this problem, we introduce the novel notion of proactive library, which augments classic libraries with the capability of proactively detecting and healing misuses at run- time. Proactive libraries blend libraries with multiple proactive modules that collect data, check the correctness policies of the libraries, and heal executions as soon as the violation of a correctness policy is detected. The proactive modules can be activated or deactivated at runtime by the users and can be implemented without requiring any change to the original library and any knowledge about the applications that may use the library. We evaluated proactive libraries in the context of the Android ecosystem. Results show that proactive libraries can automati- cally overcome several problems related to bad resource usage at the cost of a small overhead.Comment: O. Riganelli, D. Micucci and L. Mariani, "Policy Enforcement with Proactive Libraries" 2017 IEEE/ACM 12th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS), Buenos Aires, Argentina, 2017, pp. 182-19

    Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications

    Full text link
    Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two distinct exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing
    corecore