DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-free and fault-free system would heavily, even prohibitively, impact the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
Autonomous Fault Detection in Self-Healing Systems using Restricted Boltzmann Machines
Autonomously detecting and recovering from faults is one approach for
reducing the operational complexity and costs associated with managing
computing environments. We present a novel methodology for autonomously
generating investigation leads that help identify system faults, and extend
our previous work in this area by leveraging Restricted Boltzmann Machines
(RBMs) and contrastive divergence learning to analyse changes in historical
feature data. This allows us to heuristically identify the root cause of a
fault, and demonstrate an improvement to the state of the art by showing that
feature data can be predicted heuristically beyond a single instance to include
entire sequences of information. Comment: Published and presented at the 11th IEEE International Conference and
Workshops on Engineering of Autonomic and Autonomous Systems (EASe 2014)
Policy Enforcement with Proactive Libraries
Software libraries implement APIs that deliver reusable functionalities. To
correctly use these functionalities, software applications must satisfy certain
correctness policies, for instance policies about the order in which some API
methods can be invoked and about the values that can be used for the parameters. If
these policies are violated, applications may produce misbehaviors and failures
at runtime. Although this problem is general, applications that incorrectly use
API methods are more frequent in certain contexts. For instance, Android
provides a rich and rapidly evolving set of APIs that might be used incorrectly
by app developers who often implement and publish faulty apps in the
marketplaces. To mitigate this problem, we introduce the novel notion of
proactive library, which augments classic libraries with the capability of
proactively detecting and healing misuses at runtime. Proactive libraries
blend libraries with multiple proactive modules that collect data, check the
correctness policies of the libraries, and heal executions as soon as the
violation of a correctness policy is detected. The proactive modules can be
activated or deactivated at runtime by the users and can be implemented without
requiring any change to the original library and any knowledge about the
applications that may use the library. We evaluated proactive libraries in the
context of the Android ecosystem. Results show that proactive libraries can
automatically overcome several problems related to bad resource usage at the
cost of a small overhead. Comment: O. Riganelli, D. Micucci and L. Mariani, "Policy Enforcement with
Proactive Libraries" 2017 IEEE/ACM 12th International Symposium on Software
Engineering for Adaptive and Self-Managing Systems (SEAMS), Buenos Aires,
Argentina, 2017, pp. 182-19
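As a rough illustration of the idea in the abstract above, the following Python sketch wraps a library class with a proactive module that checks one correctness policy (a resource must be released before the application pauses) and heals the violation when it is detected. Every name here (`Camera`, `ProactiveCamera`, `on_pause`) is hypothetical and chosen for illustration only; this is not the authors' implementation and not an Android API.

```python
class Camera:
    """Hypothetical library class: a resource that is opened and later released."""

    def __init__(self):
        self.is_open = False

    def open(self):
        self.is_open = True

    def release(self):
        self.is_open = False


class ProactiveCamera:
    """Hypothetical proactive module that wraps Camera without modifying it."""

    def __init__(self, camera, enabled=True):
        self._camera = camera
        self.enabled = enabled  # the module can be switched on or off at runtime
        self.healed = []        # record of healing actions that were applied

    def open(self):
        self._camera.open()

    def release(self):
        self._camera.release()

    def on_pause(self):
        """Assumed policy: the camera must not stay open once the app is paused."""
        if self.enabled and self._camera.is_open:
            # Policy violated: heal by releasing the resource on the app's behalf.
            self._camera.release()
            self.healed.append("released camera on pause")


# Usage: the application forgets to release the camera before pausing.
cam = Camera()
proactive = ProactiveCamera(cam)
proactive.open()
proactive.on_pause()
```

The wrapper leaves `Camera` untouched, which mirrors the abstract's claim that proactive modules need no change to the original library; setting `enabled=False` corresponds to deactivating the module at runtime.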
Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications
Many scientific problems require multiple distinct computational tasks to be
executed in order to achieve a desired solution. We introduce the Ensemble
Toolkit (EnTK) to address the challenges of scale, diversity and reliability
they pose. We describe the design and implementation of EnTK, characterize its
performance and integrate it with two distinct exemplar use cases: seismic
inversion and adaptive analog ensembles. We perform nine experiments,
characterizing EnTK overheads, strong and weak scalability, and the performance
of two use case implementations, at scale and on production infrastructures. We
show how EnTK meets the following general requirements: (i) dedicated
abstractions to support the description and execution of ensemble
applications; (ii) support for execution on heterogeneous computing
infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv)
fault tolerance. We discuss novel computational capabilities that EnTK enables
and the scientific advantages arising from them. We propose EnTK as an important
addition to the suite of tools in support of production scientific computing.