45 research outputs found
Restart-Based Fault-Tolerance: System Design and Schedulability Analysis
Embedded systems in safety-critical environments are continuously required to
deliver more performance and functionality, while expected to provide verified
safety guarantees. Nonetheless, platform-wide software verification (required
for safety) is often expensive. Therefore, design methods that enable
utilization of components such as real-time operating systems (RTOS), without
requiring their correctness to guarantee safety, is necessary.
In this paper, we propose a design approach to deploy safe-by-design embedded
systems. To attain this goal, we rely on a small core of verified software to
handle faults in applications and RTOS and recover from them while ensuring
that timing constraints of safety-critical tasks are always satisfied. Faults
are detected by monitoring the application timing and fault-recovery is
achieved via full platform restart and software reload, enabled by the short
restart time of embedded systems. Schedulability analysis is used to ensure
that the timing constraints of critical plant control tasks are always
satisfied in spite of faults and consequent restarts. We derive schedulability
results for four restart-tolerant task models. We use a simulator to evaluate
and compare the performance of the considered scheduling models
Improving quality of service in application clusters
Quality of service (QoS) requirements, which include availability, integrity, performance and responsiveness are increasingly needed by science and engineering applications. Rising computational demands and data mining present a new challenge in the IT world. As our needs for more processing, research and analysis increase, performance and reliability degrade exponentially. In this paper we present a software system that manages quality of service for Unix based distributed application clusters. Our approach is synthetic and involves intelligent agents that make use of static and dynamic ontologies to monitor, diagnose and correct faults at run time, over a private network. Finally, we provide experimental results from our pilot implementation in a production environment
Recommended from our members
A Uniform Programming Abstraction for Effecting Autonomic Adaptations onto Software Systems
Most general-purpose work towards autonomic or self-managing systems has emphasized the front end of the feedback control loop, with some also concerned with controlling the back end enactment of runtime adaptations -- but usually employing an effector technology peculiar to one type of target system. While completely generic "one size fits all" effector technologies seem implausible, we propose a general purpose programming model and interaction layer that abstracts away from the peculiarities of target specific effectors,enabling a uniform approach to controlling and coordinating the low-level execution of reconfigurations, repairs,micro-reboots, etc
Recommended from our members
A Uniform Programming Abstraction for Effecting Autonomic Adaptations onto Software Systems
Most general-purpose work towards autonomic or self-managing systems has emphasized the front end of the feedback control loop, with some also concerned with controlling the back end enactment of runtime adaptations but usually employing an effector technology peculiar to one type of target system. While completely generic 'one size fits all' effector technologies seem implausible, we propose a general-purpose programming model and interaction layer that abstracts away from the peculiarities of target-specific effectors, enabling a uniform approach to controlling and coordinating the low-level execution of reconfigurations, repairs, micro-reboots, etc
Autonomous Recovery in Componentized Internet Applications
In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recovering from transient and intermittent software failures, without requiring application modifications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection and localization, microrebooting for rapid recovery, and external management of recovery actions. The individual techniques are autonomous and work across a wide range of componentized Internet applications, making them well-suited to the rapidly changing software of Internet services. The proposed framework has been integrated with JBoss, an open-source J2EE application server. Our prototype provides an execution platform that can automatically recover J2EE applications within seconds of the manifestation of a fault. Our system can provide a subset of a system's active end users with the illusion of continuous uptime, in spite of failures occurring behind the scenes, even when there is no functional redundancy in the system
Behind the Last Line of Defense -- Surviving SoC Faults and Intrusions
Today, leveraging the enormous modular power, diversity and flexibility of
manycore systems-on-a-chip (SoCs) requires careful orchestration of complex
resources, a task left to low-level software, e.g. hypervisors. In current
architectures, this software forms a single point of failure and worthwhile
target for attacks: once compromised, adversaries gain access to all
information and full control over the platform and the environment it controls.
This paper proposes Midir, an enhanced manycore architecture, effecting a
paradigm shift from SoCs to distributed SoCs. Midir changes the way platform
resources are controlled, by retrofitting tile-based fault containment through
well known mechanisms, while securing low-overhead quorum-based consensus on
all critical operations, in particular privilege management and, thus,
management of containment domains. Allowing versatile redundancy management,
Midir promotes resilience for all software levels, including at low level. We
explain this architecture, its associated algorithms and hardware mechanisms
and show, for the example of a Byzantine fault tolerant microhypervisor, that
it outperforms the highly efficient MinBFT by one order of magnitude