90,330 research outputs found
Towards Middleware for Fault-tolerance in Distributed Real-time and Embedded Systems
Abstract. Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties require the middleware to provide fault tolerance capabilities that respect time-critical needs of DRE systems. Conventional middleware solutions, such as Fault-tolerant CORBA (FT-CORBA) and Continuous Availability API for J2EE, have limited utility for DRE systems because they are heavyweight (e.g., the complexity of their feature-rich fault tolerance capabilities consumes excessive runtime resources), yet incomplete (e.g., they lack mechanisms that enable fault tolerance while maintaining real-time predictability). This paper provides three contributions to the development and standardization of lightweight real-time and fault-tolerant middleware for DRE systems. First, we discuss the challenges in realizing real-time faulttolerant solutions for DRE systems using contemporary middleware. Second, we describe recent progress towards standardizing a CORBA lightweight fault-tolerance specification for DRE systems. Third, we present the architecture of FLARe, which is a prototype based on the OMG real-time fault-tolerant CORBA middleware standardization efforts that is lightweight (e.g., leverages only those server-and client-side mechanisms required for real-time systems) and predictable (e.g., provides fault-tolerant mechanisms that respect time-critical performance needs of DRE systems)
The embedded operating system project
This progress report describes research towards the design and construction of embedded operating systems for real-time advanced aerospace applications. The applications concerned require reliable operating system support that must accommodate networks of computers. The report addresses the problems of constructing such operating systems, the communications media, reconfiguration, consistency and recovery in a distributed system, and the issues of realtime processing. A discussion is included on suitable theoretical foundations for the use of atomic actions to support fault tolerance and data consistency in real-time object-based systems. In particular, this report addresses: atomic actions, fault tolerance, operating system structure, program development, reliability and availability, and networking issues. This document reports the status of various experiments designed and conducted to investigate embedded operating system design issues
Extending rely-guarantee thinking to handle real-time scheduling
The reference point for developing any artefact is its specification; to develop software for- mally, a formal specification is required. For sequential programs, pre and post conditions (together with abstract objects) suffice; rely and guarantee conditions extend the scope of formal development approaches to tackle concurrency. In addition, real-time systems need ways of both requiring progress and relating that progress to some notion of time. This paper extends rely-guarantee ideas to cope with specifications of—and assumptions about— real-time schedulers. Furthermore it shows how the approach helps identify and specify fault-tolerance aspects of such schedulers by systematically challenging the assumption
Energy-aware Fault-tolerant Scheduling for Hard Real-time Systems
Over the past several decades, we have experienced tremendous growth of real-time systems in both scale and complexity. This progress is made possible largely due to advancements in semiconductor technology that have enabled the continuous scaling and massive integration of transistors on a single chip. In the meantime, however, the relentless transistor scaling and integration have dramatically increased the power consumption and degraded the system reliability substantially. Traditional real-time scheduling techniques with the sole emphasis on guaranteeing timing constraints have become insufficient.
In this research, we studied the problem of how to develop advanced scheduling methods on hard real-time systems that are subject to multiple design constraints, in particular, timing, energy consumption, and reliability constraints. To this end, we first investigated the energy minimization problem with fault-tolerance requirements for dynamic-priority based hard real-time tasks on a single-core processor. Three scheduling algorithms have been developed to judiciously make tradeoffs between fault tolerance and energy reduction since both design objectives usually conflict with each other. We then shifted our research focus from single-core platforms to multi-core platforms as the latter are becoming mainstream. Specifically, we launched our research in fault-tolerant multi-core scheduling for fixed-priority tasks as fixed-priority scheduling is one of the most commonly used schemes in the industry today. For such systems, we developed several checkpointing-based partitioning strategies with the joint consideration of fault tolerance and energy minimization. At last, we exploited the implicit relations between real-time tasks in order to judiciously make partitioning decisions with the aim of improving system schedulability.
According to the simulation results, our design strategies have been shown to be very promising for emerging systems and applications where timeliness, fault-tolerance, and energy reduction need to be simultaneously addressed
Real-Time Decoding for Fault-Tolerant Quantum Computing: Progress, Challenges and Outlook
Quantum computing is poised to solve practically useful problems which are
computationally intractable for classical supercomputers. However, the current
generation of quantum computers are limited by errors that may only partially
be mitigated by developing higher-quality qubits. Quantum error correction
(QEC) will thus be necessary to ensure fault tolerance. QEC protects the
logical information by cyclically measuring syndrome information about the
errors. An essential part of QEC is the decoder, which uses the syndrome to
compute the likely effect of the errors on the logical degrees of freedom and
provide a tentative correction. The decoder must be accurate, fast enough to
keep pace with the QEC cycle (e.g., on a microsecond timescale for
superconducting qubits) and with hard real-time system integration to support
logical operations. As such, real-time decoding is essential to realize
fault-tolerant quantum computing and to achieve quantum advantage. In this
work, we highlight some of the key challenges facing the implementation of
real-time decoders while providing a succinct summary of the progress to-date.
Furthermore, we lay out our perspective for the future development and provide
a possible roadmap for the field of real-time decoding in the next few years.
As the quantum hardware is anticipated to scale up, this perspective article
will provide a guidance for researchers, focusing on the most pressing issues
in real-time decoding and facilitating the development of solutions across
quantum and computer science
Real-time and fault tolerance in distributed control software
Closed loop control systems typically contain multitude of spatially distributed sensors and actuators operated simultaneously. So those systems are parallel and distributed in their essence. But mapping this parallelism onto the given distributed hardware architecture, brings in some additional requirements: safe multithreading, optimal process allocation, real-time scheduling of bus and network resources. Nowadays, fault tolerance methods and fast even online reconfiguration are becoming increasingly important. All those often conflicting requirements, make design and implementation of real-time distributed control systems an extremely difficult task, that requires substantial knowledge in several areas of control and computer science. Although many design methods have been proposed so far, none of them had succeeded to cover all important aspects of the problem at hand. [1] Continuous increase of production in embedded market, makes a simple and natural design methodology for real-time systems needed more then ever
Recommended from our members
Software safety : a definition and some preliminary thoughts
Software safety is the subject of a research project in its initial stages at the University of California Irvine. This research deals with critical real-time software where the cost of an error is high, e.g. human life. In this paper software techniques having a bearing on safety are described and evaluated. Initial definitions of software safety concepts are presented along with some preliminary thoughts and research questions
Tolerating Correlated Failures in Massively Parallel Stream Processing Engines
Fault-tolerance techniques for stream processing engines can be categorized
into passive and active approaches. A typical passive approach periodically
checkpoints a processing task's runtime states and can recover a failed task by
restoring its runtime state using its latest checkpoint. On the other hand, an
active approach usually employs backup nodes to run replicated tasks. Upon
failure, the active replica can take over the processing of the failed task
with minimal latency. However, both approaches have their own inadequacies in
Massively Parallel Stream Processing Engines (MPSPE). The passive approach
incurs a long recovery latency especially when a number of correlated nodes
fail simultaneously, while the active approach requires extra replication
resources. In this paper, we propose a new fault-tolerance framework, which is
Passive and Partially Active (PPA). In a PPA scheme, the passive approach is
applied to all tasks while only a selected set of tasks will be actively
replicated. The number of actively replicated tasks depends on the available
resources. If tasks without active replicas fail, tentative outputs will be
generated before the completion of the recovery process. We also propose
effective and efficient algorithms to optimize a partially active replication
plan to maximize the quality of tentative outputs. We implemented PPA on top of
Storm, an open-source MPSPE and conducted extensive experiments using both real
and synthetic datasets to verify the effectiveness of our approach
- …