29,163 research outputs found

    A Symbiotic Approach to Designing Cross-Layer QoS in Embedded Real-Time Systems

    Get PDF
    International audienceNowadays there is an increasing need for embedded systems to support intensive computing while maintaining traditional hard real-time and fault-tolerant properties. Extending the principle of multi-core systems, we are exploring the use of distributed processing units interconnected via a high performance mesh network as a way of supporting distributed real-time applications. Fault-tolerance can then be ensured through dynamic allocation of both computing and communication resources. We postulate that enhancing QoS (Quality of Service) for real-time applications entails the development of a cross-layer support of high-level requirements, thus requiring a deep knowledge of the underlying networks. In this paper, we propose a new simulation/emulation/experimentation framework, ERICA, for designing such a feature. ERICA integrates both a network simulator and an actual hardware network to allow implementation and evaluation of different QoS-guaranteeing mechanisms. It also supports real-software-in-the-loop, i.e. running of real applications and middleware over these networks. Each component can evolve separately or together in a symbiotic manner, also making teamwork more flexible. We present in more detail our discrete-event simulation approach and the in-silicon implementation with which we cross-check our solutions in order to bring real performance aspects to our work. We also discuss the challenges of running real-software-in-the-loop in a real-time context, i.e. how to bridge it with a network simulator, and how to deal with time consistency

    Fairness Properties of the Trusting Failure Detector

    Get PDF
    In 1985 it was shown by Fischer et al. that consensus, a fundamental problem in distributed computing, was impossible in asynchronous distributed systems in the presence of even just one process failure. This result prompted a search for alternative system models that were capable of solving such problems and culminated in the development of two helpful constructs: partially synchronous system models and failure detectors. Partially synchronous system models seek to solve the problem of identifying process crashes by constraining the real-time behavior of the underlying system. In the resulting models, crashed processes can be detected indirectly through the use of timeouts. Failure detectors, on the other hand, address process crashes by directly providing (potentially inaccurate) information on failures. As a result, failure detectors were viewed as abstractions of real-time information. Pike et al. proposed a different perspective on failure detectors; as abstracting fairness properties. Fairness in a system imposes bounds on the relative frequencies of communication and execution between processes in a system, and it was shown that four frequently-used failure detectors from the Chandra-Toueg hierarchy (P, ♢P, S, ♢S) encapsulate these fairness properties. This discovery suggests that failure detectors may be better understood as abstractions of fairness rather than real-time properties as well as demonstrates the possibility to communicate results between systems augmented with failure detectors and partially synchronous system models. In this thesis, we will be discussing an extension of the Pike et al. result to the trusting failure detector. The trusting failure detector is the weakest failure detector to implement the problem of fault-tolerant mutual exclusion: a fundamental primitive for distributed computing

    Computing in the RAIN: a reliable array of independent nodes

    Get PDF
    The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology

    Introduction to the special section on dependable network computing

    Get PDF
    Dependable network computing is becoming a key part of our daily economic and social life. Every day, millions of users and businesses are utilizing the Internet infrastructure for real-time electronic commerce transactions, scheduling important events, and building relationships. While network traffic and the number of users are rapidly growing, the mean-time between failures (MTTF) is surprisingly short; according to recent studies, in the majority of Internet backbone paths, the MTTF is 28 days. This leads to a strong requirement for highly dependable networks, servers, and software systems. The challenge is to build interconnected systems, based on available technology, that are inexpensive, accessible, scalable, and dependable. This special section provides insights into a number of these exciting challenges

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Full text link
    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.Comment: 11 page

    Replica determinism and flexible scheduling in hard real-time dependable systems

    Get PDF
    Fault-tolerant real-time systems are typically based on active replication where replicated entities are required to deliver their outputs in an identical order within a given time interval. Distributed scheduling of replicated tasks, however, violates this requirement if on-line scheduling, preemptive scheduling, or scheduling of dissimilar replicated task sets is employed. This problem of inconsistent task outputs has been solved previously by coordinating the decisions of the local schedulers such that replicated tasks are executed in an identical order. Global coordination results either in an extremely high communication effort to agree on each schedule decision or in an overly restrictive execution model where on-line scheduling, arbitrary preemptions, and nonidentically replicated task sets are not allowed. To overcome these restrictions, a new method, called timed messages, is introduced. Timed messages guarantee deterministic operation by presenting consistent message versions to the replicated tasks. This approach is based on simulated common knowledge and a sparse time base. Timed messages are very effective since they neither require communication between the local scheduler nor do they restrict usage of on-line flexible scheduling, preemptions and nonidentically replicated task sets
    corecore