64,188 research outputs found

    Distributed operating systems

    In the past five years, distributed operating systems research has gone through a consolidation phase. On a large number of design issues there is now considerable consensus between different research groups. In this paper, an overview of recent research in distributed systems is given. In turn, the paper discusses overall system structure, protection issues, file system designs, problems and solutions for fault tolerance and a mechanism that is rapidly becoming very important for efficient distributed systems design: hints. An attempt was made to provide sufficient references to interesting research projects for the reader to find material for more detailed study.

    Distributed intelligent robotics : research & development in fault-tolerant control and size/position identification : a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Computer Systems Engineering at Massey University

    This thesis presents research conducted on aspects of intelligent robotic systems. In the past two decades, robotics has become one of the most rapidly expanding and developing fields of science. Robotics can be considered as the science of using artificial intelligence in the physical world. Many areas of study exist in robotics. Among these, two fields that are of paramount importance in real-world applications are fault tolerance and sensory systems. Fault tolerance is necessary since a robot in the real world could encounter internal faults, and may also have to continue functioning under adverse conditions. Sensory mechanisms are essential since a robot will possess little intelligence if it does not have methods of acquiring information about its environment. Both of these fields are researched in this thesis. In particular, emphasis is placed on distributed intelligent autonomous systems. Experiments and simulations have been conducted to investigate design for fault tolerance. A suitable platform was also chosen for an implementation of a visual system, as an example of a working sensory mechanism.

    A System for Distributed Mechanisms: Design, Implementation and Applications

    We describe here a structured system for distributed mechanism design appropriate for both Intranet and Internet applications. In our approach the players dynamically form a network in which they know neither their neighbours nor the size of the network and interact to jointly take decisions. The only assumption concerning the underlying communication layer is that for each pair of processes there is a path of neighbours connecting them. This allows us to deal with arbitrary network topologies. We also discuss the implementation of this system, which consists of a sequence of layers. The lower layers deal with the operations that implement the basic primitives of distributed computing, namely low-level communication and distributed termination, while the upper layers use these primitives to implement high-level communication among players, including broadcasting and multicasting, and distributed decision making. This yields a highly flexible distributed system whose specific applications are realized as instances of its top layer. This design is implemented in Java. The system supports fault tolerance at various levels and includes a provision for distributed policing, the purpose of which is to exclude 'dishonest' players. It can also be used for repeated creation of dynamically formed networks of players interested in joint decision making implemented by means of a tax-based mechanism. We illustrate its flexibility by discussing a number of implemented examples. Comment: 36 pages; revised and expanded version

    Distributed Planetary Object Name Service: Issues and Design Principles

    The ONS is a central lookup service used in the EPCglobal network for retrieving the location of information about a specific EPC. This centralized solution lacks scalability and fault tolerance. We present the design principles of a distributed solution for the ONS lookup service. In distributed systems, providing a scalable location service requires a dynamic mechanism to associate identification and location. We show that Distributed Hash Tables (DHTs) are a good candidate for distributing the service, as they provide such a mechanism. We then outline how to adapt the DHT principles (operations on objects or nodes) to the ONS distribution problem.
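
The way a DHT associates an identifier with a location can be sketched with a toy consistent-hashing ring. This is a minimal illustration under stated assumptions, not the design from the paper: the node names (`ons-a`, `ons-b`, `ons-c`), the `HashRing` class and the sample EPC string are all hypothetical.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a point on a 2**32 ring using the first 4 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Toy consistent-hashing ring: a key is owned by the first node at or after its hash."""

    def __init__(self, nodes):
        # Sorted list of (ring position, node name) pairs.
        self._points = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        hashes = [h for h, _ in self._points]
        # Successor lookup; the modulo wraps around past the last node.
        i = bisect.bisect_left(hashes, ring_hash(key)) % len(self._points)
        return self._points[i][1]

    def add_node(self, node: str) -> None:
        bisect.insort(self._points, (ring_hash(node), node))

    def remove_node(self, node: str) -> None:
        self._points.remove((ring_hash(node), node))


if __name__ == "__main__":
    ring = HashRing(["ons-a", "ons-b", "ons-c"])  # hypothetical node names
    epc = "urn:epc:id:sgtin:0614141.107346.2017"  # illustrative identifier
    print(epc, "is served by", ring.owner(epc))
```

When a node leaves, only the keys it owned move to another node; every other key keeps its owner, which is the property that makes this kind of lookup attractive for a service that must scale and tolerate node failures.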

    A Distributed Platform for Mechanism Design

    We describe a structured system for distributed mechanism design. It consists of a sequence of layers. The lower layers deal with the operations relevant for distributed computing only, while the upper layers are concerned only with communication among players, including broadcasting and multicasting, and distributed decision making. This yields a highly flexible distributed system whose specific applications are realized as instances of its top layer. This design supports fault-tolerance, prevents manipulations and makes it possible to implement distributed policing. The system is implemented in Java. We illustrate it by discussing a number of implemented examples. Comment: 6 pages. To appear in the Proc. of International Conference on Computational Intelligence for Modelling, Control and Automation, IEEE Societ

    Design of BATRUN Distributed Processing System

    This paper discusses the design of the BATRUN Distributed Processing System (DPS). We have developed this system to automate the execution of jobs in a cluster of workstations where machines belong to different owners. The objective is to use a general-purpose cluster as one massive computer for processing large applications. In contrast to a dedicated cluster, the scheduling in BATRUN DPS must ensure that only idle cycles are used for distributed computing and that local users, when they are operating, have full control of their machines. BATRUN DPS has several unique features: (1) a group-based scheduling policy to ensure execution priority based on ownership of machines, and (2) a multi-cell distributed design to eliminate a single point of failure as well as to guarantee better fault tolerance and scalability. The implementation of the system is based on multi-threading and a remote procedure call mechanism.

    FastRecover: simple and effective fault recovery in a distributed operator-based stream processing engine

    Fault tolerance is a key requirement in large-scale distributed stream processing engines (SPEs), especially those that run atop commodity hardware. Currently, fault tolerance in popular distributed SPEs is either inadequate (e.g., those without automatic recovery of operator states) or complex and inefficient (e.g., those with transactional semantics). There are two major considerations in the design of an effective fault tolerance mechanism: the overhead of additional checkpointing operations during normal processing, and the time required to recover and return to normal processing when a failure happens. The main challenge is that faster recovery requires higher checkpointing overhead, and vice versa. This thesis presents FastRecover, a novel fault tolerance mechanism for distributed SPEs that strikes a balance between recovery time and checkpointing overhead. Specifically, given an application topology consisting of interconnected operators, and an upper bound on checkpoint overhead, FastRecover computes the optimal expected recovery time, as well as the strategy used for checkpointing and recovery in each operator. The main idea of FastRecover is to compute an optimal partitioning of the streaming operator topology into independent segments; for each segment, FastRecover backs up its input tuples and periodically checkpoints the states of operators therein. During recovery for a particular segment, FastRecover restores each affected operator state in the segment to the latest checkpoint, and replays the inputs of the segment since then. Both checkpointing and recovery utilize the parallel processing capabilities of the distributed SPE. Extensive experiments demonstrate that FastRecover achieves an average of 50% reduction in expected recovery time compared to simple solutions. The experiments also show that the total expected recovery time varies proportionally to the total computational recovery time and recovery latency in tests with simulated failures, and hence is a good measure to optimize.
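
The checkpoint-and-replay step described in this abstract can be sketched for a single segment as follows. This is a simplified illustration, not FastRecover itself: the `Segment` class, its running-sum state and the `checkpoint_every` parameter are assumptions, and the topology partitioning and its optimization are not modelled.

```python
class Segment:
    """Toy segment with one stateful operator (a running sum).

    Inputs are logged as they arrive, and the log is truncated at each
    checkpoint, so recovery is 'restore the checkpoint, replay the log'.
    """

    def __init__(self, checkpoint_every: int):
        self.checkpoint_every = checkpoint_every  # hypothetical tuning knob
        self.state = 0        # live operator state
        self.checkpoint = 0   # last saved copy of the state
        self.input_log = []   # inputs received since the last checkpoint

    def process(self, value: int) -> None:
        self.input_log.append(value)  # back up the input tuple first
        self.state += value
        if len(self.input_log) >= self.checkpoint_every:
            self.take_checkpoint()

    def take_checkpoint(self) -> None:
        self.checkpoint = self.state
        self.input_log.clear()

    def crash(self) -> None:
        self.state = None  # the live state is lost on failure

    def recover(self) -> int:
        """Restore the last checkpoint, replay logged inputs, return the replay count."""
        self.state = self.checkpoint
        for value in self.input_log:
            self.state += value
        return len(self.input_log)


if __name__ == "__main__":
    for interval in (3, 5):
        seg = Segment(checkpoint_every=interval)
        for v in range(1, 8):
            seg.process(v)
        seg.crash()
        replayed = seg.recover()
        print(f"interval={interval}: replayed {replayed} tuple(s), state={seg.state}")
```

A longer checkpoint interval means fewer checkpoints during normal processing but more tuples to replay after a failure, which is the trade-off the abstract refers to.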

    Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

    Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments in which to archive data without enough redundancy. Most redundancy schemes are not completely effective for providing high availability, durability and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault-tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault-tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability and durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality; hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios. Comment: The publication has 12 pages and 13 figures. This work was partially supported by Swiss National Science Foundation SNSF Doc.Mobility 162014, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
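
The idea of propagating redundancy by entangling each block with earlier parity can be illustrated with a single XOR chain. This is a deliberately reduced sketch and an assumption about the simplest possible setting: it uses one chain only, whereas the codes in the abstract use alpha chains and two further parameters that are not modelled here. The function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def entangle(blocks, block_size: int):
    """Build one parity chain: parity[i] = blocks[i] XOR parity[i - 1].

    The chain starts from an all-zero block, so parity[0] == blocks[0].
    """
    parities = []
    prev = bytes(block_size)
    for block in blocks:
        prev = xor_bytes(block, prev)
        parities.append(prev)
    return parities


def repair_data(i: int, parities, block_size: int) -> bytes:
    """Rebuild a lost data block from the two parities adjacent to it."""
    left = parities[i - 1] if i > 0 else bytes(block_size)
    return xor_bytes(left, parities[i])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    chain = entangle(data, block_size=4)
    # Pretend data[1] was lost and rebuild it from its neighbouring parities.
    print(repair_data(1, chain, block_size=4))
```

Repairing a block touches only its two neighbouring parities rather than a whole stripe, which gives a rough intuition for the locality property the abstract highlights.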

    A metaobject architecture for fault-tolerant distributed systems : the FRIENDS approach

    The FRIENDS system developed at LAAS-CNRS is a metalevel architecture providing libraries of metaobjects for fault tolerance, secure communication, and group-based distributed applications. The use of metaobjects provides a nice separation of concerns between mechanisms and applications. Metaobjects can be used transparently by applications and can be composed according to the needs of a given application, a given architecture, and its underlying properties. In FRIENDS, metaobjects are used recursively to add new properties to applications. They are designed using an object oriented design method and implemented on top of basic system services. This paper describes the FRIENDS software-based architecture, the object-oriented development of metaobjects, the experiments that we have done, and summarizes the advantages and drawbacks of a metaobject approach for building fault-tolerant system