2,263 research outputs found

    Complexity of Multi-Value Byzantine Agreement

    Full text link
    In this paper, we consider the problem of maximizing the throughput of Byzantine agreement, given that the sum capacity of all links in between nodes in the system is finite. We have proposed a highly efficient Byzantine agreement algorithm on values of length l>1 bits. This algorithm uses error detecting network codes to ensure that fault-free nodes will never disagree, and routing scheme that is adaptive to the result of error detection. Our algorithm has a bit complexity of n(n-1)l/(n-t), which leads to a linear cost (O(n)) per bit agreed upon, and overcomes the quadratic lower bound (Omega(n^2)) in the literature. Such linear per bit complexity has only been achieved in the literature by allowing a positive probability of error. Our algorithm achieves the linear per bit complexity while guaranteeing agreement is achieved correctly even in the worst case. We also conjecture that our algorithm can be used to achieve agreement throughput arbitrarily close to the agreement capacity of a network, when the sum capacity is given

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Full text link
    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.Comment: 11 page

    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

    Get PDF
    Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions

    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

    Get PDF
    Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation

    Optimal mobile byzantine fault tolerant distributed storage: [Extended Abstract]

    Get PDF
    We present an optimal emulation of a server based regular read/write storage in a synchronous round-free messagepassing system that is subject to mobile Byzantine failures and prove that the problem is impossible to solve in asynchronous settings. In a system with n servers implementing a regular register, our construction tolerates faults (or attacks) that can be abstracted by agents that are moved (in an arbitrary and unforeseen manner) by a computationally unbounded adversary from a server to another in order to deviate the server's computation. When a server is infected by an adversarial agent, it behaves arbitrarily until the adversary decides to "move" the agent to another server. We investigate the case where the movements of the mobile Byzantine agents are decided by the adversary and are completely decoupled from the message communication delay. Our emulation spans two awareness models: servers with and without self-diagnosis mechanism. In the first case servers are aware that the mobile Byzantine agent has left and hence they can stop running the protocol until they recover a correct state while in the second case, servers are not aware of their faulty state and continue to run the protocol using an incorrect local state. Our results, proven optimal with respect to the threshold of the tolerated mobile Byzantine faults in the first model, are significantly different from the round-based synchronous models. Another interesting side result of our study is that, contrary to the round-based synchronous consensus implementation for systems prone to mobile Byzantine faults, our storage emulation does not rely on the necessity of a core of correct processes all along the computation. That is, every server in the system can be compromised by the mobile Byzantine agents at some point in the computation. This leads to another interesting conclusion: storage is easier than consensus in synchronous settings, when the system is hit by mobile Byzantine failures

    Synchronization and fault-masking in redundant real-time systems

    Get PDF
    A real time computer may fail because of massive component failures or not responding quickly enough to satisfy real time requirements. An increase in redundancy - a conventional means of improving reliability - can improve the former but can - in some cases - degrade the latter considerably due to the overhead associated with redundancy management, namely the time delay resulting from synchronization and voting/interactive consistency techniques. The implications of synchronization and voting/interactive consistency algorithms in N-modular clusters on reliability are considered. All these studies were carried out in the context of real time applications. As a demonstrative example, we have analyzed results from experiments conducted at the NASA Airlab on the Software Implemented Fault Tolerance (SIFT) computer. This analysis has indeed indicated that in most real time applications, it is better to employ hardware synchronization instead of software synchronization and not allow reconfiguration

    A survey of modern exogenous fault detection and diagnosis methods for swarm robotics

    Get PDF
    Swarm robotic systems are heavily inspired by observations of social insects. This often leads to robust-ness being viewed as an inherent property of them. However, this has been shown to not always be thecase. Because of this, fault detection and diagnosis in swarm robotic systems is of the utmost importancefor ensuring the continued operation and success of the swarm. This paper provides an overview of recentwork in the field of exogenous fault detection and diagnosis in swarm robotics, focusing on the four areaswhere research is concentrated: immune system, data modelling, and blockchain-based fault detectionmethods and local-sensing based fault diagnosis methods. Each of these areas have significant advan-tages and disadvantages which are explored in detail. Though the work presented here represents a sig-nificant advancement in the field, there are still large areas that require further research. Specifically,further research is required in testing these methods on real robotic swarms, fault diagnosis methods,and integrating fault detection, diagnosis and recovery methods in order to create robust swarms thatcan be used for non-trivial tasks
    • …
    corecore