17,062 research outputs found

    Fault Tolerance in Networks of Bounded Degree

    Get PDF
    Achieving processor cooperation in the presence of faults is a major problem in distributed systems. Popular paradigms such as Byzantine agreement have been studied principally in the context of a complete network. Indeed, Dolev [J. Algorithms, 3 (1982), pp. 14–30] and Hadzilacos [Issues of Fault Tolerance in Concurrent Computations, Ph.D. thesis, Harvard University, Cambridge, MA, 1984] have shown that Ω(t) connectivity is necessary if the requirement is that all nonfaulty processors decide unanimously, where t is the number of faults to be tolerated. We believe that in forseeable technologies the number of faults will grow with the size of the network while the degree will remain practically fixed. We therefore raise the question whether it is possible to avoid the connectivity requirements by slightly lowering our expectations. In many practical situations we may be willing to “lose” some correct processors and settle for cooperation between the vast majority of the processors. Thus motivated, we present a general simulation technique by which vertices (processors) in almost any network of bounded degree can simulate an algorithm designed for the complete network. The simulation has the property that although some correct processors may be cut off from the majority of the network by faulty processors, the vast majority of the correct processors will be able to communicate among themselves undisturbed by the (arbitrary) behavior of the faulty nodes. We define a new paradigm for distributed computing, almost-everywhere agreement, in which we require only that almost all correct processors reach consensus. Unlike the traditional Byzantine agreement problem, almost-everywhere agreement can be solved on networks of bounded degree. Specifically, we can simulate any sufficiently resilient Byzantine agreement algorithm on a network of bounded degree using our communication scheme described above. Although we “lose” some correct processors, effectively treating them as faulty, the vast majority of correct processors decide on a common value

    Sparse Allreduce: Efficient Scalable Communication for Power-Law Data

    Full text link
    Many large datasets exhibit power-law statistics: The web graph, social networks, text data, click through data etc. Their adjacency graphs are termed natural graphs, and are known to be difficult to partition. As a consequence most distributed algorithms on these graphs are communication intensive. Many algorithms on natural graphs involve an Allreduce: a sum or average of partitioned data which is then shared back to the cluster nodes. Examples include PageRank, spectral partitioning, and many machine learning algorithms including regression, factor (topic) models, and clustering. In this paper we describe an efficient and scalable Allreduce primitive for power-law data. We point out scaling problems with existing butterfly and round-robin networks for Sparse Allreduce, and show that a hybrid approach improves on both. Furthermore, we show that Sparse Allreduce stages should be nested instead of cascaded (as in the dense case). And that the optimum throughput Allreduce network should be a butterfly of heterogeneous degree where degree decreases with depth into the network. Finally, a simple replication scheme is introduced to deal with node failures. We present experiments showing significant improvements over existing systems such as PowerGraph and Hadoop

    On The Robustness of a Neural Network

    Get PDF
    With the development of neural networks based machine learning and their usage in mission critical applications, voices are rising against the \textit{black box} aspect of neural networks as it becomes crucial to understand their limits and capabilities. With the rise of neuromorphic hardware, it is even more critical to understand how a neural network, as a distributed system, tolerates the failures of its computing nodes, neurons, and its communication channels, synapses. Experimentally assessing the robustness of neural networks involves the quixotic venture of testing all the possible failures, on all the possible inputs, which ultimately hits a combinatorial explosion for the first, and the impossibility to gather all the possible inputs for the second. In this paper, we prove an upper bound on the expected error of the output when a subset of neurons crashes. This bound involves dependencies on the network parameters that can be seen as being too pessimistic in the average case. It involves a polynomial dependency on the Lipschitz coefficient of the neurons activation function, and an exponential dependency on the depth of the layer where a failure occurs. We back up our theoretical results with experiments illustrating the extent to which our prediction matches the dependencies between the network parameters and robustness. Our results show that the robustness of neural networks to the average crash can be estimated without the need to neither test the network on all failure configurations, nor access the training set used to train the network, both of which are practically impossible requirements.Comment: 36th IEEE International Symposium on Reliable Distributed Systems 26 - 29 September 2017. Hong Kong, Chin

    Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery

    Get PDF
    The trend of downsizing transistors and operating voltage scaling has made the processor chip more sensitive against radiation phenomena making soft errors an important challenge. New reliability techniques for handling soft errors in the logic and memories that allow meeting the desired failures-in-time (FIT) target are key to keep harnessing the benefits of Moore's law. The failure to scale the soft error rate caused by particle strikes, may soon limit the total number of cores that one may have running at the same time. This paper proposes a light-weight and scalable architecture to eliminate silent data corruption errors (SDC) and detected unrecoverable errors (DUE) of a core. The architecture uses acoustic wave detectors for error detection. We propose to recover by confining the errors in the cache hierarchy, allowing us to deal with the relatively long detection latencies. Our results show that the proposed mechanism protects the whole core (logic, latches and memory arrays) incurring performance overhead as low as 0.60%. © 2014 IEEE.Peer ReviewedPostprint (author's final draft

    Stochastic Analysis of a Churn-Tolerant Structured Peer-to-Peer Scheme

    Full text link
    We present and analyze a simple and general scheme to build a churn (fault)-tolerant structured Peer-to-Peer (P2P) network. Our scheme shows how to "convert" a static network into a dynamic distributed hash table(DHT)-based P2P network such that all the good properties of the static network are guaranteed with high probability (w.h.p). Applying our scheme to a cube-connected cycles network, for example, yields a O(logN)O(\log N) degree connected network, in which every search succeeds in O(logN)O(\log N) hops w.h.p., using O(logN)O(\log N) messages, where NN is the expected stable network size. Our scheme has an constant storage overhead (the number of nodes responsible for servicing a data item) and an O(logN)O(\log N) overhead (messages and time) per insertion and essentially no overhead for deletions. All these bounds are essentially optimal. While DHT schemes with similar guarantees are already known in the literature, this work is new in the following aspects: (1) It presents a rigorous mathematical analysis of the scheme under a general stochastic model of churn and shows the above guarantees; (2) The theoretical analysis is complemented by a simulation-based analysis that validates the asymptotic bounds even in moderately sized networks and also studies performance under changing stable network size; (3) The presented scheme seems especially suitable for maintaining dynamic structures under churn efficiently. In particular, we show that a spanning tree of low diameter can be efficiently maintained in constant time and logarithmic number of messages per insertion or deletion w.h.p. Keywords: P2P Network, DHT Scheme, Churn, Dynamic Spanning Tree, Stochastic Analysis

    Storage and Search in Dynamic Peer-to-Peer Networks

    Full text link
    We study robust and efficient distributed algorithms for searching, storing, and maintaining data in dynamic Peer-to-Peer (P2P) networks. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our goal is to guarantee, despite high node churn rate, that a large number of nodes in the network can store, retrieve, and maintain a large number of data items. Our main contributions are fast randomized distributed algorithms that guarantee the above with high probability (whp) even under high adversarial churn: 1. A randomized distributed search algorithm that (whp) guarantees that searches from as many as no(n)n - o(n) nodes (nn is the stable network size) succeed in O(logn){O}(\log n)-rounds despite O(n/log1+δn){O}(n/\log^{1+\delta} n) churn, for any small constant δ>0\delta > 0, per round. We assume that the churn is controlled by an oblivious adversary (that has complete knowledge and control of what nodes join and leave and at what time, but is oblivious to the random choices made by the algorithm). 2. A storage and maintenance algorithm that guarantees (whp) data items can be efficiently stored (with only Θ(logn)\Theta(\log{n}) copies of each data item) and maintained in a dynamic P2P network with churn rate up to O(n/log1+δn){O}(n/\log^{1+\delta} n) per round. Our search algorithm together with our storage and maintenance algorithm guarantees that as many as no(n)n - o(n) nodes can efficiently store, maintain, and search even under O(n/log1+δn){O}(n/\log^{1+\delta} n) churn per round. Our algorithms require only polylogarithmic in nn bits to be processed and sent (per round) by each node. To the best of our knowledge, our algorithms are the first-known, fully-distributed storage and search algorithms that provably work under highly dynamic settings (i.e., high churn rates per step).Comment: to appear at SPAA 201
    corecore