
    How to Elect a Leader Faster than a Tournament

    The problem of electing a leader from among $n$ contenders is one of the fundamental questions in distributed computing. In its simplest formulation, the task is as follows: given $n$ processors, all participants must eventually return a win or lose indication, such that a single contender may win. Despite a considerable amount of work on leader election, the following question is still open: can we elect a leader in an asynchronous fault-prone system faster than just running a $\Theta(\log n)$-time tournament, against a strong adaptive adversary? In this paper, we answer this question in the affirmative, improving on a decades-old upper bound. We introduce two new algorithmic ideas to reduce the time complexity of electing a leader to $O(\log^* n)$, using $O(n^2)$ point-to-point messages. A non-trivial application of our algorithm is a new upper bound for the tight renaming problem, assigning $n$ items to the $n$ participants in expected $O(\log^2 n)$ time and $O(n^2)$ messages. We complement our results with a lower bound of $\Omega(n^2)$ messages for solving these two problems, closing the question of their message complexity.
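
    For context, the $\Theta(\log n)$ baseline referenced above is the classic knockout tournament: contenders are paired, one of each pair advances, and after $\lceil \log_2 n \rceil$ rounds a single winner remains. The following minimal Python sketch simulates that baseline; the random choice within a pair is a stand-in for the two-process race (e.g., a test-and-set object) that an actual distributed tournament would run, and all names are illustrative.

        import random

        def tournament_leader_election(contenders):
            """Elect a leader by knockout tournament in ceil(log2 n) rounds.

            The random choice per pair stands in for the two-process race
            (e.g., test-and-set) a real distributed tournament would use.
            """
            survivors = list(contenders)
            rounds = 0
            while len(survivors) > 1:
                winners = []
                for i in range(0, len(survivors) - 1, 2):
                    winners.append(random.choice(survivors[i:i + 2]))
                if len(survivors) % 2 == 1:   # odd one out gets a bye
                    winners.append(survivors[-1])
                survivors = winners
                rounds += 1
            return survivors[0], rounds

        leader, rounds = tournament_leader_election(range(16))
        print(leader, rounds)                 # e.g. (11, 4): 4 rounds for n = 16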

    Homogeneous networks with a distributed reconfiguration system

    A method and an algorithm are proposed for reconfiguring bit-slice modular homogeneous networks (РМОМ) with distributed reconfiguration of spare and functioning modules. The proposed approach can be applied to РМОМ of various dimensions and purposes, in which a faulty module is detected by built-in diagnostic facilities and reconfiguration is carried out under the control of a HOST processor. This paper presents an effective method and algorithm for reconfiguring one- and two-dimensional degradable arrays with four-port switches when processing elements of the arrays become faulty.
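
    As a toy illustration of the host-controlled reconfiguration described above (not the paper's actual algorithm), the HOST processor can be pictured as recomputing a logical-to-physical module mapping that skips every module flagged by the built-in diagnostics, with trailing modules serving as spares; all names below are hypothetical.

        def remap(num_logical, num_physical, faulty):
            """Host-side remapping of a linear array around faulty modules.

            num_logical: array size the application expects to see.
            num_physical: modules actually present (extras act as spares).
            faulty: physical indices flagged by built-in diagnostics.
            Returns a logical -> physical map, or None if spares run out.
            """
            mapping, phys = [], 0
            for _ in range(num_logical):
                while phys < num_physical and phys in faulty:
                    phys += 1             # switches bypass the faulty module
                if phys >= num_physical:
                    return None           # fault pattern is not tolerable
                mapping.append(phys)
                phys += 1
            return mapping

        print(remap(4, 6, {1, 3}))        # [0, 2, 4, 5]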

    The Parallel Persistent Memory Model

    We consider a parallel computational model that consists of $P$ processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows each processor to fault with bounded probability, and possibly restart. On faulting, all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for designing locality-efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of $O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)$ in expectation, where $W$ and $D$ are the work and depth of the computation (in the absence of failures), $P_A$ is the average number of processors available during the computation, and $f \le 1/2$ is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives. Comment: This paper is the full version of a paper at SPAA 2018 with the same name.
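
    A toy sketch of the capsule idea, under the simplifying assumptions that persistent memory is a dictionary that survives faults and that a fault is an exception wiping all ephemeral state: a capsule's only externally visible effect is a single idempotent write of its result, so re-executing it after a fault is harmless. The fault model and names are invented for illustration, and the interaction with the work-stealing scheduler is omitted.

        import random

        persistent = {}   # survives simulated faults; all else is ephemeral

        def run_capsule(key, fn, *args, fault_prob=0.3):
            """Run fn(*args) as a capsule: skip if done, restart on fault.

            The capsule's one visible effect is the single write of its
            result under `key`, so re-execution cannot corrupt state.
            """
            while key not in persistent:
                try:
                    if random.random() < fault_prob:
                        raise RuntimeError("simulated processor fault")
                    persistent[key] = fn(*args)   # commit point
                except RuntimeError:
                    pass                  # ephemeral state lost; retry
            return persistent[key]

        # Capsules chain by reading earlier results from persistent memory.
        a = run_capsule("a", lambda: 2 + 3)
        b = run_capsule("b", lambda x: x * 10, a)
        print(b)                          # 50, however many faults occurred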

    Local majorities, coalitions and monopolies in graphs: a review

    This paper provides an overview of recent developments concerning the process of local majority voting in graphs, and its basic properties, from graph-theoretic and algorithmic standpoints.
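
    In one common formulation of the process surveyed here, every vertex holds an opinion in {0, 1} and, in synchronous rounds, adopts the majority opinion of its closed neighborhood (itself plus its neighbors). Below is a short Python sketch under exactly those assumptions; the tie-breaking rule (keep the current opinion) is one of several used in the literature.

        def majority_step(adj, opinion):
            """One synchronous round of local majority voting on a graph.

            adj: maps each vertex to a list of its neighbors.
            opinion: maps each vertex to 0 or 1.
            Each vertex adopts the majority over its closed neighborhood;
            on a tie it keeps its current opinion.
            """
            new = {}
            for v, nbrs in adj.items():
                ones = opinion[v] + sum(opinion[u] for u in nbrs)
                size = len(nbrs) + 1
                if 2 * ones > size:
                    new[v] = 1
                elif 2 * ones < size:
                    new[v] = 0
                else:
                    new[v] = opinion[v]
            return new

        # 5-cycle: a single dissenting vertex is overruled in one round.
        adj = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
        print(majority_step(adj, {0: 1, 1: 0, 2: 0, 3: 0, 4: 0}))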

    Testing and reconfiguration of VLSI linear arrays

    Achieving fault tolerance through the incorporation of redundancy and reconfiguration is quite common. In this paper we study the fault tolerance of linear arrays of N processors with k bypass links whose maximum length is g. We consider both arrays with bidirectional links and arrays with unidirectional links. We first consider the problem of testing whether a set of n faulty processors is catastrophic, i.e., precludes reconfiguration. We provide new testing algorithms which improve and generalize known testing algorithms. For bidirectional arrays we provide an O(kn) time testing algorithm, and for unidirectional arrays we provide an O(n) time algorithm for the case k = 1 and an O(kn log k) time algorithm for the case k > 1. When the fault pattern is not catastrophic, we study the problem of finding an optimal reconfiguration of the array. We consider optimality with respect to two parameters: the size of the reconfigured array and the number of redundant links to activate. Considering optimality with respect to the size of the reconfigured array, we prove that the problem is NP-hard in the strong sense if the bypass links are bidirectional, while it can be solved in O(kng) time if the bypass links are unidirectional. Considering optimality with respect to the number of bypass links to activate, we prove that the problem can be solved in O(kn) time if the bypass links are bidirectional, and in O(kng) time if the bypass links are unidirectional.
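
    The testing problem above reduces to plain reachability: take the N processors plus virtual input and output ends as vertices, the regular links and the k bypass links (lengths at most g) as edges, and call a fault pattern catastrophic exactly when every input-to-output path meets a faulty processor. The brute-force sketch below scans the whole array under one assumed (uniform, unidirectional) link layout, so it is far less efficient than the paper's algorithms, which run over just the n faults.

        from collections import deque

        def is_catastrophic(num_procs, faulty, bypass_lengths):
            """Brute-force test of whether a fault pattern is catastrophic.

            Processors 0..num_procs-1 lie between a virtual input (-1) and
            output (num_procs); every position has a regular link of length
            1 plus bypass links of the given lengths. Catastrophic means no
            fault-free path connects input to output.
            """
            hops = [1] + list(bypass_lengths)
            seen, frontier = {-1}, deque([-1])
            while frontier:
                u = frontier.popleft()
                for h in hops:
                    v = u + h
                    if v >= num_procs:
                        return False      # output reached: reconfigurable
                    if v not in seen and v not in faulty:
                        seen.add(v)
                        frontier.append(v)
            return True                   # output unreachable

        # One bypass link (k = 1) of length g = 4:
        print(is_catastrophic(10, {2, 3, 4, 5}, [4]))  # True: 4 faults in a row
        print(is_catastrophic(10, {2, 3, 4}, [4]))     # False: bypass 1 -> 5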

    Fair Leader Election for Rational Agents in Asynchronous Rings and Networks

    We study a game-theoretic model where a coalition of processors might collude to bias the outcome of the protocol, where we assume that the processors always prefer any legitimate outcome over a non-legitimate one. We show that the problems of Fair Leader Election and Fair Coin Toss are equivalent, and focus on Fair Leader Election. Our main focus is on a directed asynchronous ring of $n$ processors, where we investigate the protocol proposed by Abraham et al. \cite{abraham2013distributed} and studied in Afek et al. \cite{afek2014distributed}. We show that in general the protocol is resilient only to coalitions of sub-linear size. Specifically, we show that $\Omega(\sqrt{n\log n})$ randomly located processors or $\Omega(\sqrt[3]{n})$ adversarially located processors can force any outcome. We complement this by showing that the protocol is resilient to any adversarial coalition of size $O(\sqrt[4]{n})$. We propose a modification to the protocol, and show that it is resilient to every coalition of size $\Theta(\sqrt{n})$, by exhibiting both an attack and a resilience result. For every $k \geq 1$, we define a family of graphs $\mathcal{G}_{k}$ that can be simulated by trees where each node in the tree simulates at most $k$ processors. We show that for every graph in $\mathcal{G}_{k}$, there is no fair leader election protocol that is resilient to coalitions of size $k$. Our result generalizes a previous result of Abraham et al. \cite{abraham2013distributed} that states that for every graph, there is no fair leader election protocol which is resilient to coalitions of size $\lceil \frac{n}{2} \rceil$. Comment: 48 pages, PODC 2018.
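
    For background, a common idealized baseline for fair election on a ring (in the spirit of, though not identical to, the protocol analyzed above) has every processor commit to a random value and elects the processor indexed by the sum modulo n: the outcome is uniform as long as one participant is honest and nobody can pick a value after seeing the others'. The sketch below abstracts away the commitment mechanics and the asynchronous message schedule, which is precisely where the paper's coalition attacks live.

        import random

        def fair_leader_round(n, rigged=None):
            """One idealized round: processor i commits to r_i in [0, n);
            the leader is (sum of all r_i) mod n.

            rigged: optional {processor: value} standing in for a coalition
            that fixes its values in advance. With one honest uniform value
            in the sum, the leader index remains uniform over 0..n-1.
            """
            rigged = rigged or {}
            values = [rigged.get(i, random.randrange(n)) for i in range(n)]
            return sum(values) % n

        # Rigging some values does not bias the outcome on its own:
        counts = [0] * 5
        for _ in range(10000):
            counts[fair_leader_round(5, rigged={0: 3, 1: 3})] += 1
        print(counts)                     # roughly uniform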

    Resiliency in numerical algorithm design for extreme scale simulations

    This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for Extreme Scale Simulations', held March 1-6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing $10^{23}$ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically. Peer reviewed. Article signed by 36 authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner and Barbara Wohlmuth. Postprint (author's final draft).
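
    The opening energy estimate checks out under the stated figures, assuming an electricity price of roughly 0.10 Euro per kWh and an exaflop-class machine sustaining $10^{18}$ flop/s:

        $20\,\mathrm{MW} \times 48\,\mathrm{h} = 960\,\mathrm{MWh} \approx 10^{6}\,\mathrm{kWh}$
        $10^{6}\,\mathrm{kWh} \times 0.10\,\mathrm{Euro/kWh} \approx 10^{5}\,\mathrm{Euro}$
        $10^{18}\,\mathrm{flop/s} \times 48 \times 3600\,\mathrm{s} \approx 1.7 \times 10^{23}\,\mathrm{flop}$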

    Achieving consensus in fault-tolerant distributed computer systems: protocols, lower bounds, and simulations

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1987. Vita. Bibliography: p. 145-149. By Brian A. Coan.