202 research outputs found
How to Elect a Leader Faster than a Tournament
The problem of electing a leader from among contenders is one of the
fundamental questions in distributed computing. In its simplest formulation,
the task is as follows: given processors, all participants must eventually
return a win or lose indication, such that a single contender may win. Despite
a considerable amount of work on leader election, the following question is
still open: can we elect a leader in an asynchronous fault-prone system faster
than just running a -time tournament, against a strong adaptive
adversary?
In this paper, we answer this question in the affirmative, improving on a
decades-old upper bound. We introduce two new algorithmic ideas to reduce the
time complexity of electing a leader to , using
point-to-point messages. A non-trivial application of our algorithm is a new
upper bound for the tight renaming problem, assigning items to the
participants in expected time and messages. We
complement our results with lower bound of messages for solving
these two problems, closing the question of their message complexity
Однородные сети с распределенной системой реконфигураций
Запропоновано метод та алгоритм реконфігурації розрядномодульних однорідних мереж
(РМОМ) з розподіленою реконфігурацією резервних та функціонуючих модулів. Запропонований підхід може застосовуватися до РМОМ різної розмірності та призначення, в яких несправний модуль виявляється вбудованими засобами діагностування, а реконфігурація здійснюється
під управлінням HOST процесора.This paper presents an effective reconfiguration method and algorithm of reconfiguring one and twodimensional
degradable arrays with four – port switches, when processing elements of arrays become
faulty
The Parallel Persistent Memory Model
We consider a parallel computational model that consists of processors,
each with a fast local ephemeral memory of limited size, and sharing a large
persistent memory. The model allows for each processor to fault with bounded
probability, and possibly restart. On faulting all processor state and local
ephemeral memory are lost, but the persistent memory remains. This model is
motivated by upcoming non-volatile memories that are as fast as existing random
access memory, are accessible at the granularity of cache lines, and have the
capability of surviving power outages. It is further motivated by the
observation that in large parallel systems, failure of processors and their
caches is not unusual.
Within the model we develop a framework for developing locality efficient
parallel algorithms that are resilient to failures. There are several
challenges, including the need to recover from failures, the desire to do this
in an asynchronous setting (i.e., not blocking other processors when one
fails), and the need for synchronization primitives that are robust to
failures. We describe approaches to solve these challenges based on breaking
computations into what we call capsules, which have certain properties, and
developing a work-stealing scheduler that functions properly within the context
of failures. The scheduler guarantees a time bound of in expectation, where and are the work and
depth of the computation (in the absence of failures), is the average
number of processors available during the computation, and is the
probability that a capsule fails. Within the model and using the proposed
methods, we develop efficient algorithms for parallel sorting and other
primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same
nam
Local majorities, coalitions and monopolies in graphs: a review
AbstractThis paper provides an overview of recent developments concerning the process of local majority voting in graphs, and its basic properties, from graph theoretic and algorithmic standpoints
Testing and reconfiguration of VLSI linear arrays
AbstractAchieving fault tolerance through incorporation of redundancy and reconfiguration is quite common. In this paper we study the fault tolerance of linear arrays of N processors with k bypass links whose maximum length is g. We consider both arrays with bidirectional links and unidirectional links.We first consider the problem of testing whether a set of n faulty processors is catastrophic, i.e., precludes reconfiguration. We provide new testing algorithms which improve and generalize known testing algorithms. For bidirectional arrays we provide an O(kn) time testing algorithm and for unidirectional arrays we provide an O(n) time algorithm for the case k = 1, and an O(kn log k) time algorithm, for the case k 1.When the fault pattern is not catastrophic we study the problem of finding an optimal reconfiguration of the array. We consider optimality with respect to two parameters: the size of the reconfigured array and the number of redundant links to activate. Considering optimality with respect to the size of the reconfigured array, we prove that the problem is NP-hard in the strong sense if the bypass links are bidirectional, while it can be solved in O(kng) time if the bypass links are unidirectional. Considering optimality with respect to the number of bypass links to activate, we prove that the problem can be solved in O(kn) time if the bypass links are bidirectional, and in O(kng) time if the bypass links are unidirectional
Fair Leader Election for Rational Agents in Asynchronous Rings and Networks
We study a game theoretic model where a coalition of processors might collude
to bias the outcome of the protocol, where we assume that the processors always
prefer any legitimate outcome over a non-legitimate one. We show that the
problems of Fair Leader Election and Fair Coin Toss are equivalent, and focus
on Fair Leader Election.
Our main focus is on a directed asynchronous ring of processors, where we
investigate the protocol proposed by Abraham et al.
\cite{abraham2013distributed} and studied in Afek et al.
\cite{afek2014distributed}. We show that in general the protocol is resilient
only to sub-linear size coalitions. Specifically, we show that
randomly located processors or
adversarially located processors can force any outcome. We complement this by
showing that the protocol is resilient to any adversarial coalition of size
.
We propose a modification to the protocol, and show that it is resilient to
every coalition of size , by exhibiting both an attack and a
resilience result. For every , we define a family of graphs
that can be simulated by trees where each node in the tree
simulates at most processors. We show that for every graph in
, there is no fair leader election protocol that is
resilient to coalitions of size . Our result generalizes a previous result
of Abraham et al. \cite{abraham2013distributed} that states that for every
graph, there is no fair leader election protocol which is resilient to
coalitions of size .Comment: 48 pages, PODC 201
Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz,
Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft
Achieving consensus in fault-tolerant distributed computer systems : protocols, lower bounds, and simulations
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1987.Vita.Bibliography: p. 145-149.by Brian A. Coan.Ph.D
- …