Space-Optimal Majority in Population Protocols
Population protocols are a model of distributed computing, in which
agents with limited local state interact randomly, and cooperate to
collectively compute global predicates. An extensive series of papers, across
different communities, has examined the computability and complexity
characteristics of this model. Majority, or consensus, is a central task, in
which agents need to collectively reach a decision as to which one of two
states, $A$ or $B$, had a higher initial count. Two complexity metrics are
important: the time that a protocol requires to stabilize to an output
decision, and the state space size that each agent requires.
It is known that majority requires $\Omega(\log \log n)$ states per agent to
allow for poly-logarithmic time stabilization, and that $O(\log^2 n)$ states
are sufficient. Thus, there is an exponential gap between the upper and lower
bounds.
We address this gap. We provide a new lower bound of
$\Omega(\log n)$ states for any protocol which stabilizes in $O(n^{1-c})$ time, for any constant $c > 0$. This result is conditional on basic monotonicity and output
assumptions, satisfied by all known protocols. Technically, it represents a
significant departure from previous lower bounds. Instead of relying on dense
configurations, we introduce a new surgery technique to construct executions
which contradict the correctness of algorithms that stabilize too fast.
Consequently, our lower bound applies to general initial configurations.
We give an algorithm for majority which uses $O(\log n)$ states, and
stabilizes in $O(\log^2 n)$ time. Central to the algorithm is a new leaderless
phase clock, which allows nodes to synchronize in phases of $\Theta(n \log n)$
consecutive interactions using $O(\log n)$ states per node. We also employ our
phase clock to build a leader election algorithm with $O(\log n)$ states,
which stabilizes in $O(\log^2 n)$ time.
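To make the model concrete, here is a minimal simulation sketch of a population protocol for majority. It is not the algorithm described in the abstract above: it uses the well-known four-state exact-majority rules (strong opinions `A`/`B`, weak opinions `a`/`b`), which never err when there is no tie but stabilize slowly. The function name, the uniform random pair scheduler, and the `max_steps` cutoff are assumptions made for illustration.

```python
import random

def exact_majority(num_a, num_b, seed=0, max_steps=5_000_000):
    """Simulate four-state exact majority; return "A", "B", or None (no decision)."""
    rng = random.Random(seed)
    agents = ["A"] * num_a + ["B"] * num_b
    counts = {"A": num_a, "B": num_b, "a": 0, "b": 0}
    n = len(agents)

    def interact(x, y):
        # Two opposite strong opinions cancel into weak ones.
        if {x, y} == {"A", "B"}:
            return x.lower(), y.lower()
        # A strong opinion converts an opposing weak opinion.
        if x == "A" and y == "b":
            return "A", "a"
        if x == "b" and y == "A":
            return "a", "A"
        if x == "B" and y == "a":
            return "B", "b"
        if x == "a" and y == "B":
            return "b", "B"
        return x, y

    for _ in range(max_steps):
        # Stable once only one opinion (strong or weak) remains.
        if counts["B"] == 0 and counts["b"] == 0:
            return "A"
        if counts["A"] == 0 and counts["a"] == 0:
            return "B"
        i, j = rng.sample(range(n), 2)  # uniform random pair of distinct agents
        new_i, new_j = interact(agents[i], agents[j])
        for old, new in ((agents[i], new_i), (agents[j], new_j)):
            counts[old] -= 1
            counts[new] += 1
        agents[i], agents[j] = new_i, new_j
    return None  # e.g. a tie, which this protocol never resolves
```

Calling `exact_majority(60, 40)` simulates 100 agents until every agent holds the majority opinion. The difference between the numbers of strong `A` and strong `B` agents never changes, which is why the outcome is always correct for non-tied inputs.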
Are Lock-Free Concurrent Algorithms Practically Wait-Free?
Lock-free concurrent algorithms guarantee that some concurrent operation will
always make progress in a finite number of steps. Yet programmers prefer to
treat concurrent code as if it were wait-free, guaranteeing that all operations
always make progress. Unfortunately, designing wait-free algorithms is
generally a very complex task, and the resulting algorithms are not always
efficient. While obtaining efficient wait-free algorithms has been a long-time
goal for the theory community, most non-blocking commercial code is only
lock-free.
This paper suggests a simple solution to this problem. We show that, for a
large class of lock-free algorithms, under scheduling conditions which
approximate those found in commercial hardware architectures, lock-free
algorithms behave as if they are wait-free. In other words, programmers can
keep on designing simple lock-free algorithms instead of complex wait-free
ones, and in practice, they will get wait-free progress.
Our main contribution is a new way of analyzing a general class of lock-free
algorithms under a stochastic scheduler. Our analysis relates the individual
performance of processes with the global performance of the system using Markov
chain lifting between a complex per-process chain and a simpler system progress
chain. We show that lock-free algorithms are not only wait-free with
probability 1, but that in fact a general subset of lock-free algorithms can be
closely bounded in terms of the average number of steps required until an
operation completes.
To the best of our knowledge, this is the first attempt to analyze progress
conditions, typically stated in relation to a worst case adversary, in a
stochastic model capturing their expected asymptotic behavior.
Comment: 25 pages
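The setting can be illustrated with a small simulation sketch: several processes increment a shared counter with a read followed by a compare-and-swap, and a uniform random scheduler picks which process takes the next step. This is an assumed toy model, not the paper's analysis; the function name, the two-step read/CAS loop, and the uniform scheduler are illustrative choices.

```python
import random

def simulate_cas_counter(num_procs=8, total_steps=20_000, seed=0):
    """Run a lock-free CAS counter under a uniform stochastic scheduler."""
    rng = random.Random(seed)
    counter = 0
    local = [None] * num_procs      # value read by each process; None means "must read"
    completed = [0] * num_procs     # operations finished by each process

    for _ in range(total_steps):
        p = rng.randrange(num_procs)        # scheduler picks a process uniformly
        if local[p] is None:
            local[p] = counter              # step 1: read the shared counter
        else:
            if local[p] == counter:         # step 2: CAS succeeds only if unchanged
                counter += 1
                completed[p] += 1
            local[p] = None                 # on success or failure, start over with a read
    return counter, completed
```

The algorithm is only lock-free: a single process can in principle fail its CAS forever. Under the random scheduler, however, `completed` shows every process finishing operations, which is the behaviour the abstract describes as practically wait-free.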
Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees
Asynchronous distributed algorithms are a popular way to reduce
synchronization costs in large-scale optimization, and in particular for neural
network training. However, for nonsmooth and nonconvex objectives, few
convergence guarantees exist beyond cases where closed-form proximal operator
solutions are available. As most popular contemporary deep neural networks lead
to nonsmooth and nonconvex objectives, there is now a pressing need for such
convergence guarantees. In this paper, we analyze for the first time the
convergence of stochastic asynchronous optimization for this general class of
objectives. In particular, we focus on stochastic subgradient methods allowing
for block variable partitioning, where the shared-memory-based model is
asynchronously updated by concurrent processes. To this end, we first introduce
a probabilistic model which captures key features of real asynchronous
scheduling between concurrent processes; under this model, we establish
convergence with probability one to an invariant set for stochastic subgradient
methods with momentum.
From the practical perspective, one issue with the family of methods we
consider is that it is not efficiently supported by machine learning
frameworks, as they mostly focus on distributed data-parallel strategies. To
address this, we propose a new implementation strategy for shared-memory based
training of deep neural networks, whereby concurrent parameter servers are
utilized to train a partitioned but shared model in single- and multi-GPU
settings. Based on this implementation, we achieve on average 1.2x speed-up in
comparison to state-of-the-art training methods for popular image
classification tasks without compromising accuracy.
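A sequential sketch can mimic the kind of update the abstract describes: workers pick a block of variables, compute a noisy subgradient from a stale copy of the shared model, and apply a momentum step. Everything here is an assumption for illustration: the toy objective $\sum_i |x_i - t_i|$ (nonsmooth but convex, unlike the nonconvex setting of the paper), the bounded random delay, the step-size schedule, and all names.

```python
import random

def async_block_subgradient(target, num_blocks=3, steps=3000, max_delay=3,
                            beta=0.5, noise=0.5, seed=0):
    """Simulate stale, block-wise stochastic subgradient steps with momentum."""
    rng = random.Random(seed)
    dim = len(target)
    x = [0.0] * dim
    momentum = [0.0] * dim
    history = [list(x)]  # recent snapshots of the shared model, newest last
    blocks = [list(range(b, dim, num_blocks)) for b in range(num_blocks)]

    def objective(v):
        return sum(abs(vi - ti) for vi, ti in zip(v, target))

    initial = objective(x)
    for k in range(steps):
        block = rng.choice(blocks)                      # block owned by this update
        delay = rng.randint(0, min(max_delay, len(history) - 1))
        stale = history[-1 - delay]                     # possibly outdated read
        step = 0.1 / (k + 1) ** 0.5                     # diminishing step size
        for i in block:
            diff = stale[i] - target[i]
            sub = (diff > 0) - (diff < 0)               # subgradient of |x_i - t_i|
            g = sub + rng.gauss(0.0, noise)             # stochastic perturbation
            momentum[i] = beta * momentum[i] + g
            x[i] -= step * momentum[i]
        history.append(list(x))
        if len(history) > max_delay + 1:
            history.pop(0)
    return initial, objective(x)
```

The function returns the objective before and after the run, so the effect of staleness can be explored by varying `max_delay`.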
Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms
There has been significant progress in understanding the parallelism inherent
to iterative sequential algorithms: for many classic algorithms, the depth of
the dependence structure is now well understood, and scheduling techniques have
been developed to exploit this shallow dependence structure for efficient
parallel implementations. A related, applied research strand has studied
methods by which certain iterative task-based algorithms can be efficiently
parallelized via relaxed concurrent priority schedulers. These allow for high
concurrency when inserting and removing tasks, at the cost of executing
superfluous work due to the relaxed semantics of the scheduler.
In this work, we take a step towards unifying these two research directions,
by showing that there exists a family of relaxed priority schedulers that can
efficiently and deterministically execute classic iterative algorithms such as
greedy maximal independent set (MIS) and matching. Our primary result shows
that, given a randomized scheduler with an expected relaxation factor of $k$ in
terms of the maximum allowed priority inversions on a task, and any graph on $n$
vertices, the scheduler is able to execute greedy MIS with only an additive
factor of poly($k$) expected additional iterations compared to an exact (but
not scalable) scheduler. This counter-intuitive result demonstrates that the
overhead of relaxation when computing MIS is not dependent on the input size or
structure of the input graph. Experimental results show that this overhead can
be clearly offset by the gain in performance due to the highly scalable
scheduler. In sum, we present an efficient method to deterministically
parallelize iterative sequential algorithms, with provable runtime guarantees
in terms of the number of executed tasks to completion.
Comment: PODC 2018, pages 377-386 in proceedings
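The idea of running greedy MIS on a relaxed scheduler can be sketched as follows. The scheduler model here is an assumption (it returns a uniformly random task among the `k` highest-priority pending ones), as are the function names; it is meant only to show why the output stays deterministic while some iterations are wasted.

```python
import random

def greedy_mis(adj, order):
    """Sequential greedy MIS: add a vertex unless an earlier neighbor was added."""
    in_mis = set()
    for v in order:
        if not any(u in in_mis for u in adj[v]):
            in_mis.add(v)
    return in_mis

def relaxed_mis(adj, order, k, seed=0):
    """Greedy MIS driven by a scheduler that may return any of the top-k tasks."""
    rng = random.Random(seed)
    rank = {v: i for i, v in enumerate(order)}
    pending = list(order)          # tasks sorted by priority, highest first
    decided = {}                   # vertex -> True if in the MIS
    iterations = 0
    while pending:
        iterations += 1
        idx = rng.randrange(min(k, len(pending)))
        v = pending[idx]
        higher = [u for u in adj[v] if rank[u] < rank[v]]
        if any(u not in decided for u in higher):
            continue               # dependency unresolved: wasted iteration, task stays
        decided[v] = not any(decided[u] for u in higher)
        pending.pop(idx)
    return {v for v, chosen in decided.items() if chosen}, iterations

def random_graph(n, p, seed=0):
    """Build an undirected random graph as a dict of adjacency sets."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj
```

Because a vertex is only processed once all of its higher-priority neighbors are decided, `relaxed_mis` returns the same set as `greedy_mis`; the relaxation shows up only in `iterations` exceeding the number of vertices.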
The Power of Choice in Priority Scheduling
Consider the following random process: we are given $n$ queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is $O(n)$ while the expected worst-case
cost is $O(n \log n)$. These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance.
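The random process from the abstract is easy to simulate. The sketch below inserts increasing labels into random queues and removes from the better of `choices` randomly picked queues, recording the rank of each removed label. The prefill size, the alternation of one insert per removal, and the handling of empty queues (redraw) are assumptions made for illustration.

```python
import random
from bisect import bisect_left
from collections import deque

def mean_removal_rank(num_queues=8, choices=2, prefill=400, steps=4000, seed=0):
    """Return the average rank of removed labels for the d-choice process."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_queues)]
    present = []        # all labels currently stored, kept sorted
    next_label = 0

    def insert():
        nonlocal next_label
        rng.choice(queues).append(next_label)
        present.append(next_label)      # labels only increase, so order is kept
        next_label += 1

    def remove():
        while True:
            picked = [q for q in rng.sample(queues, choices) if q]
            if picked:                  # redraw if every picked queue is empty
                break
        best = min(picked, key=lambda q: q[0])   # queue whose head has the lowest label
        label = best.popleft()
        pos = bisect_left(present, label)
        present.pop(pos)
        return pos + 1                  # rank 1 means the global minimum was removed

    for _ in range(prefill):
        insert()
    total = 0
    for _ in range(steps):
        insert()
        total += remove()
    return total / steps
```

Comparing `mean_removal_rank(choices=1)` with `mean_removal_rank(choices=2)` gives a feel for the gap between the single-choice and two-choice processes discussed above.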
GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods
We revisit the performance of the classic gradual magnitude pruning (GMP)
baseline for large language models, focusing on the classic BERT benchmark on
various popular tasks. Despite existing evidence in the literature that GMP
performs poorly, we show that a simple and general variant, which we call GMP*,
can match and sometimes outperform more complex state-of-the-art methods. Our
results provide a simple yet strong baseline for future work, highlight the
importance of parameter tuning for baselines, and even improve the performance
of the state-of-the-art second-order pruning method in this setting.
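For readers unfamiliar with the baseline, here is a minimal sketch of the two ingredients of gradual magnitude pruning: a sparsity schedule and a magnitude-based mask. The cubic schedule is the one commonly used with GMP; the specific tuning that defines GMP* is not given in the abstract, so the defaults and names below are assumptions. In practice pruning steps are interleaved with fine-tuning, which is omitted here.

```python
def cubic_sparsity(step, total_steps, initial=0.0, final=0.9):
    """Cubic schedule: sparsity rises quickly at first, then levels off at `final`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final + (initial - final) * (1.0 - progress) ** 3

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    num_pruned = int(round(sparsity * len(weights)))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(by_magnitude[:num_pruned])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]
```

For example, `magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5)` zeroes the two smallest-magnitude entries and keeps `0.5` and `2.0`.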
Why Extension-Based Proofs Fail
We introduce extension-based proofs, a class of impossibility proofs that
includes valency arguments. They are modelled as an interaction between a
prover and a protocol. Using proofs based on combinatorial topology, it has
been shown that it is impossible to deterministically solve k-set agreement
among n > k > 1 processes in a wait-free manner in certain asynchronous models.
However, it was unknown whether proofs based on simpler techniques were
possible. We show that this impossibility result cannot be obtained for one of
these models by an extension-based proof and, hence, extension-based proofs are
limited in power.
Comment: This version of the paper is for the NIS model. Previous versions of
the paper are for the NIIS model.
The Transactional Conflict Problem
The transactional conflict problem arises in transactional systems whenever
two or more concurrent transactions clash on a data item.
While the standard solution to such conflicts is to immediately abort one of
the transactions, some practical systems consider the alternative of delaying
conflict resolution for a short interval, which may allow one of the
transactions to commit. The challenge in the transactional conflict problem is
to choose the optimal length of this delay interval so as to minimize the
overall running time penalty for the conflicting transactions. In this paper,
we propose a family of optimal online algorithms for the transactional conflict
problem.
Specifically, we consider variants of this problem which arise in different
implementations of transactional systems, namely "requestor wins" and
"requestor aborts" implementations: in the former, the recipient of a coherence
request is aborted, whereas in the latter, it is the requestor which has to
abort. Both strategies are implemented by real systems.
We show that the requestor aborts case can be reduced to a classic instance
of the ski rental problem, while the requestor wins case leads to a new version
of this classical problem, for which we derive optimal deterministic and
randomized algorithms.
Moreover, we prove that, under a simplified adversarial model, our algorithms
are constant-competitive with the offline optimum in terms of throughput.
We validate our algorithmic results empirically through a hardware simulation
of hardware transactional memory (HTM), showing that our algorithms can lead to
non-trivial performance improvements for classic concurrent data structures.
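The ski rental connection can be illustrated with the classic deterministic break-even rule. The cost model below is an assumed simplification, not the paper's: waiting costs one unit per time unit, aborting costs a fixed `abort_cost`, and the conflicting transaction would commit at an unknown `commit_time`. All names are hypothetical.

```python
def delay_cost(commit_time, wait_limit, abort_cost):
    """Cost of waiting up to `wait_limit`, then aborting if the conflict persists."""
    if commit_time <= wait_limit:
        return commit_time              # the other transaction committed in time
    return wait_limit + abort_cost      # waited in vain, then paid for the abort

def offline_optimum(commit_time, abort_cost):
    """Best cost when `commit_time` is known in advance."""
    return min(commit_time, abort_cost)

def worst_ratio(abort_cost, commit_times):
    """Worst competitive ratio of the break-even rule (wait_limit = abort_cost)."""
    return max(
        delay_cost(t, abort_cost, abort_cost) / offline_optimum(t, abort_cost)
        for t in commit_times
    )
```

With the delay set equal to the abort cost, the online cost is never more than twice the offline optimum, which is the standard deterministic ski rental guarantee.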