
### Space-Optimal Majority in Population Protocols

Population protocols are a model of distributed computing, in which $n$
agents with limited local state interact randomly, and cooperate to
collectively compute global predicates. An extensive series of papers, across
different communities, has examined the computability and complexity
characteristics of this model. Majority, or consensus, is a central task, in
which agents need to collectively reach a decision as to which one of two
states $A$ or $B$ had a higher initial count. Two complexity metrics are
important: the time that a protocol requires to stabilize to an output
decision, and the state space size that each agent requires.
It is known that majority requires $\Omega(\log \log n)$ states per agent to
allow for poly-logarithmic time stabilization, and that $O(\log^2 n)$ states
are sufficient. Thus, there is an exponential gap between the upper and lower
bounds.
We address this question. We provide a new lower bound of $\Omega(\log n)$
states for any protocol which stabilizes in $O(n^{1-c})$ time, for any constant
$c > 0$. This result is conditional on basic monotonicity and output
assumptions, satisfied by all known protocols. Technically, it represents a
significant departure from previous lower bounds. Instead of relying on dense
configurations, we introduce a new surgery technique to construct executions
which contradict the correctness of algorithms that stabilize too fast.
Consequently, our lower bound applies to general initial configurations.
We give an algorithm for majority which uses $O(\log n)$ states, and
stabilizes in $O(\log^2 n)$ time. Central to the algorithm is a new leaderless
phase clock, which allows nodes to synchronize in phases of $\Theta(n \log{n})$
consecutive interactions using $O(\log n)$ states per node. We also employ our
phase clock to build a leader election algorithm with $O(\log n )$ states,
which stabilizes in $O(\log^2 n)$ time.
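
To make the model concrete, the following minimal sketch simulates random pairwise interactions for the majority task. It uses the well-known four-state exact-majority rules (strong opinions `A`/`B`, weak opinions `a`/`b`), not the $O(\log n)$-state algorithm from the abstract, whose details are not given here. The function name `exact_majority` and its parameters are illustrative assumptions.

```python
import random


def exact_majority(num_a, num_b, seed=0, max_steps=10_000_000):
    """Simulate a four-state exact-majority population protocol.

    Assumed rules (classic textbook protocol, NOT the paper's algorithm):
      A + B -> a + b   (strong opposing opinions cancel)
      A + b -> A + a   (strong A converts weak b)
      B + a -> B + b   (strong B converts weak a)
    Returns (winner, interactions) with winner in {'A', 'B'}.
    """
    if num_a == num_b:
        raise ValueError("this simple protocol does not resolve ties")
    rng = random.Random(seed)
    agents = ["A"] * num_a + ["B"] * num_b
    n = len(agents)
    counts = {"A": num_a, "B": num_b, "a": 0, "b": 0}

    def set_state(i, new_state):
        counts[agents[i]] -= 1
        agents[i] = new_state
        counts[new_state] += 1

    for step in range(1, max_steps + 1):
        # The scheduler picks a uniformly random pair of distinct agents.
        i, j = rng.sample(range(n), 2)
        pair = {agents[i], agents[j]}
        if pair == {"A", "B"}:
            set_state(i, agents[i].lower())
            set_state(j, agents[j].lower())
        elif pair == {"A", "b"}:
            set_state(i if agents[i] == "b" else j, "a")
        elif pair == {"B", "a"}:
            set_state(i if agents[i] == "a" else j, "b")

        # Stable once every agent (strong or weak) holds the same opinion.
        if counts["B"] == 0 and counts["b"] == 0:
            return "A", step
        if counts["A"] == 0 and counts["a"] == 0:
            return "B", step
    raise RuntimeError("did not stabilize within max_steps")


if __name__ == "__main__":
    winner, interactions = exact_majority(60, 40, seed=1)
    # Parallel time is conventionally the interaction count divided by n.
    print(winner, interactions / 100)
```

The difference between the strong `A` and strong `B` counts never changes, which is why the initial majority always wins in this sketch. The price is slow stabilization when the initial gap is small, which is the kind of time/state trade-off the bounds above quantify.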

### Are Lock-Free Concurrent Algorithms Practically Wait-Free?

Lock-free concurrent algorithms guarantee that some concurrent operation will
always make progress in a finite number of steps. Yet programmers prefer to
treat concurrent code as if it were wait-free, guaranteeing that all operations
always make progress. Unfortunately, designing wait-free algorithms is
generally a very complex task, and the resulting algorithms are not always
efficient. While obtaining efficient wait-free algorithms has been a long-time
goal for the theory community, most non-blocking commercial code is only
lock-free.
This paper suggests a simple solution to this problem. We show that, for a
large class of lock-free algorithms, under scheduling conditions which
approximate those found in commercial hardware architectures, lock-free
algorithms behave as if they are wait-free. In other words, programmers can
keep on designing simple lock-free algorithms instead of complex wait-free
ones, and in practice, they will get wait-free progress.
Our main contribution is a new way of analyzing a general class of lock-free
algorithms under a stochastic scheduler. Our analysis relates the individual
performance of processes with the global performance of the system using Markov
chain lifting between a complex per-process chain and a simpler system progress
chain. We show that lock-free algorithms are not only wait-free with
probability 1, but that in fact a general subset of lock-free algorithms can be
closely bounded in terms of the average number of steps required until an
operation completes.
To the best of our knowledge, this is the first attempt to analyze progress
conditions, typically stated in relation to a worst case adversary, in a
stochastic model capturing their expected asymptotic behavior.

Comment: 25 pages

### GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods

We revisit the performance of the classic gradual magnitude pruning (GMP)
baseline for large language models, focusing on the classic BERT benchmark on
various popular tasks. Despite existing evidence in the literature that GMP
performs poorly, we show that a simple and general variant, which we call GMP*,
can match and sometimes outperform more complex state-of-the-art methods. Our
results provide a simple yet strong baseline for future work, highlight the
importance of parameter tuning for baselines, and even improve the performance
of the state-of-the-art second-order pruning method in this setting.

### The Power of Populations in Decentralized Learning Dynamics

We study a distributed multi-armed bandit setting among a population of $n$
memory-constrained nodes in the gossip model: at each round, every node locally
adopts one of $m$ arms, observes a reward drawn from the arm's (adversarially
chosen) distribution, and then communicates with a randomly sampled neighbor,
exchanging information to determine its policy in the next round. We introduce
and analyze several families of dynamics for this task that are decentralized:
each node's decision is entirely local and depends only on its most recently
obtained reward and that of the neighbor it sampled. We show a connection
between the global evolution of these decentralized dynamics and a certain
class of "zero-sum" multiplicative weights update algorithms, and we develop a
general framework for analyzing the population-level regret of these natural
protocols. Using this framework, we derive sublinear regret bounds under a wide
range of parameter regimes (i.e., the size of the population and number of
arms) for both the stationary reward setting (where the mean of each arm's
distribution is fixed over time) and the adversarial reward setting (where
means can vary over time). Further, we show that these protocols can
approximately optimize convex functions over the simplex when the reward
distributions are generated from a stochastic gradient oracle.

### QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

Mixture-of-Experts (MoE) architectures offer a general solution to the high
inference costs of large language models (LLMs) via sparse routing, bringing
faster and more accurate models, at the cost of massive parameter counts. For
example, the SwitchTransformer-c2048 model has 1.6 trillion parameters,
requiring 3.2TB of accelerator memory to run efficiently, which makes practical
deployment challenging and expensive. In this paper, we present a solution to
this memory problem, in the form of a new compression and execution framework
called QMoE. Specifically, QMoE consists of a scalable algorithm which
accurately compresses trillion-parameter MoEs to less than 1 bit per parameter,
in a custom format co-designed with bespoke GPU decoding kernels to facilitate
efficient end-to-end compressed inference, with minor runtime overheads
relative to uncompressed execution. Concretely, QMoE can compress the 1.6
trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x
compression, 0.8 bits per parameter) at only minor accuracy loss, in less than
a day on a single GPU. This enables, for the first time, the execution of a
trillion-parameter model on affordable commodity hardware, like a single server
with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead
relative to ideal uncompressed inference. The source code and compressed models
are available at github.com/IST-DASLab/qmoe.
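
The memory figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes 16-bit weights for the uncompressed model (the abstract does not state the precision, but 2 bytes per parameter reproduces the 3.2TB figure) and decimal units (1 GB = $10^9$ bytes). The function name `compression_stats` is illustrative.

```python
def compression_stats(num_params, dense_bits_per_param, compressed_bytes):
    """Return uncompressed size, bits per parameter, and compression ratio."""
    dense_bytes = num_params * dense_bits_per_param / 8
    return {
        "uncompressed_tb": dense_bytes / 1e12,
        "bits_per_param": compressed_bytes * 8 / num_params,
        "compression_ratio": dense_bytes / compressed_bytes,
    }


# 1.6 trillion parameters, assumed 16-bit weights, 160 GB compressed (upper bound).
stats = compression_stats(
    num_params=1_600_000_000_000,
    dense_bits_per_param=16,
    compressed_bytes=160_000_000_000,
)

if __name__ == "__main__":
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Since the abstract says "less than 160GB", the 0.8 bits per parameter and 20x ratio computed here are bounds rather than exact values.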

### Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

Asynchronous distributed algorithms are a popular way to reduce
synchronization costs in large-scale optimization, and in particular for neural
network training. However, for nonsmooth and nonconvex objectives, few
convergence guarantees exist beyond cases where closed-form proximal operator
solutions are available. As most popular contemporary deep neural networks lead
to nonsmooth and nonconvex objectives, there is now a pressing need for such
convergence guarantees. In this paper, we analyze for the first time the
convergence of stochastic asynchronous optimization for this general class of
objectives. In particular, we focus on stochastic subgradient methods allowing
for block variable partitioning, where the shared-memory-based model is
asynchronously updated by concurrent processes. To this end, we first introduce
a probabilistic model which captures key features of real asynchronous
scheduling between concurrent processes; under this model, we establish
convergence with probability one to an invariant set for stochastic subgradient
methods with momentum.
From the practical perspective, one issue with the family of methods we
consider is that it is not efficiently supported by machine learning
frameworks, as they mostly focus on distributed data-parallel strategies. To
address this, we propose a new implementation strategy for shared-memory based
training of deep neural networks, whereby concurrent parameter servers are
utilized to train a partitioned but shared model in single- and multi-GPU
settings. Based on this implementation, we achieve on average 1.2x speed-up in
comparison to state-of-the-art training methods for popular image
classification tasks without compromising accuracy.

### Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms

There has been significant progress in understanding the parallelism inherent
to iterative sequential algorithms: for many classic algorithms, the depth of
the dependence structure is now well understood, and scheduling techniques have
been developed to exploit this shallow dependence structure for efficient
parallel implementations. A related, applied research strand has studied
methods by which certain iterative task-based algorithms can be efficiently
parallelized via relaxed concurrent priority schedulers. These allow for high
concurrency when inserting and removing tasks, at the cost of executing
superfluous work due to the relaxed semantics of the scheduler.
In this work, we take a step towards unifying these two research directions,
by showing that there exists a family of relaxed priority schedulers that can
efficiently and deterministically execute classic iterative algorithms such as
greedy maximal independent set (MIS) and matching. Our primary result shows
that, given a randomized scheduler with an expected relaxation factor of $k$ in
terms of the maximum allowed priority inversions on a task, and any graph on
$n$ vertices, the scheduler is able to execute greedy MIS with only an additive
factor of poly($k$) expected additional iterations compared to an exact (but
not scalable) scheduler. This counter-intuitive result demonstrates that the
overhead of relaxation when computing MIS is not dependent on the input size or
structure of the input graph. Experimental results show that this overhead can
be clearly offset by the gain in performance due to the highly scalable
scheduler. In sum, we present an efficient method to deterministically
parallelize iterative sequential algorithms, with provable runtime guarantees
in terms of the number of executed tasks to completion.

Comment: PODC 2018, pages 377-386 in proceedings
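
The idea of running greedy MIS under a relaxed scheduler can be sketched as follows. This is a simplified model, not the paper's scheduler: it assumes that each iteration returns a uniformly random task among the `k` highest-priority pending ones, and that a task whose higher-priority neighbors are still undecided counts as a wasted iteration and stays in the queue. All function names are illustrative.

```python
import random


def random_graph(n, p, seed=0):
    """Erdos-Renyi-style graph as a list of adjacency sets."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def greedy_mis(adj):
    """Sequential greedy MIS; a lower vertex label means higher priority."""
    mis = set()
    for v in range(len(adj)):
        if not any(u in mis for u in adj[v]):
            mis.add(v)
    return mis


def relaxed_mis(adj, k, seed=0):
    """Greedy MIS driven by a hypothetical k-relaxed scheduler.

    Returns (mis, iterations). With k == 1 the scheduler is exact.
    """
    rng = random.Random(seed)
    pending = list(range(len(adj)))  # stays sorted by priority
    mis, decided = set(), set()
    iterations = 0
    while pending:
        iterations += 1
        idx = rng.randrange(min(k, len(pending)))
        v = pending[idx]
        higher = [u for u in adj[v] if u < v]
        if any(u in mis for u in higher):
            pass  # a higher-priority neighbor is in the MIS, so v is excluded
        elif any(u not in decided for u in higher):
            continue  # dependency unresolved: wasted iteration, v stays pending
        else:
            mis.add(v)
        decided.add(v)
        pending.pop(idx)
    return mis, iterations


if __name__ == "__main__":
    graph = random_graph(200, 0.05, seed=1)
    for k in (1, 4, 16):
        result, iters = relaxed_mis(graph, k, seed=2)
        print(k, result == greedy_mis(graph), iters - len(graph))
```

The output set is the same as the sequential one regardless of the random choices, because a vertex is only decided once all of its higher-priority neighbors are. The relaxation shows up only as extra iterations beyond the number of vertices.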

### The Power of Choice in Priority Scheduling

Consider the following random process: we are given $n$ queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is $O( n )$ while the expected worst-case
cost is $O( n \log n )$. These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance.
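
The random process described above is easy to simulate sequentially. The sketch below is a minimal model under stated assumptions: the queues are prefilled, each step performs one insertion followed by one removal, the `d` candidate queues are sampled without replacement, and empty candidates are ignored (resampling if all are empty). Setting `d=1` gives the single-choice process and `d=2` the two-choice process. The function name `average_removal_rank` is illustrative.

```python
import random
from collections import deque
from itertools import takewhile


def average_removal_rank(n_queues, steps, d, prefill=1000, seed=0):
    """Average rank of removed labels when choosing among d random queues."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(n_queues)]
    next_label = 0

    def insert():
        nonlocal next_label
        # Labels increase over time, so every queue stays sorted.
        rng.choice(queues).append(next_label)
        next_label += 1

    for _ in range(prefill):
        insert()

    total_rank = 0
    for _ in range(steps):
        insert()
        candidates = [q for q in rng.sample(queues, d) if q]
        while not candidates:
            candidates = [q for q in rng.sample(queues, d) if q]
        # Remove the lowest label (highest priority) among the candidates.
        best = min(candidates, key=lambda q: q[0])
        label = best.popleft()
        # Rank 1 means the removed label was the global minimum.
        smaller = sum(
            sum(1 for _ in takewhile(lambda x: x < label, q)) for q in queues
        )
        total_rank += 1 + smaller
    return total_rank / steps


if __name__ == "__main__":
    for d in (1, 2):
        print(d, average_removal_rank(n_queues=8, steps=5000, d=d, seed=1))
```

Running both settings with the same parameters gives a rough empirical feel for the gap between the single-choice and two-choice processes; it is an illustration, not a substitute for the bounds stated in the abstract.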

### Why Extension-Based Proofs Fail

We introduce extension-based proofs, a class of impossibility proofs that
includes valency arguments. They are modelled as an interaction between a
prover and a protocol. Using proofs based on combinatorial topology, it has
been shown that it is impossible to deterministically solve k-set agreement
among n > k > 1 processes in a wait-free manner in certain asynchronous models.
However, it was unknown whether proofs based on simpler techniques were
possible. We show that this impossibility result cannot be obtained for one of
these models by an extension-based proof and, hence, extension-based proofs are
limited in power.

Comment: This version of the paper is for the NIS model. Previous versions of
the paper are for the NIIS model.
