8 research outputs found
Distributed Variance Reduction with Optimal Communication
We consider the problem of distributed variance reduction: machines each
receive probabilistic estimates of an unknown true vector, and must
cooperate to find a common estimate of this vector with lower variance, while
minimizing communication.
Variance reduction is closely related to the well-studied problem of
distributed mean estimation, and is a key procedure in instances of distributed
optimization, such as data-parallel stochastic gradient descent. Previous work
typically assumes an upper bound on the norm of the input vectors, and achieves
an output variance bound in terms of this norm. However, in real applications,
the input vectors can be concentrated around the true vector, while the true
vector itself has large norm. In this case, output variance bounds in
terms of input norm perform poorly, and may even increase variance.
In this paper, we show that output variance need not depend on input norm. We
provide a method of quantization which allows variance reduction to be
performed with solution quality dependent only on input variance, not on input
norm, and show an analogous result for mean estimation. This method is
effective over a wide range of communication regimes, from sublinear to
superlinear in the dimension. We also provide lower bounds showing that in many
cases the communication to output variance trade-off is asymptotically optimal.
Further, we show experimentally that our method yields improvements for common
optimization tasks, when compared to prior approaches to distributed mean
estimation.
Comment: 28 pages, 14 figures
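The norm-versus-variance distinction can be sketched with standard stochastic-rounding quantization (an illustration only, not the paper's exact scheme): if each machine quantizes its offset from a shared reference rather than the raw input, the error introduced depends on the quantization step and the inputs' spread, not on their possibly huge norm.

```python
import random

def stochastic_round(x, step):
    """Round x to a multiple of `step`; unbiased in expectation."""
    lo = (x // step) * step
    p = (x - lo) / step            # probability of rounding up
    return lo + step if random.random() < p else lo

def quantized_mean(inputs, reference, step):
    """Each machine transmits only its quantized offset from a shared
    reference; the coordinator averages the offsets and adds the
    reference back.  The error depends on `step` and the spread of the
    inputs, not on their norm."""
    offsets = [stochastic_round(x - reference, step) for x in inputs]
    return reference + sum(offsets) / len(offsets)

random.seed(0)
true_value = 1e6                          # large norm ...
inputs = [true_value + random.gauss(0, 0.1) for _ in range(8)]  # ... tiny spread
estimate = quantized_mean(inputs, reference=inputs[0], step=0.05)
```

Here the estimate stays within a fraction of the inputs' spread of the true vector, even though naively quantizing the raw values to the same bit budget would incur error proportional to their norm.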
Flare: Flexible In-Network Allreduce
The allreduce operation is one of the most commonly used communication
routines in distributed applications. To improve its bandwidth and to reduce
network traffic, this operation can be accelerated by offloading it to network
switches, which aggregate the data received from the hosts and send the
aggregated result back. However, existing solutions provide limited
customization opportunities and might provide suboptimal performance when
dealing with custom operators and data types, with sparse data, or when
reproducibility of the aggregation is a concern. To deal with these problems,
in this work we design a flexible programmable switch using PsPIN, a RISC-V
architecture implementing the sPIN programming model, as a building block. We
then design, model, and analyze different algorithms for executing the
aggregation on this architecture, showing performance improvements compared to
state-of-the-art approaches.
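For reference, the semantics being offloaded are simple: an allreduce combines each element across all hosts (here with summation) and returns the combined vector to every host. A minimal software simulation of the switch-side aggregation (illustrative only; Flare's sPIN handlers run on the switch itself):

```python
def allreduce_sum(host_buffers):
    """Simulate switch-side allreduce: element-wise sum of the vectors
    received from all hosts, then 'multicast' the result back so that
    every host ends up with the same aggregated vector."""
    aggregated = [0] * len(host_buffers[0])
    for buf in host_buffers:
        for i, v in enumerate(buf):
            aggregated[i] += v
    return [list(aggregated) for _ in host_buffers]

hosts = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
results = allreduce_sum(hosts)   # every host receives [111, 222, 333]
```

Performing this summation inside the switch halves the traffic each host's link carries compared to a host-based reduction tree, which is what motivates the in-network designs the abstract compares against.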
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Large Language Models (LLMs) from the GPT family have become extremely
popular, leading to a race towards reducing their inference costs to allow for
efficient local computation. Yet, the vast majority of existing work focuses on
weight-only quantization, which can reduce runtime costs in the memory-bound
one-token-at-a-time generative setting, but does not address them in
compute-bound scenarios, such as batched inference or prompt processing. In
this paper, we address the general quantization problem, where both weights and
activations should be quantized. We show, for the first time, that the majority
of inference computations for large generative models such as LLaMA, OPT, and
Falcon can be performed with both weights and activations being cast to 4 bits,
in a way that leads to practical speedups, while at the same time maintaining
good accuracy. We achieve this via a hybrid quantization strategy called QUIK,
which compresses most of the weights and activations to 4-bit, while keeping
some outlier weights and activations in higher precision. The key feature of
our scheme is that it is designed with computational efficiency in mind: we
provide GPU kernels matching the QUIK format with highly-efficient layer-wise
runtimes, which lead to practical end-to-end throughput improvements of up to
3.4x relative to FP16 execution. We provide detailed studies for models from
the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate
inference using quantization plus 2:4 sparsity. Code is available at:
https://github.com/IST-DASLab/QUIK
Comment: 16 pages
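The core idea of hybrid quantization can be sketched in a few lines of NumPy (a simplified illustration; QUIK's actual format, outlier selection, and GPU kernels differ): the columns with the largest magnitudes are kept in full precision, and the remainder is symmetrically quantized to signed 4-bit integers.

```python
import numpy as np

def hybrid_quantize(w, n_outlier_cols=2, bits=4):
    """Keep the columns with the largest absolute values in full
    precision; symmetrically quantize the rest to signed `bits`-bit
    integers.  (Illustrative sketch -- not QUIK's actual kernels.)"""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    mask = np.zeros(w.shape[1], dtype=bool)         # True = outlier column
    mask[np.argsort(np.abs(w).max(axis=0))[-n_outlier_cols:]] = True
    base = w[:, ~mask]
    scale = max(np.abs(base).max() / qmax, 1e-12)
    q = np.round(base / scale).astype(np.int8)      # values in [-7, 7]
    return q, scale, w[:, mask], mask

def dequantize(q, scale, outlier_vals, mask):
    out = np.empty((q.shape[0], mask.size))
    out[:, ~mask] = q * scale
    out[:, mask] = outlier_vals
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))
w[:, 2] *= 50.0                        # make column 2 an outlier
q, scale, outliers, mask = hybrid_quantize(w)
w_approx = dequantize(q, scale, outliers, mask)
```

Isolating the few outlier columns keeps the quantization scale of the remaining bulk small, so the per-entry rounding error stays bounded by half the scale instead of being inflated by the outliers.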
The spatial computer: A model for energy-efficient parallel computation
We present a new parallel model of computation suitable for spatial
architectures, for which the energy used for communication heavily depends on
the distance of the communicating processors. In our model, processors have
locations on a conceptual two-dimensional grid, and their distance therein
determines their communication cost. In particular, we introduce the energy
cost of a spatial computation, which measures the total distance traveled by
all messages, and study the depth of communication, which measures the largest
number of hops of a chain of messages. We show matching energy lower and upper
bounds for many foundational problems, including sorting, median selection, and
matrix multiplication. Our model does not depend on any parameters other than
the input shape and size, simplifying algorithm analysis. We also show how to
simulate PRAM algorithms in our model and how to obtain results for a more
complex model that introduces the size of the local memories of the processors
as a parameter.
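To make the energy measure concrete, here is a minimal sketch (our own illustration, with Manhattan distance standing in for grid distance) that evaluates the energy cost of a set of messages:

```python
def grid_distance(p, q):
    """Manhattan distance between two grid coordinates."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def energy_cost(messages):
    """Energy of a spatial computation: total grid distance traveled
    by all messages, each given as a (source, destination) pair."""
    return sum(grid_distance(src, dst) for src, dst in messages)

# three processors on a 2x2 grid sending to the processor at (0, 0)
msgs = [((0, 1), (0, 0)), ((1, 0), (0, 0)), ((1, 1), (0, 0))]
total = energy_cost(msgs)        # 1 + 1 + 2 = 4
```

Under this measure, a naive all-to-one gather pays distance proportional to the grid diameter for far-away senders, which is why distance-aware algorithms (e.g. tree-shaped aggregation over neighbors) can achieve asymptotically lower energy.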
Motif Prediction with Graph Neural Networks
Link prediction is one of the central problems in graph mining. However,
recent studies highlight the importance of higher-order network analysis, where
complex structures called motifs are the first-class citizens. We first show
that existing link prediction schemes fail to effectively predict motifs. To
alleviate this, we establish a general motif prediction problem and we propose
several heuristics that assess the chances for a specified motif to appear. To
make the scores realistic, our heuristics consider - among others -
correlations between links, i.e., the potential impact of some arriving links
on the appearance of other links in a given motif. Finally, for highest
accuracy, we develop a graph neural network (GNN) architecture for motif
prediction. Our architecture offers vertex features and sampling schemes that
capture the rich structural properties of motifs. While our heuristics are fast
and do not need any training, GNNs ensure highest accuracy of predicting
motifs, both for dense (e.g., k-cliques) and for sparse ones (e.g., k-stars).
We consistently outperform the best available competitor by more than 10% on
average and up to 32% in area under the curve. Importantly, the advantages of
our approach over schemes based on uncorrelated link prediction increase with
the increasing motif size and complexity. We also successfully apply our
architecture to predicting more general clusters and communities,
illustrating its potential for graph mining beyond motif analysis.
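A toy version of such a correlation-aware heuristic (our own illustration, not one of the paper's heuristics) might score a candidate triangle by counting its existing links and weighting each missing link by the Jaccard similarity of its endpoints' neighborhoods:

```python
def jaccard(adj, a, b):
    """Neighborhood similarity of vertices a and b."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0

def triangle_score(adj, u, v, w):
    """Score a candidate triangle {u, v, w}: existing edges count 1.0;
    each missing edge contributes the neighborhood similarity of its
    endpoints, a proxy for how likely that link is to appear."""
    score = 0.0
    for a, b in [(u, v), (u, w), (v, w)]:
        score += 1.0 if b in adj[a] else jaccard(adj, a, b)
    return score

adj = {1: {2, 3, 4}, 2: {1, 3, 5}, 3: {1, 2, 4}, 4: {1, 3}, 5: {2}}
triangle_score(adj, 1, 2, 3)   # complete triangle -> 3.0
triangle_score(adj, 2, 3, 4)   # edge (2, 4) missing -> 1 + 1 + 2/3
```

Scoring whole motifs jointly like this, rather than multiplying independent per-link probabilities, is the kind of correlation the abstract argues uncorrelated link prediction misses.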
ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations
Important graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as Bloom filters. These representations are much faster to process than the original vertex sets thanks to vectorizability and small size. We use these representations as building blocks in important parallel graph mining algorithms such as Clique Counting or Clustering. When enhanced with ProbGraph, these algorithms significantly outperform tuned parallel exact baselines (up to nearly 50x on 32 cores) while ensuring accuracy of more than 90% for many input graph datasets. Our novel bounds and algorithms based on probabilistic set representations with desirable statistical properties are of separate interest for the data analytics community.
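The flavor of the approach can be sketched with a small Bloom filter used to estimate common-neighbor counts, the kernel inside algorithms like triangle counting (a simplified illustration; ProbGraph's estimators and accuracy bounds are more involved):

```python
import hashlib

class BloomFilter:
    """Compact probabilistic set: membership tests may return false
    positives but never false negatives."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, [False] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def estimate_common_neighbors(neighbors_u, bloom_v):
    """Estimate |N(u) & N(v)| by probing v's Bloom filter with u's
    neighbors -- fast and small, though false positives may overcount."""
    return sum(1 for w in neighbors_u if w in bloom_v)

n_u, n_v = {1, 2, 3, 4}, {3, 4, 5}
bloom_v = BloomFilter()
for w in n_v:
    bloom_v.add(w)
estimate = estimate_common_neighbors(n_u, bloom_v)   # exact answer is 2
```

Because the filter is a fixed-size bit array, the probe loop vectorizes well and never touches the original (potentially huge) neighbor list of v, which is where the reported speedups come from.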