8 research outputs found

    Distributed Variance Reduction with Optimal Communication

    Full text link
    We consider the problem of distributed variance reduction: nn machines each receive probabilistic estimates of an unknown true vector Δ\Delta, and must cooperate to find a common estimate of Δ\Delta with lower variance, while minimizing communication. Variance reduction is closely related to the well-studied problem of distributed mean estimation, and is a key procedure in instances of distributed optimization, such as data-parallel stochastic gradient descent. Previous work typically assumes an upper bound on the norm of the input vectors, and achieves an output variance bound in terms of this norm. However, in real applications, the input vectors can be concentrated around the true vector Δ\Delta, but Δ\Delta itself may have large norm. In this case, output variance bounds in terms of input norm perform poorly, and may even increase variance. In this paper, we show that output variance need not depend on input norm. We provide a method of quantization which allows variance reduction to be performed with solution quality dependent only on input variance, not on input norm, and show an analogous result for mean estimation. This method is effective over a wide range of communication regimes, from sublinear to superlinear in the dimension. We also provide lower bounds showing that in many cases the communication to output variance trade-off is asymptotically optimal. Further, we show experimentally that our method yields improvements for common optimization tasks, when compared to prior approaches to distributed mean estimation.Comment: 28 pages, 14 figure

    Flare: Flexible In-Network Allreduce

    Full text link
    The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches

    QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

    Full text link
    Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher-precision. The key feature of our scheme is that it is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Code is available at: https://github.com/IST-DASLab/QUIK.Comment: 16 page

    Flare: Flexible in-network allreduce

    No full text
    The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-The-Art approaches

    The spatial computer: A model for energy-efficient parallel computation

    Full text link
    We present a new parallel model of computation suitable for spatial architectures, for which the energy used for communication heavily depends on the distance of the communicating processors. In our model, processors have locations on a conceptual two-dimensional grid, and their distance therein determines their communication cost. In particular, we introduce the energy cost of a spatial computation, which measures the total distance traveled by all messages, and study the depth of communication, which measures the largest number of hops of a chain of messages. We show matching energy lower and upper bounds for many foundational problems, including sorting, median selection, and matrix multiplication. Our model does not depend on any parameters other than the input shape and size, simplifying algorithm analysis. We also show how to simulate PRAM algorithms in our model and how to obtain results for a more complex model that introduces the size of the local memories of the processors as a parameter

    Motif Prediction with Graph Neural Networks

    Full text link
    Link prediction is one of the central problems in graph mining. However, recent studies highlight the importance of higher-order network analysis, where complex structures called motifs are the first-class citizens. We first show that existing link prediction schemes fail to effectively predict motifs. To alleviate this, we establish a general motif prediction problem and we propose several heuristics that assess the chances for a specified motif to appear. To make the scores realistic, our heuristics consider - among others - correlations between links, i.e., the potential impact of some arriving links on the appearance of other links in a given motif. Finally, for highest accuracy, we develop a graph neural network (GNN) architecture for motif prediction. Our architecture offers vertex features and sampling schemes that capture the rich structural properties of motifs. While our heuristics are fast and do not need any training, GNNs ensure highest accuracy of predicting motifs, both for dense (e.g., k-cliques) and for sparse ones (e.g., k-stars). We consistently outperform the best available competitor by more than 10% on average and up to 32% in area under the curve. Importantly, the advantages of our approach over schemes based on uncorrelated link prediction increase with the increasing motif size and complexity. We also successfully apply our architecture for predicting more arbitrary clusters and communities, illustrating its potential for graph mining beyond motif analysis

    ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations

    No full text
    Important graph mining problems such as Clustering are computationally demanding. To significantly accelerate these problems, we propose ProbGraph: a graph representation that enables simple and fast approximate parallel graph mining with strong theoretical guarantees on work, depth, and result accuracy. The key idea is to represent sets of vertices using probabilistic set representations such as Bloom filters. These representations are much faster to process than the original vertex sets thanks to vectorizability and small size. We use these representations as building blocks in important parallel graph mining algorithms such as Clique Counting or Clustering. When enhanced with ProbGraph, these algorithms significantly outperform tuned parallel exact baselines (up to nearly 50x on 32 cores) while ensuring accuracy of more than 90% for many input graph datasets. Our novel bounds and algorithms based on probabilistic set representations with desirable statistical properties are of separate interest for the data analytics community
    corecore