38 research outputs found

    On the Expressive Power of Self-Attention Matrices

    Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. In practice, this matrix typically exhibits two properties: (1) it is sparse, meaning that each token only attends to a small subset of other tokens; and (2) it changes dynamically depending on the input to the module. With these considerations in mind, we ask the following question: Can a fixed self-attention module approximate arbitrary sparse patterns depending on the input? How small is the hidden size d required for such approximation? We make progress in answering this question and show that the self-attention matrix can provably approximate sparse matrices, where sparsity is in terms of a bounded number of nonzero elements in each row and column. While the parameters of self-attention are fixed, various sparse matrices can be approximated by only modifying the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. Our proof is constructive, enabling us to propose an algorithm for finding adaptive inputs and fixed self-attention parameters in order to approximate a given matrix. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, d grows only logarithmically with the sequence length L (i.e. d = O(log L)).
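
    As an illustrative aside, the minimal sketch below shows the Johnson-Lindenstrauss ingredient behind the d = O(log L) result: a Gaussian random projection into roughly log L dimensions approximately preserves the pairwise geometry of L token vectors. The dimensions and constants are toy choices, and this is not the paper's construction of adaptive inputs and attention parameters.

```python
# Johnson-Lindenstrauss sketch: project L points into d = O(log L) dimensions
# and check that pairwise distances are roughly preserved. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
L, high_dim = 512, 1024                          # sequence length, original dim
d = int(8 * np.log(L))                           # hidden size growing as O(log L)

X = rng.normal(size=(L, high_dim))               # L "token" vectors
P = rng.normal(size=(high_dim, d)) / np.sqrt(d)  # Gaussian JL projection
Y = X @ P                                        # projected tokens, shape (L, d)

# Compare a pairwise squared distance before and after projection.
i, j = 3, 7
orig = np.sum((X[i] - X[j]) ** 2)
proj = np.sum((Y[i] - Y[j]) ** 2)
print(f"distortion ratio: {proj / orig:.3f}")    # close to 1 with high probability
```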

    The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings

    We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications.
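
    As a rough illustration of the orthogonality benefit described above, the sketch below estimates the angular kernel K(x, y) = 1 - 2*theta/pi with sign-based random features, using either i.i.d. Gaussian directions or the rows of a random orthogonal matrix obtained by QR decomposition. It is a toy comparison under arbitrary settings, not the structured (complex-entry) construction analyzed in the paper.

```python
# Compare i.i.d. Gaussian vs. random orthogonal directions for estimating the
# angular kernel via sign features; orthogonal rows typically reduce variance.
import numpy as np

rng = np.random.default_rng(0)
dim, m, trials = 64, 64, 200                    # data dim, #features, repetitions

x, y = rng.normal(size=dim), rng.normal(size=dim)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
true_kernel = 1 - 2 * theta / np.pi             # angular kernel value

def estimate(W):
    # Unbiased sign-based estimator: E[sign(w.x) sign(w.y)] = 1 - 2*theta/pi.
    return np.mean(np.sign(W @ x) * np.sign(W @ y))

errs_iid, errs_ort = [], []
for _ in range(trials):
    G = rng.normal(size=(m, dim))               # i.i.d. Gaussian directions
    Q, _ = np.linalg.qr(G.T)                    # orthonormal columns
    errs_iid.append((estimate(G) - true_kernel) ** 2)
    errs_ort.append((estimate(Q.T) - true_kernel) ** 2)

print("MSE, i.i.d. directions     :", np.mean(errs_iid))
print("MSE, orthogonal directions :", np.mean(errs_ort))
```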

    CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices

    We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform -- a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.
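
    The compact WY identity at the heart of the approach can be checked directly: a product of k Householder reflections H_i = I - tau_i v_i v_i^T equals I - V T V^T for a small k x k upper-triangular T, so the whole product is applied with a few matrix products rather than k sequential reflections. The sketch below is a plain NumPy verification of that identity under toy sizes; it is not the paper's GPU/TPU training setup or its T-CWY Stiefel variant.

```python
# Verify the compact WY representation: H_1 H_2 ... H_k = I - V T V^T, where
# T is k x k upper triangular and is built by a short recursion. Toy sizes only.
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4
V = rng.normal(size=(n, k))                      # Householder vectors v_1..v_k
tau = 2.0 / np.sum(V * V, axis=0)                # tau_i = 2 / ||v_i||^2

# Recursion: T_j = [[T_{j-1}, -tau_j T_{j-1} V_{j-1}^T v_j], [0, tau_j]].
T = np.zeros((k, k))
T[0, 0] = tau[0]
for j in range(1, k):
    T[:j, j] = -tau[j] * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
    T[j, j] = tau[j]

Q_cwy = np.eye(n) - V @ T @ V.T                  # compact WY form of the product

# Reference: multiply the k reflections sequentially.
Q_seq = np.eye(n)
for i in range(k):
    v = V[:, [i]]
    Q_seq = Q_seq @ (np.eye(n) - tau[i] * v @ v.T)

print(np.allclose(Q_cwy, Q_seq))                 # True: same orthogonal matrix
print(np.allclose(Q_cwy.T @ Q_cwy, np.eye(n)))   # True: orthogonality check
```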

    Excluding Hooks and their Complements

    The long-standing Erdős-Hajnal conjecture states that for every undirected graph H there exists ϵ(H) > 0 such that every n-vertex graph G that does not contain H as an induced subgraph contains a clique or an independent set of size at least n^{ϵ(H)}. A natural weakening of the conjecture states that the polynomial-size clique/independent set phenomenon occurs if one excludes both H and its complement H^c. These conjectures have been shown to hold for only a handful of graphs: it is not even known if they hold for all graphs on 5 vertices. In a recent breakthrough, the symmetrized version of the Erdős-Hajnal conjecture was shown to hold for all paths. The goal of this paper is to show that the symmetrized conjecture holds for all trees on 6 (or fewer) vertices. In fact this is a consequence of showing that the symmetrized conjecture holds for any path with a pendant edge at its third vertex; thus we also give a new infinite family of graphs for which the symmetrized conjecture holds.
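
    For reference, a compact formal statement of the conjectures discussed above, in standard notation where ω(G) and α(G) denote the clique and independence numbers of G; this is a restatement for clarity, not text from the paper.

```latex
% Erdős–Hajnal conjecture: for every graph H there exists \epsilon(H) > 0 such
% that every n-vertex graph G with no induced copy of H satisfies
%   \max\{\omega(G), \alpha(G)\} \ge n^{\epsilon(H)}.
% Symmetrized (weaker) version, the one proved here for trees on at most 6 vertices:
\forall H \;\exists\, \epsilon(H) > 0:\quad
  G \text{ contains neither } H \text{ nor } H^{c} \text{ as an induced subgraph}
  \;\Longrightarrow\; \max\{\omega(G), \alpha(G)\} \ge |V(G)|^{\epsilon(H)}
```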

    UFO-BLO: Unbiased First-Order Bilevel Optimization

    Bilevel optimization (BLO) is a popular approach with many applications including hyperparameter optimization, neural architecture search, adversarial robustness and model-agnostic meta-learning. However, the approach suffers from time and memory complexity proportional to the length r of its inner optimization loop, which has led to several modifications being proposed. One such modification is first-order BLO (FO-BLO) which approximates outer-level gradients by zeroing out second derivative terms, yielding significant speed gains and requiring only constant memory as r varies. Despite FO-BLO's popularity, there is a lack of theoretical understanding of its convergence properties. We make progress by demonstrating a rich family of examples where FO-BLO-based stochastic optimization does not converge to a stationary point of the BLO objective. We address this concern by proposing a new FO-BLO-based unbiased estimate of outer-level gradients, enabling us to theoretically guarantee this convergence, with no harm to memory and expected time complexity. Our findings are supported by experimental results on Omniglot and Mini-ImageNet, popular few-shot meta-learning benchmarks.
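
    The nature of the FO-BLO bias can be seen in a one-dimensional toy problem: the inner loop runs r gradient steps from an initialization lam, the exact hypergradient differentiates through those steps, and the first-order heuristic simply drops the resulting factor. The losses, constants, and step sizes below are invented for illustration and are unrelated to the paper's Omniglot/Mini-ImageNet experiments or to its proposed unbiased estimator.

```python
# 1-D illustration of the first-order bilevel heuristic: compare the exact
# hypergradient (chain rule through the inner loop) with the FO approximation
# that treats d(theta_r)/d(lam) as 1, dropping second-derivative information.
a, b = 2.0, 1.0          # inner loss  g(theta) = 0.5 * a * (theta - b)^2
c = -1.0                 # outer loss  f(theta) = 0.5 * (theta - c)^2
eta, r = 0.1, 5          # inner step size and number of inner steps
lam = 3.0                # outer variable: the inner-loop initialization

# Inner loop: r gradient steps on g, starting at theta_0 = lam.
theta = lam
for _ in range(r):
    theta -= eta * a * (theta - b)

# Exact hypergradient: d f / d lam = (theta_r - c) * d theta_r / d lam,
# where d theta_r / d lam = (1 - eta * a)^r for this quadratic inner loss.
exact_grad = (theta - c) * (1 - eta * a) ** r

# FO heuristic: pretend d theta_r / d lam = 1.
fo_grad = theta - c

print(f"exact hypergradient : {exact_grad:.4f}")
print(f"FO approximation    : {fo_grad:.4f}  (biased)")
```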

    Sub-Linear Memory: How to Make Performers SLiM

    The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring O(L^2) in serial time and memory as functions of input length L. Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of L (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory during training, and still requires O(L) time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.
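
    The sketch below illustrates why linear (Performer-style) attention admits small-memory, chunked computation: with a feature map phi, the non-causal output is phi(Q) (phi(K)^T V) normalized by phi(Q) (phi(K)^T 1), so the sequence can be streamed while carrying only small running sums and the L x L attention matrix is never formed. The feature map here is a generic positive placeholder rather than FAVOR+, and only the forward pass is shown, not the sublinear-memory backward pass analyzed in the paper.

```python
# Chunked linear attention: accumulate phi(K)^T V and phi(K)^T 1 as running
# sums, then emit outputs chunk by chunk; no L x L matrix is materialized.
import numpy as np

rng = np.random.default_rng(0)
L, d, m, chunk = 1024, 32, 64, 128               # toy sizes
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
W = rng.normal(size=(d, m))

def phi(X):
    # Placeholder positive feature map (not the paper's FAVOR+ features).
    return np.maximum(X @ W, 0.0) + 1e-6

# Pass 1: accumulate the m x d summary S and the m-dim normalizer z in chunks.
S, z = np.zeros((m, d)), np.zeros(m)
for s in range(0, L, chunk):
    Pk = phi(K[s:s + chunk])
    S += Pk.T @ V[s:s + chunk]
    z += Pk.sum(axis=0)

# Pass 2: outputs chunk by chunk; peak memory is O(chunk * m + m * d), not O(L^2).
out = np.empty((L, d))
for s in range(0, L, chunk):
    Pq = phi(Q[s:s + chunk])
    out[s:s + chunk] = (Pq @ S) / (Pq @ z)[:, None]

print(out.shape)                                 # (1024, 32)
```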

    Debiasing a First-order Heuristic for Approximate Bi-level Optimization

    Approximate bi-level optimization (ABLO) consists of (outer-level) optimization problems, involving numerical (inner-level) optimization loops. While ABLO has many applications across deep learning, it suffers from time and memory complexity proportional to the length r of its inner optimization loop. To address this complexity, an earlier first-order method (FOM) was proposed as a heuristic that omits second derivative terms, yielding significant speed gains and requiring only constant memory. Despite FOM's popularity, there is a lack of theoretical understanding of its convergence properties. We contribute by theoretically characterizing FOM's gradient bias under mild assumptions. We further demonstrate a rich family of examples where FOM-based SGD does not converge to a stationary point of the ABLO objective. We address this concern by proposing an unbiased FOM (UFOM) enjoying constant memory complexity as a function of r. We characterize the introduced time-variance tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a given ABLO problem. Finally, we propose an efficient adaptive UFOM scheme.

    Geometrically Coupled Monte Carlo Sampling

    Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including quasi-Monte Carlo, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.
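
    One simple instance of coupling samples rather than drawing them independently is orthogonal coupling: make the sample directions mutually orthogonal while redrawing radii so that each sample is still marginally Gaussian. The toy estimator below (the mean of |v.x| under a standard Gaussian) and all settings are illustrative assumptions, not the optimal-transport-based schemes developed in the paper; the point is only that coupled samples can reduce estimator variance.

```python
# Compare i.i.d. Gaussian Monte Carlo samples with orthogonally coupled samples
# for estimating E|v.x|, x ~ N(0, I), whose exact value is sqrt(2/pi) * ||v||.
import numpy as np

rng = np.random.default_rng(0)
dim, k, trials = 64, 64, 500
v = rng.normal(size=dim)
true_value = np.sqrt(2 / np.pi) * np.linalg.norm(v)

def f(X):
    return np.abs(X @ v)

def iid_samples():
    return rng.normal(size=(k, dim))

def orthogonal_samples():
    Q, _ = np.linalg.qr(rng.normal(size=(dim, k)))     # orthonormal directions
    radii = np.linalg.norm(rng.normal(size=(k, dim)), axis=1)
    return Q.T * radii[:, None]                        # marginally ~ N(0, I)

mse_iid = np.mean([(f(iid_samples()).mean() - true_value) ** 2
                   for _ in range(trials)])
mse_ort = np.mean([(f(orthogonal_samples()).mean() - true_value) ** 2
                   for _ in range(trials)])
print("MSE, i.i.d. samples    :", mse_iid)
print("MSE, orthogonal samples:", mse_ort)             # typically smaller
```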

    Orthogonal Estimation of Wasserstein Distances

    Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning.
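
    A minimal version of the sliced Wasserstein estimator mentioned above: project both samples onto one-dimensional directions, sort, and average the resulting one-dimensional distances; swapping i.i.d. directions for the rows of a random orthogonal matrix is the orthogonal-coupling idea. The data and settings below are toy placeholders, not the paper's proposed variant or its large-scale experiments.

```python
# Sliced Wasserstein-2 between two point clouds via 1-D sorting, with either
# i.i.d. or mutually orthogonal projection directions (requires n_proj <= dim).
import numpy as np

rng = np.random.default_rng(0)
n, dim, n_proj = 500, 16, 16

X = rng.normal(size=(n, dim))                    # sample from distribution P
Y = rng.normal(size=(n, dim)) + 0.5              # sample from a shifted Q

def sliced_w2(X, Y, directions):
    # For equal sample sizes, the 1-D W2^2 is the mean squared difference of
    # the sorted projections; average over all projection directions.
    px, py = X @ directions.T, Y @ directions.T
    px, py = np.sort(px, axis=0), np.sort(py, axis=0)
    return np.mean((px - py) ** 2)

D_iid = rng.normal(size=(n_proj, dim))           # i.i.d. directions on the sphere
D_iid /= np.linalg.norm(D_iid, axis=1, keepdims=True)

Q, _ = np.linalg.qr(rng.normal(size=(dim, n_proj)))
D_ort = Q.T                                      # orthonormal directions

print("sliced W2^2, i.i.d. directions     :", sliced_w2(X, Y, D_iid))
print("sliced W2^2, orthogonal directions :", sliced_w2(X, Y, D_ort))
```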