
    Learning Compressed Transforms with Low Displacement Rank

    The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks also outperform general unstructured layers while using more than 20x fewer parameters.
    Comment: NeurIPS 2018. Code available at https://github.com/HazyResearch/structured-net
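
    As a rough illustration of the LDR idea, the sketch below (plain NumPy/SciPy, not the authors' code) parameterizes a weight matrix W implicitly by two diagonal displacement operators A and B and rank-r factors G, H, and recovers W from the Sylvester-type displacement equation A W - W B = G H^T; the paper instead learns more general structured operators end-to-end.

```python
# Minimal NumPy/SciPy sketch of a low-displacement-rank (LDR) parameterization.
# Illustrative only -- the paper learns structured operators end-to-end; here A
# and B are simply diagonal with disjoint spectra so the Sylvester equation
# A W - W B = G H^T has a unique solution.
import numpy as np
from scipy.linalg import solve_sylvester

n, r = 64, 4                       # matrix size and displacement rank
rng = np.random.default_rng(0)

a = rng.uniform(0.5, 1.0, n)       # spectrum of A in (0.5, 1)
b = rng.uniform(-1.0, -0.5, n)     # spectrum of B in (-1, -0.5), disjoint from A
G = rng.standard_normal((n, r))
H = rng.standard_normal((n, r))

A, B = np.diag(a), np.diag(b)
# Recover the dense weight matrix W from its displacement representation:
# A W - W B = G H^T  <=>  A W + W (-B) = G H^T
W = solve_sylvester(A, -B, G @ H.T)

x = rng.standard_normal(n)
y = W @ x                          # forward pass, materialized densely here

dense_params = n * n
ldr_params = 2 * n + 2 * n * r     # diagonal operators + low-rank factors
print(f"dense: {dense_params} params, LDR: {ldr_params} params")
```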

    Efficient Inferencing of Compressed Deep Neural Networks

    The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments such as mobile phones, IoT edge devices, and "inferencing as a service" environments in the cloud. Prior work has considered reducing model size through compression techniques such as pruning, quantization, and Huffman encoding. However, efficient inferencing using the compressed models has received little attention, especially with Huffman encoding in place. In this paper, we propose efficient parallel algorithms for inferencing of single images and batches under various memory constraints. Our experimental results show that our approach of using a variable batch size for inferencing achieves a 15-25% improvement in inference throughput for AlexNet, while maintaining memory and latency constraints.
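
    To make the setting concrete, here is a toy sketch (not the paper's parallel algorithm) of a layer whose quantized weights are stored as Huffman-coded codebook indices: the weights are decoded once and then reused across a batch, which is the intuition behind choosing the batch size under memory and latency constraints. The codebook, prefix codes, and sizes are made up for illustration.

```python
# Illustrative sketch: inference with a layer whose weights are stored as
# Huffman-coded quantization indices plus a small codebook. Decoding once and
# reusing the weights for a whole batch amortizes the decode cost.
import numpy as np

def huffman_decode(bits, decode_table):
    """Decode a bit string into a list of codebook indices (prefix-free codes)."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in decode_table:
            out.append(decode_table[cur])
            cur = ""
    return out

# Toy 2x4 weight matrix quantized to a 3-entry codebook (hypothetical values).
codebook = np.array([0.0, 0.5, -0.5])
decode_table = {"0": 0, "10": 1, "11": 2}           # hypothetical prefix codes
bits = "0" "10" "11" "0" "11" "0" "0" "10"          # 8 coded indices, row-major

idx = huffman_decode(bits, decode_table)
W = codebook[np.array(idx)].reshape(2, 4)           # decode once ...

X = np.random.default_rng(1).standard_normal((4, 16))  # ... reuse for a batch of 16
Y = W @ X
print(Y.shape)
```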

    Towards Efficient Large-Scale Graph Neural Network Computing

    Recent deep learning models have moved beyond low-dimensional regular grids such as images, video, and speech to high-dimensional graph-structured data, such as social networks, brain connections, and knowledge graphs. This evolution has led to large graph-based irregular and sparse models that go beyond what existing deep learning frameworks are designed for. Furthermore, these models are not easily amenable to efficient acceleration at scale on parallel hardware (e.g., GPUs). We introduce NGra, the first parallel processing framework for graph-based deep neural networks (GNNs). NGra presents a new SAGA-NN model for expressing deep neural networks as vertex programs, with each layer expressed in well-defined (Scatter, ApplyEdge, Gather, ApplyVertex) graph operation stages. This model not only allows GNNs to be expressed intuitively, but also facilitates the mapping to an efficient dataflow representation. NGra addresses the scalability challenge transparently through automatic graph partitioning and chunk-based stream processing out of GPU core or over multiple GPUs, which carefully considers data locality, data movement, and the overlapping of parallel processing and data movement. NGra further achieves efficiency through highly optimized Scatter/Gather operators on GPUs, despite the sparsity involved. Our evaluation shows that NGra scales to large real graphs that none of the existing frameworks can handle directly, while achieving a speedup of up to about 4x, even at small scales, over a multiple-baseline design on TensorFlow.
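
    The SAGA-NN stages can be illustrated with a minimal NumPy sketch of one GNN layer (sum aggregation with a linear edge function and a ReLU vertex update); this shows only the programming model, not NGra's partitioned, streamed dataflow execution.

```python
# Minimal sketch of one GNN layer written in the SAGA-NN style
# (Scatter, ApplyEdge, Gather, ApplyVertex). Illustration of the programming
# model only; graph and dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, feat = 5, 8
edges = np.array([[0, 1], [1, 2], [2, 0], [3, 4], [4, 0]])   # (src, dst) pairs
H = rng.standard_normal((num_nodes, feat))                   # vertex features
W = rng.standard_normal((feat, feat))                        # layer weights

# Scatter: each edge reads the feature of its source vertex.
src_feats = H[edges[:, 0]]

# ApplyEdge: per-edge computation (here just a linear transform).
msgs = src_feats @ W

# Gather: accumulate messages at each destination vertex (sum aggregation).
agg = np.zeros_like(H)
np.add.at(agg, edges[:, 1], msgs)

# ApplyVertex: per-vertex update with a nonlinearity.
H_next = np.maximum(agg, 0.0)   # ReLU
print(H_next.shape)
```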

    Learning Random Fourier Features by Hybrid Constrained Optimization

    The kernel embedding algorithm is an important component for adapting kernel methods to large datasets. Since the algorithm incurs a major computational cost in the testing phase, we propose a novel teacher-learner framework for learning computation-efficient kernel embeddings from specific data. In this framework, the high-precision embeddings (teacher) transfer the data information to the computation-efficient kernel embeddings (learner). We jointly select informative embedding functions and pursue an orthogonal transformation between the two embeddings. We propose a novel approach of constrained variational expectation maximization (CVEM), where the alternating direction method of multipliers (ADMM) is applied over a nonconvex domain in the maximization step. We also propose two specific formulations based on the prevalent Random Fourier Features (RFF), the masked and blocked versions of Computation-Efficient RFF (CERF), obtained by imposing a random binary mask or a block structure on the transformation matrix. Through empirical studies of several applications on different real-world datasets, we demonstrate that CERF significantly improves the performance of kernel methods over RFF under certain arithmetic operation requirements, and is suitable for structured matrix multiplication in Fastfood-type algorithms.
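
    For reference, the sketch below shows a plain random Fourier feature map for the Gaussian kernel and a masked variant in the spirit of CERF, where a random binary mask sparsifies the projection matrix; the mask here is fixed at random rather than learned via the CVEM/ADMM procedure described in the abstract.

```python
# Sketch of a plain random Fourier feature (RFF) map for the Gaussian kernel and
# a masked variant: a random binary mask is imposed on the projection matrix so
# fewer multiplications are needed per feature. Illustrative only; the paper
# learns the mask/transformation via constrained optimization.
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 16, 256, 0.5                 # input dim, #features, kernel bandwidth

W = rng.normal(0.0, np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(X, W, b):
    """z(x) = sqrt(2/D) * cos(W x + b); z(x).z(y) ~ exp(-gamma * ||x - y||^2)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

mask = rng.random((D, d)) < 0.25           # keep ~25% of the entries of W
W_masked = W * mask

X = rng.standard_normal((100, d))
K_approx = rff(X, W, b) @ rff(X, W, b).T                # full RFF kernel estimate
K_masked = rff(X, W_masked, b) @ rff(X, W_masked, b).T  # cheaper masked variant
print(K_approx.shape, K_masked.shape)
```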

    Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

    The application of deep learning techniques has resulted in remarkable improvements to machine learning models. This paper provides detailed characterizations of the deep learning models used in many Facebook social network services. We present the computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics, and computing platforms to address the challenges of the workloads often run in data centers.

    Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

    In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
    Comment: ICML 2018 Conference Proceedings
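
    A minimal sketch of the kind of orthogonal convolution initialization discussed above: a "delta-orthogonal" kernel that places a random orthogonal channel-mixing block at the spatial center tap and zeros elsewhere, so the layer is norm-preserving at initialization. This is an illustration, not the authors' exact construction.

```python
# Sketch of a delta-orthogonal convolution initialization: an orthogonal
# c_in x c_out block at the spatial center of the kernel, zeros at all other
# taps, so the convolution acts as an orthogonal (norm-preserving) map.
import numpy as np

def delta_orthogonal_kernel(k, c_in, c_out, rng):
    assert c_out >= c_in, "orthogonality needs at least as many output channels"
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((c_out, c_out)))
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = q[:c_in, :]     # orthogonal block at the center tap
    return kernel

rng = np.random.default_rng(0)
K = delta_orthogonal_kernel(k=3, c_in=16, c_out=16, rng=rng)
center = K[1, 1]
print(np.allclose(center @ center.T, np.eye(16)))   # True: rows are orthonormal
```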

    Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

    The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of peta-scale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. All optimizations are guided by a performance analysis and modelling process that indicates how the computational bottlenecks change with each optimization step. Finally we use the optimized node-level KPM with a hybrid-parallel framework to perform large-scale heterogeneous electronic structure calculations for novel topological materials on a petascale-class Cray XC30 system.
    Comment: 10 pages, 12 figures
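
    The computational core of KPM is a Chebyshev three-term recurrence of sparse matrix-vector products, averaged over random probe vectors; the sketch below shows this kernel in SciPy with single vectors only (the paper's optimizations instead operate on blocks of vectors). The matrix, sizes, and scaling are made up for illustration.

```python
# Sketch of the core KPM kernel: Chebyshev moments mu_m = v^T T_m(A) v computed
# with a three-term recurrence of sparse matrix-vector products, averaged over
# random probe vectors. A is rescaled so its spectrum lies in [-1, 1].
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, M, R = 2000, 64, 8                      # matrix size, #moments, #probe vectors

A = sp.random(n, n, density=1e-3, random_state=0)
A = (A + A.T) * 0.5                        # symmetric sparse matrix
A = A / (abs(A).sum(axis=1).max())         # crude rescaling into [-1, 1] (Gershgorin)
A = sp.csr_matrix(A)

moments = np.zeros(M)
for _ in range(R):
    v = rng.choice([-1.0, 1.0], size=n)    # random +/-1 probe vector
    t_prev, t_curr = v, A @ v              # T_0(A) v, T_1(A) v
    moments[0] += v @ t_prev
    moments[1] += v @ t_curr
    for m in range(2, M):
        t_next = 2.0 * (A @ t_curr) - t_prev   # Chebyshev recurrence
        moments[m] += v @ t_next
        t_prev, t_curr = t_curr, t_next
moments /= R
print(moments[:4])
```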

    Magnus integrators on multicore CPUs and GPUs

    In the present paper we consider numerical methods to solve the discrete Schrödinger equation with a time-dependent Hamiltonian (motivated by problems encountered in the study of spin systems). We will consider both short-range interactions, which lead to evolution equations involving sparse matrices, and long-range interactions, which lead to dense matrices. Both of these settings show very different computational characteristics. We use Magnus integrators for time integration and employ a framework based on Leja interpolation to compute the resulting action of the matrix exponential. We consider both traditional Magnus integrators (which are extensively used for these types of problems in the literature) as well as the recently developed commutator-free Magnus integrators and implement them on modern CPU and GPU (graphics processing unit) based systems. We find that GPUs can yield a significant speed-up (up to a factor of 10 in the dense case) for these types of problems. In the sparse case GPUs are only advantageous for large problem sizes and the achieved speed-ups are more modest. In most cases the commutator-free variant is superior, but especially on the GPU this advantage is rather small. In fact, none of the advantage of commutator-free methods on GPUs (and on multi-core CPUs) is due to the elimination of commutators. This has important consequences for the design of more efficient numerical methods.
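
    As a point of reference, the simplest (second-order) Magnus integrator for i psi'(t) = H(t) psi(t) advances the state with the action of exp(-i dt H) evaluated at the interval midpoint; the hedged sketch below uses SciPy's expm_multiply in place of the paper's Leja-interpolation framework, on a made-up tridiagonal (short-range) Hamiltonian.

```python
# Sketch of the second-order (exponential midpoint) Magnus integrator for a
# Schroedinger-type equation i psi'(t) = H(t) psi(t). The action of the matrix
# exponential is computed here with scipy's expm_multiply for illustration.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import expm_multiply

n, dt, steps = 128, 0.01, 50

def H(t):
    """Hypothetical time-dependent tridiagonal (short-range) Hamiltonian."""
    main = 2.0 + np.cos(t) * np.linspace(-1, 1, n)
    off = -np.ones(n - 1)
    return diags([off, main, off], offsets=[-1, 0, 1], format="csr")

psi = np.zeros(n, dtype=complex)
psi[n // 2] = 1.0                          # localized initial state

t = 0.0
for _ in range(steps):
    H_mid = H(t + 0.5 * dt)                # evaluate H at the interval midpoint
    psi = expm_multiply(-1j * dt * H_mid, psi)
    t += dt

print(np.linalg.norm(psi))                 # ~1: the evolution is unitary
```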

    Randomized Algorithms for Matrix Computations

    ACM 204 is a graduate course on randomized algorithms for matrix computations. It was taught for the first time in Winter 2020. The course begins with Monte Carlo algorithms for trace estimation. This is a relatively simple setting that allows us to explore how randomness can be used for matrix computations. We continue with a discussion of the randomized power method and the Lanczos method for estimating the largest eigenvalue of a symmetric matrix. For these algorithms, the randomized starting point regularizes the trajectory of the iterations. The Lanczos iteration and randomized trace estimation fuse together in the stochastic Lanczos quadrature method for estimating the trace of a matrix function. Then we turn to Monte Carlo sampling methods for matrix approximation. This approach is justified by the matrix Bernstein inequality, a powerful tool for matrix approximation. As a simple example, we develop sampling methods for approximate matrix multiplication. In the next part of the course, we study random linear embeddings. These are random matrices that can reduce the dimension of a dataset while approximately preserving its geometry. First, we treat Gaussian embeddings in detail, and then we discuss structured embeddings that can be implemented using fewer computational resources. Afterward, we describe several ways to use random embeddings to solve over-determined least-squares problems. We continue with a detailed treatment of the randomized SVD algorithm, the most widely used technique from this area. We give a complete a priori analysis with detailed error bounds. Then we show how to modify this algorithm for the streaming setting, where the matrix is presented as a sequence of linear updates. Last, we show how to develop an effective algorithm for selecting influential columns and rows from a matrix to obtain skeleton or CUR factorizations. The next section of the course studies kernel matrices that arise in high-dimensional data analysis. We discuss positive-definite kernels and outline the computational issues associated with solving linear algebra problems involving kernels. We introduce random feature approximations and Nyström approximations based on randomized sampling. This area is still not fully developed. The last part of the course gives a complete presentation of the sparse Cholesky algorithm of Kyng & Sachdeva [KS16], including a full proof of correctness.
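
    Two of the course's building blocks in miniature, for orientation: a Hutchinson (Monte Carlo) trace estimator and the basic Gaussian range-finder randomized SVD. These are minimal sketches, not the lecture implementations.

```python
# Minimal sketches of two randomized matrix algorithms: Hutchinson trace
# estimation and the basic randomized SVD with a Gaussian test matrix.
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(A, num_samples=100, rng=rng):
    """Estimate trace(A) as the average of z^T A z over random +/-1 vectors z."""
    n = A.shape[0]
    est = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)
        est += z @ (A @ z)
    return est / num_samples

def randomized_svd(A, rank, oversample=10, rng=rng):
    """Randomized SVD via a Gaussian range finder followed by a small dense SVD."""
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis for the range of A
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

A = rng.standard_normal((500, 300))
print("trace estimate:", hutchinson_trace(A.T @ A), "exact:", np.trace(A.T @ A))
U, s, Vt = randomized_svd(A, rank=20)
print("approx error:", np.linalg.norm(A - U @ np.diag(s) @ Vt))
```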

    Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

    Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps. However, choosing which of the myriad structured transformations to use (and its associated parameterization) is a laborious task that requires trading off speed, space, and accuracy. We consider a different approach: we introduce a family of matrices called kaleidoscope matrices (K-matrices) that provably capture any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity. We empirically validate that K-matrices can be automatically learned within end-to-end pipelines to replace hand-crafted procedures, in order to improve model quality. For example, replacing channel shuffles in ShuffleNet improves classification accuracy on ImageNet by up to 5%. K-matrices can also simplify hand-engineered pipelines -- we replace filter bank feature computation in speech data preprocessing with a learnable kaleidoscope layer, resulting in only 0.4% loss in accuracy on the TIMIT speech recognition task. In addition, K-matrices can capture latent structure in models: for a challenging permuted image classification task, a K-matrix based representation of permutations is able to learn the right latent structure and improves accuracy of a downstream convolutional model by over 9%. We provide a practically efficient implementation of our approach, and use K-matrices in a Transformer network to attain 36% faster end-to-end inference speed on a language translation task.
    Comment: International Conference on Learning Representations (ICLR) 2020 spotlight
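
    The building block behind K-matrices is the butterfly factorization: log2(n) sparse factors, each mixing coordinate pairs at a fixed stride with 2x2 blocks, giving O(n log n) parameters instead of n^2. The sketch below applies one such product in NumPy; it is an illustration of the structure, not the paper's optimized implementation.

```python
# Illustrative sketch of a butterfly matrix: log2(n) levels, each mixing
# coordinate pairs at a fixed stride with per-pair 2x2 blocks.
import numpy as np

def butterfly_apply(x, twiddles):
    """Apply one butterfly matrix (product of log2(n) sparse factors) to x."""
    n = x.shape[0]
    y = x.astype(float)
    for level, M in enumerate(twiddles):      # M has shape (n//2, 2, 2)
        s = 2 ** level
        idx = np.arange(n).reshape(-1, 2 * s)
        top, bot = idx[:, :s].ravel(), idx[:, s:].ravel()
        t, b = y[top].copy(), y[bot].copy()
        y[top] = M[:, 0, 0] * t + M[:, 0, 1] * b
        y[bot] = M[:, 1, 0] * t + M[:, 1, 1] * b
    return y

n = 16
rng = np.random.default_rng(0)
levels = int(np.log2(n))
twiddles = [rng.standard_normal((n // 2, 2, 2)) for _ in range(levels)]

x = rng.standard_normal(n)
y = butterfly_apply(x, twiddles)
print(y.shape, "params:", levels * (n // 2) * 4, "vs dense:", n * n)
```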