Learning Compressed Transforms with Low Displacement Rank
The low displacement rank (LDR) framework for structured matrices represents
a matrix through two displacement operators and a low-rank residual. Existing
use of LDR matrices in deep learning has applied fixed displacement operators
encoding forms of shift invariance akin to convolutions. We introduce a class
of LDR matrices with more general displacement operators, and explicitly learn
over both the operators and the low-rank component. This class generalizes
several previous constructions while preserving compression and efficient
computation. We prove bounds on the VC dimension of multi-layer neural networks
with structured weight matrices and show empirically that our compact
parameterization can reduce the sample complexity of learning. When replacing
weight layers in fully-connected, convolutional, and recurrent neural networks
for image classification and language modeling tasks, our new classes exceed
the accuracy of existing compression approaches, and on some tasks also
outperform general unstructured layers while using more than 20x fewer
parameters.
Comment: NeurIPS 2018. Code available at
https://github.com/HazyResearch/structured-net
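The core object here is the Sylvester displacement equation. Below is a minimal NumPy/SciPy sketch of the representation (our illustration, not the paper's code): a matrix M is stored through operators A, B and rank-r factors G, H satisfying A M - M B = G H^T, and reconstructed with a dense Sylvester solver.

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Illustrative sketch of the LDR representation (not the paper's code).
# M is parameterized by displacement operators A, B and a rank-r residual
# G @ H.T satisfying the Sylvester displacement  A @ M - M @ B = G @ H.T.
# Structured operators admit much faster reconstruction than this dense solve.
n, r = 8, 2
rng = np.random.default_rng(0)
A = np.diag(rng.standard_normal(n))   # learnable operator (diagonal here)
B = np.diag(rng.standard_normal(n))   # learnable operator
G = rng.standard_normal((n, r))       # low-rank factors: only 2*n*r parameters
H = rng.standard_normal((n, r))
M = solve_sylvester(A, -B, G @ H.T)   # solves A M + M (-B) = G H^T
assert np.linalg.matrix_rank(A @ M - M @ B) <= r
```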
Efficient Inferencing of Compressed Deep Neural Networks
The large number of weights in deep neural networks makes the models difficult
to deploy in low-memory environments such as mobile phones, IoT edge devices,
and "inferencing as a service" environments in the cloud. Prior work has
considered reducing model size through compression techniques such as pruning,
quantization, and Huffman encoding. However, efficient inferencing with the
compressed models has received little attention, especially when Huffman
encoding is in place. In this paper, we propose efficient parallel algorithms
for inferencing on single images and batches under various memory constraints.
Our experimental results show that our approach of using a variable batch size
for inferencing achieves a 15-25% performance improvement in inference
throughput for AlexNet, while maintaining memory and latency constraints.
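As a hedged illustration of the decode-then-compute pattern the abstract refers to (not the paper's algorithms), the sketch below stores quantized weights as a hypothetical prefix code, decodes one layer on the fly, and applies it to a batch:

```python
import numpy as np

# Hypothetical prefix code mapping bitstrings to quantized weight values.
code = {'0': -0.5, '10': 0.0, '11': 0.5}

def decode_weights(bits, shape):
    """Decode a Huffman-style bitstring into a dense weight matrix."""
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in code:            # prefix property: first match is the symbol
            out.append(code[cur])
            cur = ''
    return np.array(out).reshape(shape)

W = decode_weights('0' * 4 + '10' * 4 + '11' * 4, (3, 4))
x = np.ones((5, 4))                # batch of 5 inputs
y = x @ W.T                        # decoded layer applied to the whole batch
```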
Towards Efficient Large-Scale Graph Neural Network Computing
Recent deep learning models have moved beyond low-dimensional regular grids
such as images, video, and speech, to high-dimensional graph-structured data,
such as social networks, brain connections, and knowledge graphs. This
evolution has led to large graph-based irregular and sparse models that go
beyond what existing deep learning frameworks are designed for. Further, these
models are not easily amenable to efficient acceleration at scale on parallel
hardware (e.g., GPUs). We introduce NGra, the first parallel processing
framework for graph-based deep neural networks (GNNs). NGra presents a new
SAGA-NN model for expressing deep neural networks as vertex programs, with
each layer expressed in well-defined (Scatter, ApplyEdge, Gather, ApplyVertex)
graph operation
stages. This model not only allows GNNs to be expressed intuitively, but also
facilitates the mapping to an efficient dataflow representation. NGra addresses
the scalability challenge transparently through automatic graph partitioning
and chunk-based stream processing out of GPU core or over multiple GPUs, which
carefully considers data locality, data movement, and overlapping of parallel
processing and data movement. NGra further achieves efficiency through highly
optimized Scatter/Gather operators on GPUs despite their sparsity. Our
evaluation shows that NGra scales to large real graphs that none of the
existing frameworks can handle directly, while achieving up to about a 4x
speedup even at small scales over a multiple-baseline design on TensorFlow.
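Below is a hedged sketch of the SAGA-NN programming model for one GCN-style layer, using the four stage names from the abstract; the implementation is our own NumPy illustration, not NGra's dataflow code:

```python
import numpy as np

def saga_nn_layer(src, dst, H, W):
    # Scatter: push each vertex's feature vector onto its outgoing edges.
    edge_in = H[src]                   # (num_edges, d)
    # ApplyEdge: per-edge computation (identity here; could be attention).
    edge_out = edge_in
    # Gather: accumulate edge values at their destination vertices.
    agg = np.zeros_like(H)
    np.add.at(agg, dst, edge_out)      # sum over incoming edges
    # ApplyVertex: per-vertex neural-network update.
    return np.maximum(agg @ W, 0.0)    # ReLU(aggregated features @ W)

src = np.array([0, 1, 2, 2])           # edge sources
dst = np.array([1, 2, 0, 1])           # edge destinations
H = np.random.default_rng(0).standard_normal((3, 4))
W = np.random.default_rng(1).standard_normal((4, 4))
H_next = saga_nn_layer(src, dst, H, W)
```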
Learning Random Fourier Features by Hybrid Constrained Optimization
The kernel embedding algorithm is an important component for adapting kernel
methods to large datasets. Since the algorithm incurs a major computational
cost in the testing phase, we propose a novel teacher-learner framework for
learning computation-efficient kernel embeddings from specific data. In the
framework, the high-precision embeddings (teacher) transfer the data
information to the computation-efficient kernel embeddings (learner). We
jointly select informative embedding functions and pursue an orthogonal
transformation between two embeddings. We propose a novel approach of
constrained variational expectation maximization (CVEM), where the alternating
direction method of multipliers (ADMM) is applied over a nonconvex domain in the
maximization step. We also propose two specific formulations based on the
prevalent Random Fourier Features (RFF): masked and blocked versions of
Computation-Efficient RFF (CERF), obtained by imposing a random binary mask or
a block structure on the transformation matrix. Through empirical studies of
several applications on different real-world datasets, we demonstrate that
CERF significantly improves the performance of kernel methods over RFF under
given arithmetic operation budgets, and is suitable for structured matrix
multiplication in Fastfood-type algorithms.
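For reference, here is a minimal sketch of plain Random Fourier Features for a Gaussian kernel, plus the masking idea the abstract describes; the CVEM-trained CERF itself is not reproduced, and the mask density is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 10, 256, 1.0

W = rng.standard_normal((D, d)) / sigma     # spectral samples of the kernel
b = rng.uniform(0.0, 2 * np.pi, size=D)
mask = rng.random((D, d)) < 0.25            # random binary mask, ~25% kept

def rff(x, W):
    # z(x) . z(y) approximates k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
approx = rff(x, W) @ rff(y, W)               # dense RFF estimate
masked = rff(x, W * mask) @ rff(y, W * mask) # cheaper masked projection
```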
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
The application of deep learning techniques has resulted in remarkable
improvements to machine learning models. This paper provides detailed
characterizations of the deep learning models used in many Facebook social
network services. We present the computational characteristics of our models,
describe high-performance optimizations targeting existing systems, point out
their limitations, and make suggestions for future general-purpose/accelerated
inference hardware. We also highlight the need for better co-design of
algorithms, numerics, and computing platforms to address the challenges of
workloads often run in data centers.
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
In recent years, state-of-the-art methods in computer vision have utilized
increasingly deep convolutional neural network architectures (CNNs), with some
of the most successful models employing hundreds or even thousands of layers. A
variety of pathologies such as vanishing/exploding gradients make training such
deep networks challenging. While residual connections and batch normalization
do enable training at these depths, it has remained unclear whether such
specialized architecture designs are truly necessary to train deep CNNs. In
this work, we demonstrate that it is possible to train vanilla CNNs with ten
thousand layers or more simply by using an appropriate initialization scheme.
We derive this initialization scheme theoretically by developing a mean field
theory for signal propagation and by characterizing the conditions for
dynamical isometry, the equilibration of singular values of the input-output
Jacobian matrix. These conditions require that the convolution operator be an
orthogonal transformation in the sense that it is norm-preserving. We present
an algorithm for generating such random initial orthogonal convolution kernels
and demonstrate empirically that they enable efficient training of extremely
deep architectures.
Comment: ICML 2018 Conference Proceedings
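A sketch of a delta-orthogonal convolution initializer in the spirit of the paper: a random orthogonal matrix is placed at the spatial center of the kernel with zeros elsewhere, so the convolution is norm-preserving at initialization (the shapes and QR construction below are our choices):

```python
import numpy as np

def delta_orthogonal(k, c_in, c_out, rng):
    assert c_out >= c_in, "orthogonality needs c_out >= c_in"
    # Random matrix with orthonormal columns via QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((c_out, c_in)))
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = Q.T        # orthogonal tap at the spatial center
    return kernel

rng = np.random.default_rng(0)
K = delta_orthogonal(3, 16, 32, rng)    # 3x3 kernel, 16 -> 32 channels
center = K[1, 1]                        # (c_in, c_out) center tap
assert np.allclose(center @ center.T, np.eye(16), atol=1e-10)
```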
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
The Kernel Polynomial Method (KPM) is a well-established scheme in quantum
physics and quantum chemistry to determine the eigenvalue density and spectral
properties of large sparse matrices. In this work we demonstrate the high
optimization potential and feasibility of peta-scale heterogeneous CPU-GPU
implementations of the KPM. At the node level we show that it is possible to
decouple the sparse matrix problem posed by KPM from main memory bandwidth both
on CPU and GPU. To alleviate the effects of scattered data access we combine
loosely coupled outer iterations with tightly coupled block sparse matrix
multiple vector operations, which enables pure data streaming. All
optimizations are guided by a performance analysis and modelling process that
indicates how the computational bottlenecks change with each optimization step.
Finally we use the optimized node-level KPM with a hybrid-parallel framework to
perform large scale heterogeneous electronic structure calculations for novel
topological materials on a petascale-class Cray XC30 system.
Comment: 10 pages, 12 figures
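The computational core of KPM is a Chebyshev three-term recurrence driven by sparse matrix-vector products. A hedged sketch, assuming the Hamiltonian has been rescaled so its spectrum lies in [-1, 1]:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import norm as sparse_norm

def kpm_moments(H, v, M):
    # Chebyshev moments mu_m = <v| T_m(H) |v> via the recurrence
    # T_{m+1}(H) v = 2 H T_m(H) v - T_{m-1}(H) v.
    w0, w1 = v, H @ v
    mu = [v @ w0, v @ w1]
    for _ in range(2, M):
        w0, w1 = w1, 2.0 * (H @ w1) - w0
        mu.append(v @ w1)
    return np.array(mu)

n = 1000
H = sp.random(n, n, density=1e-2, random_state=0)
# Symmetrize and rescale so the spectrum lies in [-1, 1]
# (the Frobenius norm bounds the spectral radius).
H = (0.5 * (H + H.T) / sparse_norm(H)).tocsr()
v = np.random.default_rng(0).standard_normal(n)
v /= np.linalg.norm(v)
mu = kpm_moments(H, v, 64)  # average over many v for a stochastic trace
```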
Magnus integrators on multicore CPUs and GPUs
In the present paper we consider numerical methods to solve the discrete
Schr\"odinger equation with a time dependent Hamiltonian (motivated by problems
encountered in the study of spin systems). We will consider both short-range
interactions, which lead to evolution equations involving sparse matrices, and
long-range interactions, which lead to dense matrices. Both of these settings
show very different computational characteristics. We use Magnus integrators
for time integration and employ a framework based on Leja interpolation to
compute the resulting action of the matrix exponential. We consider both
traditional Magnus integrators (which are extensively used for these types of
problems in the literature) as well as the recently developed commutator-free
Magnus integrators and implement them on modern CPU and GPU (graphics
processing unit) based systems.
We find that GPUs can yield a significant speed-up in the dense case for
these types of problems. In the sparse case GPUs are only
advantageous for large problem sizes and the achieved speed-ups are more
modest. In most cases the commutator-free variant is superior, but especially
on the GPU this advantage is rather small. In fact, none of the advantage of
commutator-free methods on GPUs (and on multi-core CPUs) is due to the
elimination of commutators. This has important consequences for the design of
more efficient numerical methods.
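As a concrete reference point, here is a sketch of the simplest (second-order) Magnus integrator, the exponential midpoint rule; the paper evaluates the matrix-exponential action via Leja interpolation, for which we substitute SciPy's expm_multiply, and the Hamiltonian below is hypothetical:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

def magnus2_step(H_of_t, psi, t, h):
    # Exponential midpoint rule:  psi_{n+1} = exp(-i h H(t_n + h/2)) psi_n.
    H_mid = H_of_t(t + 0.5 * h)          # sample the Hamiltonian at midpoint
    return expm_multiply(-1j * h * H_mid, psi)

n = 64
diag = sp.diags(np.arange(n, dtype=float))
def H_of_t(t):
    # Hypothetical sparse time-dependent Hamiltonian (nearest-neighbor hops).
    hop = sp.diags([np.full(n - 1, np.cos(t))] * 2, offsets=[-1, 1])
    return (diag + hop).tocsc()

psi = np.zeros(n, dtype=complex); psi[0] = 1.0
psi = magnus2_step(H_of_t, psi, t=0.0, h=0.01)
assert abs(np.linalg.norm(psi) - 1.0) < 1e-10  # unitary step preserves norm
```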
Randomized Algorithms for Matrix Computations
ACM 204 is a graduate course on randomized algorithms for matrix computations. It was taught for the first time in Winter 2020.
The course begins with Monte Carlo algorithms for trace estimation. This is a relatively simple setting that allows us to explore how randomness can be used for matrix computations. We continue with a discussion of the randomized power method and the Lanczos method for estimating the largest eigenvalue of a symmetric matrix. For these algorithms, the randomized starting point regularizes the trajectory of the iterations. The Lanczos iteration and randomized trace estimation fuse together in the stochastic Lanczos quadrature method for estimating the trace of a matrix function.
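A minimal sketch of the Girard-Hutchinson trace estimator that opens the course: for a random probe z with E[z z^T] = I, the quadratic form z^T A z is an unbiased estimate of trace(A), so averaging over Rademacher probes needs only matrix-vector products:

```python
import numpy as np

def hutchinson_trace(matvec, n, num_samples, rng):
    # E[z^T A z] = trace(A) when E[z z^T] = I; average over i.i.d. probes.
    total = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe
        total += z @ matvec(z)
    return total / num_samples

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)); A = A @ A.T
est = hutchinson_trace(lambda v: A @ v, 200, 100, rng)
print(est, np.trace(A))                        # close for modest sample sizes
```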
Then we turn to Monte Carlo sampling methods for matrix approximation. This approach is justified by the matrix Bernstein inequality, a powerful tool for matrix approximation. As a simple example, we develop sampling methods for approximate matrix multiplication.
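For instance, a sketch of sampling-based approximate matrix multiplication: A @ B is a sum of column-row outer products, and importance-sampling s of them with probabilities proportional to their norms gives an unbiased estimate whose concentration the matrix Bernstein inequality controls:

```python
import numpy as np

def sampled_matmul(A, B, s, rng):
    # Sample column-row pairs with probability proportional to their norms,
    # rescale by 1/(s * p_k) so the estimator is unbiased for A @ B.
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=s, p=p)
    return sum(np.outer(A[:, k], B[k, :]) / (s * p[k]) for k in idx)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 400)); B = rng.standard_normal((400, 30))
approx = sampled_matmul(A, B, s=200, rng=rng)
err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
```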
In the next part of the course, we study random linear embeddings. These are random matrices that can reduce the dimension of a dataset while approximately preserving its geometry. First, we treat Gaussian embeddings in detail, and then we discuss structured embeddings that can be implemented using fewer computational resources. Afterward, we describe several ways to use random embeddings to solve over-determined least-squares problems.
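Here is a sketch of sketch-and-solve for over-determined least squares, using a Gaussian embedding for clarity (a structured embedding would reduce the cost of forming S @ A):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10000, 50, 400                 # k rows suffice for n-dim geometry
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

S = rng.standard_normal((k, m)) / np.sqrt(k)   # Gaussian random embedding
# Solve the small sketched problem min ||S A x - S b|| instead of the tall one.
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
rel = np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b)
```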
We continue with a detailed treatment of the randomized SVD algorithm, the most widely used technique from this area. We give a complete a priori analysis with detailed error bounds. Then we show how to modify this algorithm for the streaming setting, where the matrix is presented as a sequence of linear updates. Last, we show how to develop an effective algorithm for selecting influential columns and rows from a matrix to obtain skeleton or CUR factorizations.
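The basic randomized SVD fits in a few lines: a Gaussian test matrix samples the range of A, QR orthonormalizes the sample, and a small dense SVD is lifted back (the oversampling value is a conventional choice; power iterations, omitted here, help when the spectrum decays slowly):

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, rng=None):
    rng = rng or np.random.default_rng()
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)       # orthonormal basis for the range
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 80)) @ rng.standard_normal((80, 300))  # rank 80
U, s, Vt = randomized_svd(A, rank=80, rng=rng)
err = np.linalg.norm(U @ np.diag(s) @ Vt - A)   # near machine precision here
```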
The next section of the course studies kernel matrices that arise in high-dimensional data analysis. We discuss positive-definite kernels and outline the computational issues associated with solving linear algebra problems involving kernels. We introduce random feature approximations and Nyström approximations based on randomized sampling. This area is still not fully developed.
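A sketch of a Nyström approximation built from uniformly sampled landmark columns; the kernel, landmark count, and sampling rule below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))

def gauss_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

# Nystrom: K ~ K[:, I] @ pinv(K[I, I]) @ K[I, :] for landmark set I,
# reducing O(n^2) kernel storage to O(n |I|).
I = rng.choice(len(X), size=100, replace=False)   # uniform landmarks
C = gauss_kernel(X, X[I])                          # n x 100 landmark columns
W = C[I]                                           # 100 x 100 core block
K_nys = C @ np.linalg.pinv(W) @ C.T                # rank-100 approximation
```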
The last part of the course gives a complete presentation of the sparse Cholesky algorithm of Kyng & Sachdeva [KS16], including a full proof of correctness.
Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps
Modern neural network architectures use structured linear transformations,
such as low-rank matrices, sparse matrices, permutations, and the Fourier
transform, to improve inference speed and reduce memory usage compared to
general linear maps. However, choosing which of the myriad structured
transformations to use (and its associated parameterization) is a laborious
task that requires trading off speed, space, and accuracy. We consider a
different approach: we introduce a family of matrices called kaleidoscope
matrices (K-matrices) that provably capture any structured matrix with
near-optimal space (parameter) and time (arithmetic operation) complexity. We
empirically validate that K-matrices can be automatically learned within
end-to-end pipelines to replace hand-crafted procedures, in order to improve
model quality. For example, replacing channel shuffles in ShuffleNet improves
classification accuracy on ImageNet by up to 5%. K-matrices can also simplify
hand-engineered pipelines -- we replace filter bank feature computation in
speech data preprocessing with a learnable kaleidoscope layer, resulting in
only 0.4% loss in accuracy on the TIMIT speech recognition task. In addition,
K-matrices can capture latent structure in models: for a challenging permuted
image classification task, a K-matrix based representation of permutations is
able to learn the right latent structure and improves accuracy of a downstream
convolutional model by over 9%. We provide a practically efficient
implementation of our approach, and use K-matrices in a Transformer network to
attain 36% faster end-to-end inference speed on a language translation task.
Comment: International Conference on Learning Representations (ICLR) 2020
spotlight
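A hedged sketch of the building block behind K-matrices, a single butterfly factor: at stride s it mixes coordinates i and i+s through independent 2x2 blocks, so a product of log2(n) factors connects every pair of coordinates in O(n log n) operations (the parameter sharing below is a simplification for illustration):

```python
import numpy as np

def butterfly_factor(x, a, b, c, d, stride):
    # Apply a 2x2 block to each pair (x[j], x[j + stride]).
    y = x.copy()
    for start in range(0, len(x), 2 * stride):
        for j in range(start, start + stride):
            u, v = x[j], x[j + stride]
            y[j] = a[j] * u + b[j] * v
            y[j + stride] = c[j] * u + d[j] * v
    return y

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
a, b, c, d = (rng.standard_normal(n) for _ in range(4))
y = x
for s in (1, 2, 4):            # log2(n) strided factors: O(n log n) total work
    y = butterfly_factor(y, a, b, c, d, stride=s)
```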