Recent Advances in Convolutional Neural Network Acceleration
In recent years, convolutional neural networks (CNNs) have shown strong
performance in fields such as image classification, pattern recognition, and
multimedia compression. Two of their characteristic properties, local
connectivity and weight sharing, reduce the number of parameters and increase
processing speed during training and inference. However, as data dimensionality
grows and CNN architectures become more complicated, end-to-end or combined CNN
pipelines become computationally intensive, which limits the further deployment
of CNNs. It is therefore necessary and urgent to make CNNs run faster. In this
paper, we first summarize acceleration methods that contribute to, but are not
limited to, CNNs by reviewing a broad variety of research
papers. We propose a taxonomy in terms of three levels, i.e., structure level,
algorithm level, and implementation level, for acceleration methods. We also
analyze the acceleration methods in terms of CNN architecture compression,
algorithm optimization, and hardware-based improvement. Finally, we discuss
different perspectives on these acceleration and optimization methods within
each level. The discussion shows that the methods at each level still leave
considerable room for exploration. By incorporating such a wide range of
disciplines, we expect to provide a comprehensive reference for researchers who
are interested in CNN acceleration. Comment: submitted to Neurocomputing
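As a rough illustration of how local connectivity and weight sharing cut parameter counts (the numbers below are illustrative and not taken from the paper), compare a small convolutional layer with a fully connected layer producing the same output volume:

```python
# Illustrative parameter counts, not from the paper: a 3x3 convolution versus a
# fully connected layer producing the same output volume from a 32x32x3 input.
in_h, in_w, in_c, out_c, k = 32, 32, 3, 64, 3

conv_params = k * k * in_c * out_c + out_c                 # weight sharing: 1,792 parameters
fc_params = (in_h * in_w * in_c) * (in_h * in_w * out_c)   # dense mapping: ~201 million weights

print(conv_params, fc_params)
```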
Deep Learning: Computational Aspects
In this article we review computational aspects of Deep Learning (DL). Deep
learning uses network architectures consisting of hierarchical layers of latent
variables to construct predictors for high-dimensional input-output models.
Training a deep learning architecture is computationally intensive, and
efficient linear algebra libraries are key to both training and inference.
Stochastic gradient descent (SGD) optimization and batch sampling are used to
learn from massive data sets.
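A minimal sketch of the mini-batch SGD loop referred to above, applied to least-squares regression; all names and hyperparameters here are illustrative, not from the article.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch stochastic gradient descent for least-squares regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # fresh batch sampling each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)   # gradient on the mini-batch
            w -= lr * grad
    return w
```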
Julia Language in Machine Learning: Algorithms, Applications, and Open Issues
Machine learning is driving development across many fields in science and
engineering. A simple and efficient programming language could accelerate
applications of machine learning in various fields. Currently, the programming
languages most commonly used to develop machine learning algorithms include
Python, MATLAB, and C/C++. However, none of these languages balances efficiency
and simplicity well. The Julia language is a fast, easy-to-use, open-source
programming language that was originally designed for high-performance
computing and balances efficiency and simplicity well. This paper summarizes
the related research work and developments in
the application of the Julia language in machine learning. It first surveys the
popular machine learning algorithms that are developed in the Julia language.
Then, it investigates applications of the machine learning algorithms
implemented with the Julia language. Finally, it discusses the open issues and
the potential future directions that arise in the use of the Julia language in
machine learning. Comment: Published in Computer Science Review
MatRox: Modular approach for improving data locality in Hierarchical (Mat)rix App(Rox)imation
Hierarchical matrix approximations have gained significant traction in the
machine learning and scientific community as they exploit available low-rank
structures in kernel methods to compress the kernel matrix. The resulting
compressed matrix, HMatrix, is used to reduce the computational complexity of
operations such as HMatrix-matrix multiplications with tuneable accuracy in an
evaluation phase. Existing implementations of HMatrix evaluations do not
preserve locality and often lead to unbalanced parallel execution with high
synchronization. Also, current solutions require the compression phase to
re-execute if the kernel method or the required accuracy change. In this work,
we describe MatRox, a framework that uses novel structure analysis strategies,
blocking and coarsening, with code specialization and a storage format to improve
locality and create load-balanced parallel tasks for HMatrix-matrix
multiplications. Modularization of the matrix compression phase enables the
reuse of computations when there are changes to the input accuracy and the
kernel function. The MatRox-generated code for matrix-matrix multiplication is
2.98x, 1.60x, and 5.98x faster than the library implementations available in
GOFMM, SMASH, and STRUMPACK, respectively. Additionally, the ability to reuse
portions of the compression computation when the accuracy changes leads to up
to a 2.64x improvement with MatRox over five changes to accuracy using GOFMM.
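The underlying idea, approximating off-diagonal kernel blocks with low-rank factors so that later matrix-vector products are cheap and the factors can be reused, can be sketched as follows. This is an illustrative truncated-SVD compression, not MatRox's actual structure analysis or API.

```python
import numpy as np

def compress_block(A, tol=1e-3):
    """Low-rank approximation of one kernel block via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))   # rank chosen by a relative tolerance
    return U[:, :r] * s[:r], Vt[:r, :]        # A ~= (U * s) @ Vt

def block_matvec(U, Vt, x):
    """Apply the compressed block to a vector in O(r(m+n)) work instead of O(mn)."""
    return U @ (Vt @ x)

# A kernel block between two well-separated point clusters is numerically low-rank,
# so the compressed factors can be stored once and reused across many evaluations.
pts_a, pts_b = np.linspace(0, 1, 200), np.linspace(5, 6, 200)
A = np.exp(-np.abs(np.subtract.outer(pts_a, pts_b)))
U, Vt = compress_block(A)
x = np.random.default_rng(1).standard_normal(200)
print(np.linalg.norm(A @ x - block_matvec(U, Vt, x)))   # small approximation error
```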
Software for Sparse Tensor Decomposition on Emerging Computing Architectures
In this paper, we develop software for decomposing sparse tensors that is
portable to and performant on a variety of multicore, manycore, and GPU
computing architectures. The result is a single code whose performance matches
optimized architecture-specific implementations. The key to a portable approach
is to determine multiple levels of parallelism that can be mapped in different
ways to different architectures, and we explain how to do this for the
matricized tensor times Khatri-Rao product (MTTKRP) which is the key kernel in
canonical polyadic tensor decomposition. Our implementation leverages the
Kokkos framework, which enables a single code to achieve high performance
across multiple architectures that differ in how they approach fine-grained
parallelism. We also introduce a new construct for portable thread-local
arrays, which we call compile-time polymorphic arrays. Not only are the
specifics of our approaches and implementation interesting for tuning tensor
computations, but they also provide a roadmap for developing other portable
high-performance codes. As a last step in optimizing performance, we modify the
MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce
atomic-write contention. We test the performance of our implementation on 16-
and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are
competitive with state-of-the-art architecture-specific codes while having the
advantage of being able to run on a variety of architectures.
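For reference, a minimal sequential version of the MTTKRP kernel for a 3-way sparse tensor in coordinate (COO) form is sketched below; the multi-level parallelism, Kokkos portability layer, and permuted traversal described in the paper are omitted, and the names are illustrative.

```python
import numpy as np

def mttkrp_mode0(coords, vals, B, C, n_rows):
    """MTTKRP along mode 0 of a sparse tensor stored as COO triples (i, j, k) -> val.

    Computes M[i, :] = sum over nonzeros of X[i, j, k] * (B[j, :] * C[k, :]),
    the key kernel in canonical polyadic (CP) tensor decomposition.
    """
    M = np.zeros((n_rows, B.shape[1]))
    for (i, j, k), v in zip(coords, vals):
        M[i] += v * B[j] * C[k]   # this accumulation needs atomics when parallelized
    return M
```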
A Deep Structured Model with Radius-Margin Bound for 3D Human Activity Recognition
Understanding human activity is very challenging even with the recently
developed 3D/depth sensors. To solve this problem, this work investigates a
novel deep structured model, which adaptively decomposes an activity instance
into temporal parts using the convolutional neural networks (CNNs). Our model
advances the traditional deep learning approaches in two aspects. First, we
incorporate latent temporal structure into the deep model, accounting for large
temporal variations of diverse human activities. In particular, we utilize the
latent variables to decompose the input activity into a number of temporally
segmented sub-activities, and accordingly feed them into the parts (i.e.
sub-networks) of the deep architecture. Second, we incorporate a radius-margin
bound as a regularization term into our deep model, which effectively improves
the generalization performance for classification. For model training, we
propose a principled learning algorithm that iteratively (i) discovers the
optimal latent variables (i.e. the ways of activity decomposition) for all
training instances, (ii) updates the classifiers based on the generated
features, and (iii) updates the parameters of multi-layer neural networks. In
the experiments, our approach is validated on several complex scenarios for
human activity recognition and demonstrates superior performances over other
state-of-the-art approaches. Comment: 16 pages, 9 figures, to appear in International Journal of Computer Vision 201
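As background, the classical radius-margin bound penalizes R^2 * ||w||^2, where R is the radius of the smallest ball enclosing the features; a toy version of such a regularized hinge loss is sketched below. This only illustrates the regularizer's form and is not the paper's deep model or training algorithm.

```python
import numpy as np

def radius_margin_loss(w, features, labels, lam=1e-3):
    """Hinge loss plus a radius-margin style regularizer R^2 * ||w||^2.

    labels are +1/-1; R is estimated as the largest distance of a feature
    vector from the feature mean (a crude stand-in for the enclosing-ball radius).
    """
    hinge = np.maximum(1.0 - labels * (features @ w), 0.0).mean()
    center = features.mean(axis=0)
    R2 = np.max(np.sum((features - center) ** 2, axis=1))
    return hinge + lam * R2 * np.dot(w, w)
```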
CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices
Large-scale deep neural networks (DNNs) are both compute and memory
intensive. As the size of DNNs continues to grow, it is critical to improve the
energy efficiency and performance while maintaining accuracy. For DNNs, the
model size is an important factor affecting performance, scalability and energy
efficiency. Weight pruning achieves good compression ratios but suffers from
three drawbacks: 1) the irregular network structure after pruning; 2) the
increased training complexity; and 3) the lack of rigorous guarantee of
compression ratio and inference accuracy. To overcome these limitations, this
paper proposes CirCNN, a principled approach to represent weights and process
neural networks using block-circulant matrices. CirCNN utilizes the Fast
Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the
computational complexity (both in inference and training) from O(n^2) to
O(n log n) and the storage complexity from O(n^2) to O(n), with negligible
accuracy loss. Compared to other approaches, CirCNN is distinct due to its
mathematical rigor: it can converge to the same effectiveness as DNNs without
compression. The CirCNN architecture is a universal DNN inference engine that
can be implemented on various hardware/software platforms with a configurable
network architecture. To demonstrate its performance and energy efficiency, we test
CirCNN in FPGA, ASIC and embedded processors. Our results show that CirCNN
architecture achieves very high energy efficiency and performance with a small
hardware footprint. Based on the FPGA implementation and ASIC synthesis
results, CirCNN achieves 6-102X energy efficiency improvements compared with
the best state-of-the-art results. Comment: 14 pages, 15 figures, conference
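The complexity reduction comes from the fact that multiplying by an n x n circulant matrix is a circular convolution, which the FFT performs in O(n log n) time with only O(n) storage. A small numpy illustration of this identity for one circulant block (not the CirCNN implementation itself):

```python
import numpy as np

def circulant_matvec(c, x):
    """y = C @ x, where C is the circulant matrix whose first column is c.

    Only c is stored (O(n) instead of O(n^2)), and the product costs
    O(n log n) via the FFT instead of O(n^2).
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Check against the explicit dense circulant matrix.
rng = np.random.default_rng(0)
c, x = rng.standard_normal(8), rng.standard_normal(8)
C = np.column_stack([np.roll(c, k) for k in range(8)])
print(np.allclose(C @ x, circulant_matvec(c, x)))   # True
```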
Billion-scale similarity search with GPUs
Similarity search finds application in specialized database systems handling
complex data such as images or videos, which are typically represented by
high-dimensional features and require specific indexing structures. This paper
tackles the problem of better utilizing GPUs for this task. While GPUs excel at
data-parallel tasks, prior approaches are bottlenecked by algorithms that
expose less parallelism, such as k-min selection, or make poor use of the
memory hierarchy.
We propose a design for k-selection that operates at up to 55% of theoretical
peak performance, enabling a nearest neighbor implementation that is 8.5x
faster than prior GPU state of the art. We apply it to different similarity
search scenarios by proposing optimized designs for brute-force, approximate,
and compressed-domain search based on product quantization. In all these
setups, we outperform the state of the art by large margins. Our implementation
enables the construction of a high accuracy k-NN graph on 95 million images
from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion
vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced
our approach for the sake of comparison and reproducibility.
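As a point of reference for the brute-force case, an exact k-NN search can be written with a partial selection step (here numpy's argpartition stands in for the GPU k-selection kernel); this CPU sketch is illustrative and is not the paper's GPU implementation.

```python
import numpy as np

def knn_bruteforce(queries, database, k):
    """Exact k nearest neighbors under L2 distance, batched over queries.

    ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2; the ||q||^2 term is constant per
    query and does not change the ranking, so it is dropped.
    """
    dists = -2.0 * queries @ database.T + np.sum(database ** 2, axis=1)
    idx = np.argpartition(dists, k - 1, axis=1)[:, :k]            # unordered top-k
    order = np.argsort(np.take_along_axis(dists, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)                  # sorted neighbor ids

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 64)).astype(np.float32)
q = rng.standard_normal((5, 64)).astype(np.float32)
print(knn_bruteforce(q, db, k=10).shape)   # (5, 10)
```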
dMath: A Scalable Linear Algebra and Math Library for Heterogeneous GP-GPU Architectures
This paper presents dMath, a new scalable parallel math library that
demonstrates leading scaling when using intra-node or inter-node hybrid
parallelism for deep learning. dMath provides easy-to-use distributed
base primitives and a variety of domain-specific algorithms, including matrix
multiplication, convolutions, and others, allowing for rapid development of
highly scalable applications such as deep neural networks (DNNs). Previously,
one was restricted to libraries that provided effective primitives for only a
single GPU, like NVIDIA cuBLAS and cuDNN, or DNN primitives from the Nervana
neon framework. Development of HPC software is difficult,
labor-intensive work, requiring a unique skill set. dMath allows a wide range
of developers to utilize parallel and distributed hardware easily. One
contribution of this approach is that data is stored persistently on the GPU
hardware, avoiding costly transfers between host and device. Advanced memory
management techniques are utilized, including caching of transferred data and
memory reuse through pooling. A key contribution of dMath is that it delivers
performance, portability, and productivity to its specific domain of support.
It enables algorithm and application programmers to quickly solve problems
without managing the significant complexity associated with multi-level
parallelism.
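One of the memory-management ideas mentioned, reusing buffers through a pool instead of repeatedly allocating and freeing them, is illustrated below with a host-side sketch; dMath's actual GPU pool and API are not shown, and all names are illustrative.

```python
import numpy as np
from collections import defaultdict

class BufferPool:
    """Reuse previously released arrays of the same shape/dtype rather than
    paying an allocation (or host-device transfer) for every operation."""

    def __init__(self):
        self._free = defaultdict(list)

    def acquire(self, shape, dtype=np.float32):
        key = (tuple(shape), np.dtype(dtype))
        if self._free[key]:
            return self._free[key].pop()      # cache hit: reuse an existing buffer
        return np.empty(shape, dtype=dtype)   # cache miss: allocate a new one

    def release(self, buf):
        self._free[(buf.shape, buf.dtype)].append(buf)

pool = BufferPool()
a = pool.acquire((1024, 1024))
pool.release(a)
b = pool.acquire((1024, 1024))   # same underlying buffer, no new allocation
print(a is b)                    # True
```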
Learning efficient sparse and low rank models
Parsimony, including sparsity and low rank, has been shown to successfully
model data in numerous machine learning and signal processing tasks.
Traditionally, such modeling approaches rely on an iterative algorithm that
minimizes an objective function with parsimony-promoting terms. The inherently
sequential structure and data-dependent complexity and latency of iterative
optimization constitute a major limitation in many applications requiring
real-time performance or involving large-scale data. Another limitation
encountered by these modeling techniques is the difficulty of their inclusion
in discriminative learning scenarios. In this work, we propose to move the
emphasis from the model to the pursuit algorithm, and develop a process-centric
view of parsimonious modeling, in which a learned deterministic
fixed-complexity pursuit process is used in lieu of iterative optimization. We
show a principled way to construct learnable pursuit process architectures for
structured sparse and robust low rank models, derived from the iteration of
proximal descent algorithms. These architectures learn to approximate the exact
parsimonious representation at a fraction of the complexity of the standard
optimization methods. We also show that appropriate training regimes allow
parsimonious models to be naturally extended to discriminative settings.
State-of-the-art results are demonstrated on several challenging problems in
image and audio processing with several orders of magnitude speedup compared to
the exact optimization algorithms.
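The process-centric idea, unrolling a fixed number of proximal-descent (ISTA-style) iterations into a feed-forward computation whose matrices can then be trained, can be sketched as follows; here the weights are simply initialized from the dictionary rather than learned, and all names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm (soft shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def unrolled_ista(x, D, n_layers=5, theta=0.1):
    """Fixed-depth, feed-forward approximation of the sparse code
    argmin_z 0.5 * ||x - D z||^2 + theta * ||z||_1.

    Each 'layer' is one ISTA step; in a learned pursuit network, W_e and S
    would be trained rather than fixed by the dictionary D.
    """
    L = np.linalg.norm(D, 2) ** 2             # Lipschitz constant of the gradient
    W_e = D.T / L                             # encoder weights
    S = np.eye(D.shape[1]) - (D.T @ D) / L    # recurrent weights
    z = soft_threshold(W_e @ x, theta / L)
    for _ in range(n_layers - 1):
        z = soft_threshold(S @ z + W_e @ x, theta / L)
    return z
```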