EIE: Efficient Inference Engine on Compressed Deep Neural Network
State-of-the-art deep neural networks (DNNs) have hundreds of millions of
connections and are both computationally and memory intensive, making them
difficult to deploy on embedded systems with limited hardware resources and
power budgets. While custom hardware helps the computation, fetching weights
from DRAM is two orders of magnitude more expensive than ALU operations, and
dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs
(AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by
pruning the redundant connections and having multiple connections share the
same weight. We propose an energy efficient inference engine (EIE) that
performs inference on this compressed network model and accelerates the
resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x.
Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to
CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on the compressed network, corresponding to 3 TOPS on the uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is
24,000x and 3,400x more energy efficient than a CPU and GPU respectively.
Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.
Comment: External Links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly: https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision: http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University: https://goo.gl/6lwuer. Published as a conference paper at ISCA 2016.
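A minimal software sketch of the operation EIE accelerates may help: a sparse matrix-vector product in which each stored weight is a small index into a shared codebook and zero input activations are skipped. The function name and toy data below are ours for illustration; EIE realizes this access pattern in custom hardware with CSC-style storage and 4-bit weight indices.

    import numpy as np

    def compressed_spmv(n_rows, col_ptr, row_idx, weight_idx, codebook, x):
        # W stored column-wise (CSC): nonzeros of column j live in the
        # half-open range col_ptr[j]:col_ptr[j+1]; each nonzero is an index
        # into the shared codebook rather than a full-precision float.
        y = np.zeros(n_rows)
        for j, xj in enumerate(x):
            if xj == 0.0:                 # skip zero activations from ReLU
                continue
            for k in range(col_ptr[j], col_ptr[j + 1]):
                y[row_idx[k]] += codebook[weight_idx[k]] * xj
        return y

    # Toy 4x3 matrix with four nonzeros sharing a 2-entry codebook.
    codebook = np.array([0.5, -1.0])
    col_ptr = np.array([0, 2, 3, 4])
    row_idx = np.array([0, 2, 1, 3])
    weight_idx = np.array([0, 1, 0, 1])
    x = np.array([1.0, 0.0, 2.0])         # the zero entry is never touched
    print(compressed_spmv(4, col_ptr, row_idx, weight_idx, codebook, x))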
Sparse Matrix Multiplication On An Associative Processor
Sparse matrix multiplication is an important component of linear algebra
computations. Implementing sparse matrix multiplication on an associative
processor (AP) enables a high level of parallelism, where a row of one matrix is
multiplied in parallel with the entire second matrix, and where the execution
time of vector dot product does not depend on the vector size. Four sparse
matrix multiplication algorithms are explored in this paper, combining AP and baseline CPU processing to varying degrees. They are evaluated by simulation on
a large set of sparse matrices. The computational complexity of sparse matrix multiplication on an AP is shown to be O(nnz), where nnz is the number of nonzero elements. The AP is found to be especially efficient for binary sparse matrix multiplication, and it outperforms conventional solutions in power efficiency.
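To see why the cost can scale with nnz rather than with the matrix dimension, consider the Gustavson-style row-wise formulation, in which every nonzero a_ik contributes a scaled sparse row of B to row i of the result. The sketch below is ours and runs serially on a CPU; the AP performs the per-row products in parallel inside associative memory.

    import numpy as np
    from scipy.sparse import csr_matrix, random as sprand

    def spgemm_rowwise(A, B):
        # Total work is proportional to the nonzeros touched: each nonzero
        # A[i, k] merges the scaled sparse row B[k, :] into row i of C.
        A, B = csr_matrix(A), csr_matrix(B)
        rows = []
        for i in range(A.shape[0]):
            acc = {}
            for p in range(A.indptr[i], A.indptr[i + 1]):
                k, a_ik = A.indices[p], A.data[p]
                for q in range(B.indptr[k], B.indptr[k + 1]):
                    j = B.indices[q]
                    acc[j] = acc.get(j, 0.0) + a_ik * B.data[q]
            rows.append(acc)
        return rows   # row i -> {column index: value}

    A = sprand(6, 6, density=0.2, random_state=0)
    B = sprand(6, 6, density=0.2, random_state=1)
    print(spgemm_rowwise(A, B))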
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massive parallel analog operations with low
hardware and energy cost. The analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if the vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the gain from performing parallel operations outweighs the waste due to sparsity.
The experiment results show that GRAPHR achieves a 16.01x speedup (up to 132.67x) and a 33.82x energy saving, on geometric mean, compared to a CPU baseline system. Compared to a GPU, GRAPHR achieves a 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is 3.67x to 10.96x more energy efficient, compared to a PIM-based architecture.
Comment: Accepted to HPCA 2018.
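The SpMV formulation is easiest to see for PageRank: each iteration is one sparse matrix-vector product with the column-normalized adjacency matrix. A small scipy sketch on a toy graph of our own; on GRAPHR the matrix blocks would be mapped onto ReRAM crossbars and the products performed in analog.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Tiny directed graph; M[j, i] = 1/outdeg(i) for every edge i -> j.
    edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
    n = 4
    src = np.array([i for i, _ in edges])
    dst = np.array([j for _, j in edges])
    outdeg = np.bincount(src, minlength=n).astype(float)
    M = csr_matrix((1.0 / outdeg[src], (dst, src)), shape=(n, n))

    d = 0.85
    r = np.ones(n) / n
    for _ in range(50):
        r = (1 - d) / n + d * (M @ r)   # one SpMV per PageRank iteration
    print(r)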
A Low-Power Accelerator for Deep Neural Networks with Enlarged Near-Zero Sparsity
It remains a challenge to run Deep Learning on devices with stringent power budgets in the Internet-of-Things. This paper presents a low-power accelerator
for processing Deep Neural Networks in the embedded devices. The power
reduction is realized by avoiding multiplications of near-zero valued data. The
near-zero approximation and a dedicated Near-Zero Approximation Unit (NZAU) are
proposed to predict and skip the near-zero multiplications under certain
thresholds. Compared with skipping zero-valued computations, our design
achieves 1.92X and 1.51X further reduction of the total multiplications in LeNet-5 and AlexNet respectively, with negligible loss of accuracy. In the
proposed accelerator, 256 multipliers are grouped into 16 independent
Processing Lanes (PL) to support up to 16 neuron activations simultaneously.
With the help of data pre-processing and buffering in each PL, multipliers can be clock-gated most of the time, even when data is streaming in continuously.
Designed and simulated in a UMC 65 nm process, the accelerator operating at 500 MHz is 4X faster than the mobile GPU Tegra K1 in processing the fully-connected layer FC8 of AlexNet, while consuming 717X less energy.
Comment: 5 pages, 6 figures.
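The idea generalizes zero-skipping: any multiplication whose input magnitude falls below a threshold is predicted to contribute little and is skipped. Below is a toy numeric sketch of the approximation, with a threshold and data of our choosing; in the accelerator the prediction is done by the dedicated NZAU hardware, not in software.

    import numpy as np

    def near_zero_skip_dot(w, x, threshold=1e-2):
        # Keep only products whose activation magnitude clears the threshold;
        # skipping exact zeros is the special case threshold -> 0.
        mask = np.abs(x) >= threshold
        skipped = x.size - np.count_nonzero(mask)
        return w[mask] @ x[mask], skipped

    rng = np.random.default_rng(0)
    w = rng.normal(size=1000)
    x = rng.normal(scale=0.02, size=1000)   # many near-zero activations
    approx, skipped = near_zero_skip_dot(w, x)
    print(f"approx={approx:.4f} exact={w @ x:.4f} skipped={skipped}/1000")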
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
Comment: 32 pages, 11 figures.
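The "MPI+X" paradigm mentioned above can be sketched generically for a distributed SpMV: MPI distributes row blocks across processes, while a node-level kernel (the "X") handles each local product. This is our own mpi4py illustration of the pattern, not GHOST's actual C interface.

    import numpy as np
    from mpi4py import MPI
    from scipy.sparse import random as sprand

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 1000
    my_rows = np.array_split(np.arange(n), size)[rank]   # row-block ownership
    A_local = sprand(len(my_rows), n, density=0.01, format="csr",
                     random_state=rank)
    x = np.ones(n)                       # assume x is replicated on all ranks

    y_local = A_local @ x                # node-level kernel: the "X" in MPI+X
    y = np.concatenate(comm.allgather(y_local))   # assemble the global result
    if rank == 0:
        print(y.shape)                   # run with: mpirun -np 4 python spmv.py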
A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems
Recently, graphics processors (GPUs) have been increasingly leveraged in a
variety of scientific computing applications. However, architectural
differences between CPUs and GPUs necessitate the development of algorithms
that take advantage of GPU hardware. As sparse matrix vector multiplication
(SPMV) operations are commonly used in finite element analysis, a new SPMV
algorithm and several variations are developed for unstructured finite element
meshes on GPUs. The effective bandwidth of current GPU algorithms and the newly
proposed algorithms are measured and analyzed for 15 sparse matrices of varying
sizes and varying sparsity structures. The effects of optimization and the differences between the new GPU algorithm and its variants are then studied. Lastly, both the new and current SPMV GPU algorithms are used in a GPU conjugate gradient (CG) solver for finite element simulations of the heart.
These results are then compared against those of a parallel PETSc finite element implementation. The effective bandwidth tests indicate that the new
algorithms compare very favorably with current algorithms for a wide variety of
sparse matrices and can yield very notable benefits. GPU finite element
simulation results demonstrate the benefit of using GPUs for finite element
analysis, and also show that the proposed algorithms can yield speedup factors
up to 12-fold for real finite element applications.
Comment: 35 pages, 22 figures.
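For reference, the kernel being redesigned here is the CSR sparse matrix-vector product. Below is a plain Python rendering of the row-wise loop (ours); GPU variants differ mainly in how rows are mapped to threads and how memory accesses are coalesced.

    import numpy as np
    from scipy.sparse import csr_matrix

    def spmv_csr(indptr, indices, data, x):
        # One output row per pass; on a GPU each row (or row segment)
        # is assigned to a thread or warp instead.
        y = np.empty(len(indptr) - 1)
        for i in range(len(y)):
            lo, hi = indptr[i], indptr[i + 1]
            y[i] = data[lo:hi] @ x[indices[lo:hi]]
        return y

    A = csr_matrix(np.array([[4.0, 0.0, 1.0],
                             [0.0, 3.0, 0.0],
                             [2.0, 0.0, 5.0]]))
    x = np.array([1.0, 2.0, 3.0])
    assert np.allclose(spmv_csr(A.indptr, A.indices, A.data, x), A @ x)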
Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM
Many long short-term memory (LSTM) applications need fast yet compact models.
Neural network compression approaches, such as the grow-and-prune paradigm,
have proved to be promising for cutting down network complexity by skipping
insignificant weights. However, current compression strategies are mostly
hardware-agnostic and network complexity reduction does not always translate
into execution efficiency. In this work, we propose a hardware-guided symbiotic
training methodology for compact, accurate, yet execution-efficient inference
models. It is based on our observation that hardware may introduce substantial
non-monotonic behavior, which we call the latency hysteresis effect, when
evaluating network size vs. inference latency. This observation raises questions about the mainstream smaller-dimension-is-better compression strategy, which
often leads to a sub-optimal model architecture. By leveraging the
hardware-impacted hysteresis effect and sparsity, we are able to achieve the
symbiosis of model compactness and accuracy with execution efficiency, thus
reducing LSTM latency while increasing its accuracy. We have evaluated our
algorithms on language modeling and speech recognition applications. Relative
to the traditional stacked LSTM architecture obtained for the Penn Treebank
dataset, we reduce the number of parameters by 18.0x (30.5x) and measured
run-time latency by up to 2.4x (5.2x) on Nvidia GPUs (Intel Xeon CPUs) without
any accuracy degradation. For the DeepSpeech2 architecture obtained for the AN4
dataset, we reduce the number of parameters by 7.0x (19.4x), word error rate
from 12.9% to 9.9% (10.4%), and measured run-time latency by up to 1.7x (2.4x)
on Nvidia GPUs (Intel Xeon CPUs). Thus, our method yields compact, accurate,
yet execution-efficient inference models.
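The practical consequence of the hysteresis effect is that model sizing should be driven by measured latency rather than by parameter count alone. Below is a toy sketch of that selection step, with a proxy workload and latency budget of our own; the paper couples such measurements with grow-and-prune training.

    import time
    import numpy as np

    def measured_latency(hidden, reps=20):
        # Time a proxy for one LSTM-like step at this hidden size.
        W = np.random.rand(4 * hidden, hidden).astype(np.float32)
        x = np.random.rand(hidden).astype(np.float32)
        t0 = time.perf_counter()
        for _ in range(reps):
            W @ x
        return (time.perf_counter() - t0) / reps

    # Latency can be non-monotonic in size (the hysteresis effect), so pick
    # the largest candidate that meets the budget by measuring each one
    # instead of assuming smaller-is-faster.
    budget = 1e-3
    candidates = range(256, 1537, 64)
    feasible = [h for h in candidates if measured_latency(h) <= budget]
    print("chosen hidden size:", max(feasible) if feasible else None)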
Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments
Architectures with multiple classes of memory media are becoming a common
part of mainstream supercomputer deployments. So-called multi-level memories
offer differing characteristics for each memory component including variation
in bandwidth, latency and capacity. This paper investigates the performance of
sparse matrix multiplication kernels on two leading high-performance computing
architectures -- Intel's Knights Landing processor and NVIDIA's Pascal GPU. We
describe a data placement method and a chunking-based algorithm for our kernels
that exploits the existence of the multiple memory spaces in each hardware
platform. We evaluate the performance of these methods against standard algorithms that use the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse perform as well as multi-memory-aware algorithms on architectures such as KNL, where the memory subsystems have
similar latencies. However, for architectures such as GPUs where memory
subsystems differ significantly in both bandwidth and latency,
multi-memory-aware methods are crucial for good performance. In addition, our
new approaches permit the user to run problems that require larger capacities
than the fastest memory of each compute node without depending on the
software-managed cache mechanisms.
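A minimal rendering of the chunking idea: B is processed in column chunks sized to fit the fast memory tier, and the partial products are concatenated. The chunk size and staging below are ours for illustration; the paper's kernels additionally control data placement across the memory tiers.

    from scipy.sparse import hstack, random as sprand

    def chunked_spgemm(A, B, chunk_cols):
        # Multiply A against column chunks of B small enough for the fast
        # memory (e.g., MCDRAM on KNL or device memory on a GPU); each chunk
        # would be staged into the fast tier before its local multiply.
        parts = []
        for j0 in range(0, B.shape[1], chunk_cols):
            B_chunk = B[:, j0:j0 + chunk_cols]   # stage into fast memory
            parts.append(A @ B_chunk)
        return hstack(parts)

    A = sprand(200, 200, density=0.05, format="csr", random_state=0)
    B = sprand(200, 200, density=0.05, format="csc", random_state=1)
    C = chunked_spgemm(A, B, chunk_cols=64)
    assert abs(C - A @ B).sum() < 1e-9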
Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units
We revisit the implementation of iterative solvers on discrete graphics
processing units and demonstrate the benefit of implementations using extensive
kernel fusion for pipelined formulations over conventional implementations of
classical formulations. The proposed implementations with both CUDA and OpenCL
are freely available in ViennaCL and are shown to be competitive with or even
superior to other solver packages for graphics processing units. The highest performance gains are obtained for small to medium-sized systems, while our
implementations are on par with vendor-tuned implementations for very large
systems. Our results are especially beneficial for transient problems, where
many small to medium-sized systems instead of a single big system need to be
solved.
Comment: 27 pages, 9 figures, 3 tables.
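Pipelined CG illustrates what these formulations buy: the two dot products of an iteration share their input vectors and can be fused into a single global reduction, and all vector updates collapse into one fused kernel. Below is our own unpreconditioned numpy sketch of the Ghysels-Vanroose recurrences, not ViennaCL's implementation.

    import numpy as np
    from scipy.sparse import diags

    def pipelined_cg(A, b, iters=200):
        x = np.zeros_like(b)
        r = b - A @ x
        w = A @ r
        gamma = r @ r
        for i in range(iters):
            delta = w @ r              # on a GPU: fused with gamma's reduction
            q = A @ w                  # the single SpMV of the iteration
            if i == 0:
                alpha = gamma / delta
                z, s, p = q.copy(), w.copy(), r.copy()
            else:
                beta = gamma / gamma_old
                alpha = gamma / (delta - beta * gamma / alpha_old)
                z = q + beta * z       # z = A s and s = A p are maintained
                s = w + beta * s       #   by recurrence instead of extra SpMVs
                p = r + beta * p
            # the next four updates are what kernel fusion turns into one pass
            x = x + alpha * p
            r = r - alpha * s
            w = w - alpha * z
            gamma_old, alpha_old, gamma = gamma, alpha, r @ r
        return x

    n = 100
    A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    print(np.linalg.norm(A @ pipelined_cg(A, b) - b))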
GPUQT: An efficient linear-scaling quantum transport code fully implemented on graphics processing units
We present GPUQT, a quantum transport code fully implemented on graphics
processing units. Using this code, one can obtain intrinsic electronic
transport properties of large systems described by a real-space tight-binding
Hamiltonian together with one or more types of disorder. The DC Kubo
conductivity is represented as a time integral of the velocity auto-correlation
or a time derivative of the mean square displacement. Linear scaling (with
respect to the total number of orbitals in the system) computation time and
memory usage are achieved by using various numerical techniques, including
sparse matrix-vector multiplication, random phase approximation of the trace,
Chebyshev expansion of quantum evolution operator, and kernel polynomial method
for quantum resolution operator. We describe the inputs and outputs of GPUQT
and give two examples to demonstrate its usage, paying attention to the
interpretations of the results.
Comment: 22 pages, 2 figures; computer code available.
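The Chebyshev expansion of the evolution operator reduces time propagation to repeated sparse matrix-vector products through the three-term recurrence. Below is a compact sketch with a toy Hamiltonian and a crude spectral bound of our own; GPUQT evaluates the same recurrence on the GPU.

    import numpy as np
    from scipy.sparse import diags
    from scipy.special import jv

    def chebyshev_evolve(H, psi, t, n_terms=60, safety=1.05):
        # exp(-iHt)|psi> via exp(-iax) = sum_k (2 - delta_k0)(-i)^k J_k(a) T_k(x),
        # after rescaling H so that its spectrum lies in [-1, 1].
        emax = safety * abs(H).sum(axis=1).max()   # Gershgorin-style bound
        Ht, a = H / emax, emax * t
        tk_prev, tk = psi, Ht @ psi                # T_0 psi and T_1 psi
        out = jv(0, a) * tk_prev + 2 * (-1j) * jv(1, a) * tk
        for k in range(2, n_terms):
            tk_prev, tk = tk, 2 * (Ht @ tk) - tk_prev   # Chebyshev recurrence
            out = out + 2 * (-1j) ** k * jv(k, a) * tk
        return out

    n = 64
    H = diags([1.0, 0.0, 1.0], [-1, 0, 1], shape=(n, n), format="csr")
    psi = np.zeros(n, dtype=complex)
    psi[n // 2] = 1.0                              # start from a single site
    psi_t = chebyshev_evolve(H, psi, t=2.0)
    print(abs(np.vdot(psi_t, psi_t)))              # unitary evolution: ~1.0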