FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies and the availability of credible data, deep learning, an area of artificial intelligence, has emerged and demonstrated its effectiveness in solving complex learning problems that were not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they require intensive computation and memory bandwidth that general-purpose CPUs cannot deliver at the desired performance levels.
Consequently, hardware accelerators that use application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have recently been adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism and their energy efficiency. In this paper, we review recent techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques to improve acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNN acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators for deep learning networks. Thus, this review is expected to guide future advances in efficient hardware accelerators and to be useful for deep learning researchers.
Comment: This article has been accepted for publication in IEEE Access (December 2018).
Application-Driven Near-Data Processing for Similarity Search
Similarity search is key to a variety of applications including
content-based search for images and video, recommendation systems, data
deduplication, natural language processing, computer vision, databases,
computational biology, and computer graphics. At its core, similarity search
manifests as k-nearest neighbors (kNN), a computationally simple primitive
consisting of highly parallel distance calculations and a global top-k sort.
However, kNN is poorly supported by today's architectures because of its high
memory bandwidth requirements.
This paper proposes an application-driven near-data processing accelerator
for similarity search: the Similarity Search Associative Memory (SSAM). By
instantiating compute units close to memory, SSAM benefits from the higher
memory bandwidth and density exposed by emerging memory technologies. We
evaluate the SSAM design down to layout on top of the Micron hybrid memory cube
(HMC), and show that SSAM can achieve up to two orders of magnitude improvement in area-normalized throughput and energy efficiency over multicore CPUs; we also show that SSAM is faster and more energy efficient than competing GPUs and FPGAs. Finally, we show that SSAM is also useful for other data-intensive tasks like kNN index construction, and can be generalized to function semantically as a high-capacity content-addressable memory.
Comment: 15 pages, 8 figures, 7 tables.
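As a concrete reference for the kNN primitive described above, here is a minimal NumPy sketch of its two phases, parallel distance calculations followed by a global top-k selection; it is purely illustrative and does not reflect SSAM's hardware dataflow.

```python
import numpy as np

def knn(queries, dataset, k):
    """Brute-force k-nearest neighbors: per-query distance
    calculations (embarrassingly parallel) plus a global top-k
    selection, the two phases the abstract identifies."""
    # Squared Euclidean distances, shape (num_queries, num_points).
    d2 = ((queries[:, None, :] - dataset[None, :, :]) ** 2).sum(-1)
    # Partial sort: indices of the k smallest distances per query.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    # Order the k candidates by actual distance.
    order = np.take_along_axis(d2, idx, axis=1).argsort(axis=1)
    return np.take_along_axis(idx, order, axis=1)

# Example: 3 queries against 1000 points in 64 dimensions.
rng = np.random.default_rng(0)
neighbors = knn(rng.normal(size=(3, 64)), rng.normal(size=(1000, 64)), k=5)
```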
Morph: Flexible Acceleration for 3D CNN-based Video Understanding
The past several years have seen both an explosion in the use of
Convolutional Neural Networks (CNNs) and the design of accelerators to make CNN
inference practical. In the architecture community, the lion's share of effort
has targeted CNN inference for image recognition. The closely related problem
of video recognition has received far less attention as an accelerator target.
This is surprising, as video recognition is more computationally intensive than
image recognition, and video traffic is predicted to be the majority of
internet traffic in the coming years.
This paper fills the gap between algorithmic and hardware advances for video
recognition by providing a design space exploration and flexible architecture
for accelerating 3D Convolutional Neural Networks (3D CNNs), the core kernel in modern video understanding. When compared to (2D) CNNs used for image
recognition, efficiently accelerating 3D CNNs poses a significant engineering
challenge due to their large (and variable over time) memory footprint and
higher dimensionality.
To address these challenges, we design a novel accelerator, called Morph,
that can adaptively support different spatial and temporal tiling strategies
depending on the needs of each layer of each target 3D CNN. We codesign a
software infrastructure alongside the Morph hardware to find good-fit
parameters to control the hardware. Evaluated on state-of-the-art 3D CNNs,
Morph achieves up to 3.4x (2.5x average) reduction in energy consumption and
improves performance/watt by up to 5.1x (4x average) compared to a baseline 3D
CNN accelerator, with an area overhead of 5%. Morph further achieves a 15.9x
average energy reduction on 3D CNNs when compared to Eyeriss.
Comment: Appears in the proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.
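To make the tiling idea concrete, below is a schematic Python loop nest for a tiled 3D convolution, assuming unit stride and no padding; the fixed tile sizes here stand in for the per-layer spatial and temporal tiling parameters that Morph's software flow would select.

```python
import numpy as np

def conv3d_tiled(ifmap, weights, tile_t=4, tile_y=8, tile_x=8):
    # ifmap: (C, T, Y, X); weights: (K, C, R, S, Q); unit stride, no padding.
    C, T, Y, X = ifmap.shape
    K, _, R, S, Q = weights.shape
    OT, OY, OX = T - R + 1, Y - S + 1, X - Q + 1
    out = np.zeros((K, OT, OY, OX))
    # Iterate over temporal and spatial output tiles; on hardware, each
    # tile's inputs would be staged into on-chip buffers before compute.
    for t0 in range(0, OT, tile_t):
        for y0 in range(0, OY, tile_y):
            for x0 in range(0, OX, tile_x):
                for t in range(t0, min(t0 + tile_t, OT)):
                    for y in range(y0, min(y0 + tile_y, OY)):
                        for x in range(x0, min(x0 + tile_x, OX)):
                            window = ifmap[:, t:t+R, y:y+S, x:x+Q]
                            out[:, t, y, x] = (weights * window).sum(axis=(1, 2, 3, 4))
    return out

# Example: a 3-channel, 8-frame clip convolved with four 3x3x3 filters.
y = conv3d_tiled(np.random.rand(3, 8, 16, 16), np.random.rand(4, 3, 3, 3, 3))
```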
A Design Methodology for Efficient Implementation of Deconvolutional Neural Networks on an FPGA
In recent years deep learning algorithms have shown extremely high
performance on machine learning tasks such as image classification and speech
recognition. In support of such applications, various FPGA accelerator
architectures have been proposed for convolutional neural networks (CNNs) that
enable high performance for classification tasks at lower power than CPU and
GPU processors. However, to date, there has been little research on the use of
FPGA implementations of deconvolutional neural networks (DCNNs). DCNNs, also
known as generative CNNs, encode high-dimensional probability distributions and
have been widely used for computer vision applications such as scene
completion, scene segmentation, image creation, image denoising, and
super-resolution imaging. We propose an FPGA architecture for deconvolutional
networks built around an accelerator which effectively handles the complex
memory access patterns needed to perform strided deconvolutions, and that
supports convolution as well. We also develop a three-step design optimization
method that systematically exploits statistical analysis, design space
exploration and VLSI optimization. To verify our FPGA deconvolutional
accelerator design methodology, we train DCNNs offline on two representative datasets using the generative adversarial network (GAN) method running on TensorFlow, and then map these DCNNs to an FPGA DCNN-plus-accelerator implementation to perform generative inference on a Xilinx Zynq-7000 FPGA. Our DCNN implementation achieves a peak performance density of 0.012 GOPs/DSP.
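For readers unfamiliar with the operation, the following scatter-style NumPy sketch shows a strided deconvolution (transposed convolution), whose overlapping, strided writes produce the complex memory access patterns the accelerator must handle; it is illustrative only, not the paper's FPGA design.

```python
import numpy as np

def deconv2d(ifmap, kernel, stride=2):
    """Strided deconvolution (transposed convolution), one channel.
    Each input pixel scatters a weighted copy of the kernel into the
    output; the overlapping, strided writes are the irregular memory
    access pattern an accelerator must handle efficiently."""
    H, W = ifmap.shape
    R, S = kernel.shape
    out = np.zeros((stride * (H - 1) + R, stride * (W - 1) + S))
    for y in range(H):
        for x in range(W):
            out[y*stride:y*stride+R, x*stride:x*stride+S] += ifmap[y, x] * kernel
    return out

# Example: upsample a 4x4 feature map to 9x9 with stride 2.
y = deconv2d(np.random.rand(4, 4), np.random.rand(3, 3), stride=2)
```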
Exploring Computation-Communication Tradeoffs in Camera Systems
Cameras are the de facto sensor. The growing demand for real-time and
low-power computer vision, coupled with trends towards high-efficiency
heterogeneous systems, has given rise to a wide range of image processing
acceleration techniques at the camera node and in the cloud. In this paper, we
characterize two novel camera systems that use acceleration techniques to push
the extremes of energy and performance scaling, and explore the
computation-communication tradeoffs in their design. The first case study
targets a camera system designed to detect and authenticate individual faces,
running solely on energy harvested from RFID readers. We design a
multi-accelerator SoC operating in the sub-mW range, and evaluate it
with real-world workloads to show performance and energy efficiency
improvements over a general purpose microprocessor. The second camera system
supports a 16-camera rig processing over 32 Gb/s of data to produce real-time
3D, 360-degree virtual reality video. We design a multi-FPGA processing pipeline
that outperforms CPU and GPU configurations by up to 10x in computation time,
producing panoramic stereo video directly from the camera rig at 30 frames per
second. We find that an early data reduction step, either before complex
processing or offloading, is the most critical optimization for in-camera
systems.
Recent Advances in Convolutional Neural Network Acceleration
In recent years, convolutional neural networks (CNNs) have shown great
performance in various fields such as image classification, pattern
recognition, and multimedia compression. Two of their defining properties, local connectivity and weight sharing, reduce the number of parameters and increase processing speed during training and inference. However, as data dimensionality grows and CNN architectures become more complicated, end-to-end or combined uses of CNNs become computationally intensive, which limits their further deployment. Therefore, it is necessary and urgent to make CNNs faster. In this paper, we first summarize acceleration methods that apply to, but are not limited to, CNNs by reviewing a broad variety of research papers. We propose a taxonomy of acceleration methods with three levels, i.e., structure level, algorithm level, and implementation level. We also
analyze the acceleration methods in terms of CNN architecture compression,
algorithm optimization, and hardware-based improvement. Finally, we discuss different perspectives on these acceleration and optimization methods within each level. The discussion shows that the methods at each level still leave a large space for exploration. By incorporating such a wide range of
disciplines, we expect to provide a comprehensive reference for researchers who
are interested in CNN acceleration.
Comment: Submitted to Neurocomputing.
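A quick back-of-envelope calculation illustrates the parameter savings from local connectivity and weight sharing mentioned above; the layer sizes are made up for illustration and are not drawn from the paper.

```python
# Fully connected vs. shared 3x3 kernels for a 32x32x3 -> 32x32x16 layer.
in_h, in_w, in_c = 32, 32, 3      # input feature map (hypothetical sizes)
out_c, k = 16, 3                  # 16 output channels, 3x3 kernels

dense = (in_h * in_w * in_c) * (in_h * in_w * out_c)  # one weight per pixel pair
conv = out_c * in_c * k * k                           # kernels shared across pixels
print(dense, conv)   # 50331648 vs 432: roughly five orders of magnitude fewer
```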
Optimizing Temporal Convolutional Network inference on FPGA-based accelerators
Convolutional Neural Networks are extensively used in a wide range of
applications, commonly including computer vision tasks like image and video
classification, recognition, and segmentation. Recent research results
demonstrate that multilayer (deep) networks involving mono-dimensional convolutions and dilation can be used effectively in time-series and sequence classification and segmentation, as well as in tasks involving sequence modelling. These structures, commonly referred to as Temporal Convolutional
Networks (TCNs), have been demonstrated to consistently outperform Recurrent
Neural Networks in terms of accuracy and training time [1]. While FPGA-based inference accelerators for classic CNNs are widespread, the literature lacks a quantitative evaluation of their usability for TCN inference. In
this paper we present such an evaluation, considering a CNN accelerator with
specific features supporting TCN kernels as a reference and a set of
state-of-the-art TCNs as a benchmark. Experimental results show that, during
TCN execution, operational intensity can be critical for the overall
performance. We propose a convolution scheduling based on batch processing that
can boost efficiency up to 96% of the theoretical peak performance. Overall, we achieve up to 111.8 GOPS and a power efficiency of 33.9 GOPS/W on an UltraScale+ ZU3EG (up to a 10x speedup and a 3x power efficiency improvement with respect to a pure software implementation).
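As a reference for the TCN building block discussed above, here is a minimal NumPy sketch of one dilated causal mono-dimensional convolution; it illustrates the kernel such an accelerator executes, not the paper's batch scheduling scheme.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """One dilated causal 1D convolution, the mono-dimensional kernel
    at the heart of a TCN layer. Left-padding keeps the output aligned
    with the input and strictly causal."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth, which is what lets TCNs model long sequences
# with only a few convolutional layers.
y = dilated_causal_conv1d(np.arange(10.0), np.array([0.5, 0.5]), dilation=2)
```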
SqueezeJet: High-level Synthesis Accelerator Design for Deep Convolutional Neural Networks
Deep convolutional neural networks have dominated the pattern recognition
scene by providing much more accurate solutions in computer vision problems
such as object recognition and object detection. Most of these solutions come
at a huge computational cost, requiring billions of multiply-accumulate
operations and, thus, making their use quite challenging in real-time
applications that run on embedded mobile (resource-power constrained) hardware.
This work presents the architecture, the high-level synthesis design, and the
implementation of SqueezeJet, an FPGA accelerator for the inference phase of
the SqueezeNet DCNN architecture, which is designed specifically for use in
embedded systems. Results show that SqueezeJet can achieve 15.16 times speed-up
compared to the software implementation of SqueezeNet running on an embedded
mobile processor, with less than 1% drop in top-5 accuracy.
Comment: The final publication is available at Springer via https://doi.org/10.1007/978-3-319-78890-6_
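For context, the following NumPy sketch outlines the "fire" module that makes up the SqueezeNet architecture SqueezeJet accelerates: a 1x1 squeeze layer followed by parallel 1x1 and 3x3 expand layers whose outputs are concatenated. It is a simplified functional model, not the paper's HLS design.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C, H, W), w: (K, C) -> (K, H, W); a 1x1 convolution is a
    # per-pixel matrix multiply over channels.
    return np.einsum('kc,chw->khw', w, x)

def conv3x3(x, w):
    # x: (C, H, W), w: (K, C, 3, 3); zero padding 1, stride 1.
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for y in range(H):
        for z in range(W):
            out[:, y, z] = (w * xp[:, y:y+3, z:z+3]).sum(axis=(1, 2, 3))
    return out

def fire(x, w_squeeze, w_e1, w_e3):
    """SqueezeNet fire module: a 1x1 'squeeze' layer reduces the channel
    count, then parallel 1x1 and 3x3 'expand' layers are concatenated
    along the channel axis."""
    s = np.maximum(conv1x1(x, w_squeeze), 0)     # squeeze + ReLU
    e1 = np.maximum(conv1x1(s, w_e1), 0)         # expand 1x1 + ReLU
    e3 = np.maximum(conv3x3(s, w_e3), 0)         # expand 3x3 + ReLU
    return np.concatenate([e1, e3], axis=0)

# Example: 8-channel input squeezed to 2 channels, expanded to 4 + 4.
x = np.random.rand(8, 6, 6)
y = fire(x, np.random.rand(2, 8), np.random.rand(4, 2), np.random.rand(4, 2, 3, 3))
```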
Recent Advances in Efficient Computation of Deep Convolutional Neural Networks
Deep neural networks have evolved remarkably over the past few years and they
are currently the fundamental tools of many intelligent systems. At the same
time, the computational complexity and resource consumption of these networks
also continue to increase. This will pose a significant challenge to the
deployment of such networks, especially in real-time applications or on
resource-limited devices. Thus, network acceleration has become a hot topic
within the deep learning community. As for hardware implementations of deep neural networks, a number of FPGA/ASIC-based accelerators have been proposed in recent years. In this paper, we provide a comprehensive survey of recent
advances in network acceleration, compression and accelerator design from both
algorithm and hardware points of view. Specifically, we provide a thorough
analysis of each of the following topics: network pruning, low-rank
approximation, network quantization, teacher-student networks, compact network
design and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
Comment: 14 pages, 3 figures.
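To give a flavor of two of the surveyed compression techniques, network pruning and network quantization, here is a minimal NumPy sketch of magnitude-based pruning and uniform symmetric 8-bit quantization; the specifics (per-tensor scale, 90% sparsity) are illustrative assumptions, not the survey's prescriptions.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Network pruning (one surveyed technique): zero out the
    smallest-magnitude weights, keeping the top (1 - sparsity)."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize_int8(w):
    """Uniform symmetric 8-bit quantization (another surveyed
    technique): map floats to int8 with a per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize with q * scale

w = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int8(magnitude_prune(w))
```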
DeCoILFNet: Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator
Convolutional Neural Networks (CNNs) are rapidly gaining popularity in a variety of fields. Due to their increasingly deep and computationally heavy structures, it is difficult to deploy them in energy-constrained mobile applications. Hardware accelerators such as FPGAs have emerged as an attractive alternative. However, with the limited on-chip memory and compute resources of FPGAs, meeting the high memory throughput requirements and exploiting the parallelism of CNNs is a major challenge. We propose DeCoILFNet, a high-performance FPGA-based architecture (Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator)
which exploits the intra-layer parallelism of CNNs by flattening across depth
and combines it with a highly pipelined data flow across the layers enabling
inter-layer fusion. This architecture significantly reduces off-chip memory
accesses and maximizes throughput. Compared to a Caffe implementation on a 3.5 GHz hexa-core Intel Xeon E7, our 120 MHz FPGA accelerator is 30X faster. In addition, our design reduces external memory accesses by 11.5X, along with a speedup of more than 2X in clock cycles, compared to state-of-the-art FPGA accelerators.
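The following toy 1D Python sketch illustrates the inter-layer fusion idea: each output tile of the second layer is computed directly from a haloed input tile, so the intermediate feature map never needs to leave on-chip buffers. It is a simplified illustration under those assumptions, not DeCoILFNet's actual pipeline.

```python
import numpy as np

def conv3(x, w):
    # 'Valid' 1D convolution with a 3-tap kernel.
    return np.array([w @ x[i:i+3] for i in range(len(x) - 2)])

def fused_two_layers(x, w1, w2, tile=16):
    """Toy inter-layer fusion on 1D data: each output tile of layer 2
    is produced straight from the corresponding (haloed) input tile,
    so the layer-1 intermediate stays in on-chip buffers. The 4-element
    halo covers both 3-tap kernels."""
    n_out = len(x) - 4                      # two 'valid' 3-tap convs
    out = np.empty(n_out)
    for o in range(0, n_out, tile):
        hi = min(o + tile, n_out)
        x_tile = x[o:hi + 4]                # input tile plus halo
        out[o:hi] = conv3(conv3(x_tile, w1), w2)
    return out

x = np.random.default_rng(2).normal(size=128)
w1 = np.array([1.0, 2.0, 1.0]); w2 = np.array([-1.0, 0.0, 1.0])
# Fused tile-by-tile execution matches the unfused two-pass result.
assert np.allclose(fused_two_layers(x, w1, w2), conv3(conv3(x, w1), w2))
```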