Convolutional Neural Networks with Layer Reuse
A convolutional layer in a Convolutional Neural Network (CNN) consists of
many filters that apply a convolution operation to the input, capture particular
patterns, and pass the result to the next layer. If the same patterns also occur
at deeper layers of the network, why not apply the same convolutional filters in
those layers as well? In this paper, we propose a CNN architecture, Layer Reuse
Network (LruNet), in which convolutional layers are used repeatedly, without
introducing new layers, to obtain better performance. This approach offers
several advantages: (i) a considerable number of parameters is saved, since
layers are reused rather than added; (ii) the Memory Access Cost (MAC) can be
reduced, since the parameters of a reused layer need to be fetched only once;
(iii) the number of nonlinearities increases with layer reuse; and (iv) reused
layers receive gradient updates from multiple parts of the network. The proposed
approach is evaluated on the CIFAR-10, CIFAR-100 and Fashion-MNIST datasets for
the image classification task, where layer reuse improves performance by 5.14%,
5.85% and 2.29%, respectively. The source code and pretrained models are
publicly available.
Comment: Computer Vision and Pattern Recognition
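As a rough illustration of the idea (a minimal PyTorch sketch under our own simplifying assumptions, not the authors' exact LruNet blocks or reuse schedule):

    import torch.nn as nn

    class LayerReuseNet(nn.Module):
        # Minimal layer-reuse sketch: one shared conv block applied
        # repeatedly instead of stacking new layers (hypothetical
        # simplification of LruNet).
        def __init__(self, channels=32, num_reuses=4, num_classes=10):
            super().__init__()
            self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
            # The shared block's weights receive gradient updates from
            # every point in the network where the block is reused.
            self.shared_block = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            self.num_reuses = num_reuses
            self.head = nn.Linear(channels, num_classes)

        def forward(self, x):
            x = self.stem(x)
            for _ in range(self.num_reuses):  # reuse, not new layers
                x = self.shared_block(x)
            x = x.mean(dim=(2, 3))            # global average pooling
            return self.head(x)

Note that the parameter count is independent of num_reuses, while the number of nonlinearities grows with it.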
Mini-batch Serialization: CNN Training with Inter-layer Data Reuse
Training convolutional neural networks (CNNs) requires intense computations
and high memory bandwidth. We find that bandwidth today is over-provisioned
because most memory accesses in CNN training can be eliminated by rearranging
computation to better utilize on-chip buffers and avoid traffic resulting from
large per-layer memory footprints. We introduce the MBS CNN training approach
that significantly reduces memory traffic by partially serializing mini-batch
processing across groups of layers. This optimizes reuse within on-chip buffers
and balances both intra-layer and inter-layer reuse. We also introduce the
WaveCore CNN training accelerator that effectively trains CNNs in the MBS
approach with high functional-unit utilization. Combined, WaveCore and MBS
reduce DRAM traffic by 75%, improve performance by 53%, and save 26% system
energy for modern deep CNN training compared to conventional training
mechanisms and accelerators.
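A minimal software sketch of the serialization idea (our simplification in PyTorch-style Python; the paper's scheduler and the WaveCore mapping are considerably more involved):

    import torch

    def mbs_forward(layers, minibatch, group_size=2, sub_batch=8):
        # Sketch of mini-batch serialization (MBS): each sub-batch
        # traverses a whole *group* of layers before the next sub-batch
        # starts, so only one sub-batch's activations are live between
        # layers at a time, keeping the working set in on-chip buffers.
        groups = [layers[i:i + group_size]
                  for i in range(0, len(layers), group_size)]
        acts = minibatch
        for group in groups:
            outputs = []
            for i in range(0, acts.shape[0], sub_batch):
                a = acts[i:i + sub_batch]   # small per-layer footprint
                for layer in group:         # inter-layer reuse
                    a = layer(a)
                outputs.append(a)
            acts = torch.cat(outputs, dim=0)
        return acts

Choosing group_size and sub_batch trades intra-layer reuse (larger sub-batches) against inter-layer reuse (longer groups), which is the balance the abstract refers to.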
Reconciling Feature-Reuse and Overfitting in DenseNet with Specialized Dropout
Recently, convolutional neural networks (CNNs) have achieved great accuracy in
visual recognition tasks. DenseNet has become one of the most popular CNN models
due to its effective feature reuse. However, like other CNN models, DenseNets
also face the overfitting problem, if not a more severe one. Existing dropout
methods can be applied, but they are less effective because of the nonlinear
connections the architecture introduces. In particular, the feature-reuse
property of DenseNet is impeded, and the dropout effect is weakened by the
spatial correlation inside feature maps. To address these problems, we craft the
design of a specialized dropout method from three aspects: dropout location,
dropout granularity, and dropout probability. The insights attained here could
potentially be applied as a general approach for boosting the accuracy of other
CNN models with similar nonlinear connections. Experimental results show that
DenseNets with our specialized dropout method yield better accuracy than vanilla
DenseNet and state-of-the-art CNN models, and this accuracy boost increases with
model depth.
Comment: 10 pages, 5 figures
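For concreteness, the three aspects could combine into something like the following hypothetical dense layer (our illustration; the paper's final design choices may differ):

    import torch
    import torch.nn as nn

    class DenseLayerWithSpecializedDropout(nn.Module):
        # Illustrates the three design aspects named in the abstract:
        # location (before the convolution, so concatenated features
        # reused downstream stay intact), granularity (whole feature
        # maps, countering spatial correlation), and probability
        # (passed per layer, e.g. varied with depth).
        def __init__(self, in_channels, growth_rate, drop_prob):
            super().__init__()
            self.dropout = nn.Dropout2d(p=drop_prob)  # channel-wise
            self.bn = nn.BatchNorm2d(in_channels)
            self.conv = nn.Conv2d(in_channels, growth_rate,
                                  kernel_size=3, padding=1, bias=False)

        def forward(self, x):
            out = self.dropout(x)                 # dropout before conv
            out = self.conv(torch.relu(self.bn(out)))
            return torch.cat([x, out], dim=1)     # DenseNet feature reuse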
Strategies for Conceptual Change in Convolutional Neural Networks
A remarkable feature of human beings is their capacity for creative
behaviour: the ability to react to problems in ways that are novel,
surprising, and useful. Transformational creativity is a form of
creativity where the creative behaviour is induced by a transformation of the
actor's conceptual space, that is, the representational system with which the
actor interprets its environment. In this report, we focus on ways of adapting
systems of learned representations as they switch from performing one task to
performing another. We describe an experimental comparison of multiple
strategies for adaptation of learned features, and evaluate how effectively
each of these strategies realizes the adaptation, in terms of the amount of
training, and in terms of their ability to cope with restricted availability of
training data. We show, among other things, that across handwritten digits,
natural images, and classical music, adaptive strategies are systematically
more effective than a baseline method that starts learning from scratch.
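In deep-learning-framework terms, the kinds of strategies such a comparison contrasts might be set up as follows (a PyTorch sketch; the strategy names and the features/classifier split are our assumptions, not the report's):

    from torch.optim import SGD

    def build_optimizer(model, strategy, lr=0.01):
        # Assumes `model` has `features` and `classifier` submodules.
        if strategy == "scratch":
            # Baseline: discard the learned conceptual space entirely.
            for m in model.modules():
                if hasattr(m, "reset_parameters"):
                    m.reset_parameters()
            params = model.parameters()
        elif strategy == "freeze":
            # Reuse learned representations as-is; train only the head.
            for p in model.features.parameters():
                p.requires_grad = False
            params = model.classifier.parameters()
        elif strategy == "fine_tune":
            # Start from learned features and adapt every weight.
            params = model.parameters()
        else:
            raise ValueError(strategy)
        return SGD((p for p in params if p.requires_grad), lr=lr)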
BENCHIP: Benchmarking Intelligence Processors
The increasing attention to deep learning has tremendously spurred the design
of intelligence processing hardware. The variety of emerging intelligence
processors requires standard benchmarks for fair comparison and system
optimization (in both software and hardware). However, existing benchmarks are
unsuitable for benchmarking intelligence processors because they lack diversity
and representativeness. Moreover, the absence of a standard benchmarking
methodology further exacerbates this problem. In this paper, we propose
BENCHIP, a benchmark suite and benchmarking methodology for intelligence
processors. The benchmark suite in BENCHIP consists of two sets of benchmarks:
microbenchmarks and macrobenchmarks. The microbenchmarks consist of
single-layer networks. They are mainly designed for bottleneck analysis and
system optimization. The macrobenchmarks contain state-of-the-art industrial
networks, so as to offer a realistic comparison of different platforms. We also
propose a standard benchmarking methodology built upon an industrial software
stack and evaluation metrics that comprehensively reflect the various
characteristics of the evaluated intelligence processors. BENCHIP is utilized
for evaluating various hardware platforms, including CPUs, GPUs, and
accelerators. BENCHIP will be open-sourced soon.
Comment: 37 pages, 14 figures
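For intuition, a single-layer microbenchmark of the kind the suite contains can be as simple as the timing loop below (a sketch only; the actual BENCHIP methodology runs on an industrial software stack with broader metrics):

    import time
    import torch
    import torch.nn as nn

    def run_microbenchmark(layer, input_shape, iters=100):
        # Time one single-layer network, as used for bottleneck
        # analysis and system optimization.
        x = torch.randn(*input_shape)
        layer = layer.eval()
        with torch.no_grad():
            for _ in range(10):                  # warm-up
                layer(x)
            start = time.perf_counter()
            for _ in range(iters):
                layer(x)
            return (time.perf_counter() - start) / iters

    # Example: a lone convolutional layer as a microbenchmark.
    latency = run_microbenchmark(nn.Conv2d(64, 64, 3, padding=1),
                                 input_shape=(1, 64, 56, 56))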
MPNA: A Massively-Parallel Neural Array Accelerator with Dataflow Optimization for Convolutional Neural Networks
The state-of-the-art accelerators for Convolutional Neural Networks (CNNs)
typically focus on accelerating only the convolutional layers and give little
attention to the fully-connected layers. Hence, they lack a synergistic
optimization of the hardware architecture and diverse dataflows for the
complete CNN design, which can provide a higher potential for
performance/energy efficiency. Towards this, we propose a novel
Massively-Parallel Neural Array (MPNA) accelerator that integrates two
heterogeneous systolic arrays and respective highly-optimized dataflow patterns
to jointly accelerate both the convolutional (CONV) and the fully-connected
(FC) layers. Besides fully exploiting the available off-chip memory bandwidth,
these optimized dataflows enable high data-reuse of all the data types (i.e.,
weights, input and output activations), and thereby enable our MPNA to achieve
high energy savings. We synthesized our MPNA architecture using the ASIC design
flow for a 28 nm technology, and performed functional and timing validation
using multiple real-world complex CNNs. MPNA achieves 149.7 GOPS/W at 280 MHz
while consuming 239 mW. Experimental results show that our MPNA architecture
provides a 1.7x overall performance improvement over a state-of-the-art
accelerator, and 51% energy savings over the baseline architecture.
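The data-reuse argument can be seen in a toy software model of an output-stationary dataflow, the kind of schedule a systolic array implements (a sketch under our assumptions; MPNA's two heterogeneous arrays and dataflow patterns are more elaborate):

    import numpy as np

    def output_stationary_matmul(W, X, tile=4):
        # Each output tile stays resident ("stationary") while weights
        # and activations stream past it, so every fetched operand is
        # reused across the whole tile instead of refetched from DRAM.
        M, K = W.shape
        _, N = X.shape
        Y = np.zeros((M, N))
        for m0 in range(0, M, tile):
            for n0 in range(0, N, tile):
                acc = np.zeros((min(tile, M - m0), min(tile, N - n0)))
                for k in range(K):
                    w = W[m0:m0 + acc.shape[0], k]   # fetched once per k
                    x = X[k, n0:n0 + acc.shape[1]]   # shared across rows
                    acc += np.outer(w, x)            # rank-1 update
                Y[m0:m0 + acc.shape[0], n0:n0 + acc.shape[1]] = acc
        return Y

A fully-connected layer is exactly such a matrix product, which is why a dataflow tuned for it differs from one tuned for convolutions with their sliding-window reuse.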
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies, and availability of credible
data, an area of artificial intelligence, deep learning, has emerged, and has
demonstrated its ability and effectiveness in solving complex learning problems
not possible before. In particular, convolutional neural networks (CNNs) have
demonstrated their effectiveness in image detection and recognition
applications. However, they require intensive computation and memory bandwidth,
which prevents general-purpose CPUs from achieving the desired performance
levels.
Consequently, hardware accelerators that use application-specific integrated
circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent existing techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNN acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to guide future advances in
efficient hardware accelerators and to be useful for deep learning researchers.
Comment: This article has been accepted for publication in IEEE Access
(December 2018)
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Deep neural networks (DNNs) are currently widely used for many artificial
intelligence (AI) applications including computer vision, speech recognition,
and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks,
this accuracy comes at the cost of high computational complexity. Accordingly, techniques
that enable efficient processing of DNNs to improve energy efficiency and
throughput without sacrificing application accuracy or increasing hardware cost
are critical to the wide deployment of DNNs in AI systems.
This article aims to provide a comprehensive tutorial and survey about the
recent advances towards the goal of enabling efficient processing of DNNs.
Specifically, it will provide an overview of DNNs, discuss various hardware
platforms and architectures that support DNNs, and highlight key trends in
reducing the computation cost of DNNs either solely via hardware design changes
or via joint hardware design and DNN algorithm changes. It will also summarize
various development resources that enable researchers and practitioners to
quickly get started in this field, and highlight important benchmarking metrics
and design considerations that should be used for evaluating the rapidly
growing number of DNN hardware designs, optionally including algorithmic
co-designs, being proposed in academia and industry.
The reader will take away the following concepts from this article:
understand the key design considerations for DNNs; be able to evaluate
different DNN hardware implementations with benchmarks and comparison metrics;
understand the trade-offs between various hardware architectures and platforms;
be able to evaluate the utility of various DNN design techniques for efficient
processing; and understand recent implementation trends and opportunities.
Comment: Based on tutorial on DNN Hardware at eyeriss.mit.edu/tutorial.htm
Morph: Flexible Acceleration for 3D CNN-based Video Understanding
The past several years have seen both an explosion in the use of
Convolutional Neural Networks (CNNs) and the design of accelerators to make CNN
inference practical. In the architecture community, the lion's share of effort
has targeted CNN inference for image recognition. The closely related problem
of video recognition has received far less attention as an accelerator target.
This is surprising, as video recognition is more computationally intensive than
image recognition, and video traffic is predicted to be the majority of
internet traffic in the coming years.
This paper fills the gap between algorithmic and hardware advances for video
recognition by providing a design space exploration and flexible architecture
for accelerating 3D Convolutional Neural Networks (3D CNNs) - the core kernel
in modern video understanding. When compared to (2D) CNNs used for image
recognition, efficiently accelerating 3D CNNs poses a significant engineering
challenge due to their large (and variable over time) memory footprint and
higher dimensionality.
To address these challenges, we design a novel accelerator, called Morph,
that can adaptively support different spatial and temporal tiling strategies
depending on the needs of each layer of each target 3D CNN. We codesign a
software infrastructure alongside the Morph hardware to find good-fit
parameters to control the hardware. Evaluated on state-of-the-art 3D CNNs,
Morph achieves up to 3.4x (2.5x average) reduction in energy consumption and
improves performance/watt by up to 5.1x (4x average) compared to a baseline 3D
CNN accelerator, with an area overhead of 5%. Morph further achieves a 15.9x
average energy reduction on 3D CNNs when compared to Eyeriss.
Comment: Appears in the proceedings of the 51st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), 2018
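A toy version of the per-layer tiling search that such a codesigned software layer performs might look like this (the buffer model, element size, and exhaustive search are our assumptions, not Morph's actual parameter finder):

    from itertools import product

    def pick_tiling(layer_shape, buffer_bytes, elem_bytes=2):
        # layer_shape = (C, T, H, W): channels, frames, height, width
        # of a 3D CNN layer's input activations. Returns the temporal/
        # spatial tile with the largest working set that still fits.
        C, T, H, W = layer_shape
        best, best_size = None, 0
        for t, h, w in product(range(1, T + 1), range(1, H + 1),
                               range(1, W + 1)):
            size = C * t * h * w * elem_bytes   # bytes per input tile
            if best_size < size <= buffer_bytes:
                best, best_size = (t, h, w), size
        return best

    # Layers with different (and time-varying) footprints end up with
    # different tilings, which is what the flexible hardware supports.
    print(pick_tiling((64, 16, 56, 56), buffer_bytes=1 << 20))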
Systimator: A Design Space Exploration Methodology for Systolic Array based CNNs Acceleration on the FPGA-based Edge Nodes
The evolution of IoT-based smart applications demands porting artificial
intelligence algorithms to edge computing devices. CNNs form a large part
of these AI algorithms. Systolic array based CNN acceleration is being widely
advocated due to its ability to enable scalable architectures. However, CNNs are
inherently memory and compute intensive algorithms, and hence pose significant
challenges to be implemented on the resource-constrained edge computing
devices. Memory-constrained, low-cost FPGA-based devices form a substantial
fraction of these edge computing devices. Thus, when porting to such
edge-computing devices, the designer is left unguided as to how to select a
suitable systolic array configuration that could fit in the available hardware
resources. In this paper we propose Systimator, a design space exploration
based methodology that provides a set of design points that can be mapped
within the memory bounds of the target FPGA device. The methodology is based
upon an analytical model that is formulated to estimate the required resources
for systolic arrays, assuming multiple data reuse patterns. The methodology
further provides the performance estimates for each of the candidate design
points. We show that Systimator provides an in-depth analysis of
resource-requirement of systolic array based CNNs. We provide our resource
estimation results for porting the convolutional layers of TINY YOLO, a CNN
based object detector, on a Xilinx ARTIX 7 FPGA.
Comment: 5 pages, 3 figures, work in progress
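A simplified analytical model of the kind the methodology is built on might look as follows (the cost formulas and the device budget are illustrative assumptions, not Systimator's exact formulation):

    def estimate_resources(rows, cols, word_bits=16, bram_bits=18 * 1024,
                           ifmap_depth=512, weight_depth=512):
        # Estimate DSP and BRAM needs of a rows x cols systolic array
        # so candidate design points can be checked against the target
        # FPGA's limits before implementation.
        dsps = rows * cols                             # one MAC per PE
        ifmap_bits = rows * ifmap_depth * word_bits    # row buffers
        weight_bits = cols * weight_depth * word_bits  # column buffers
        brams = -(-(ifmap_bits + weight_bits) // bram_bits)  # ceil
        return {"dsp": dsps, "bram_18k": brams}

    # Keep only configurations that fit a hypothetical device budget.
    budget = {"dsp": 740, "bram_18k": 540}
    fits = [(r, c) for r in range(4, 65, 4) for c in range(4, 65, 4)
            if all(estimate_resources(r, c)[k] <= budget[k]
                   for k in budget)]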